
FACT: Composable CUDA Kernel Synthesis via a Three-Stage Agent Workflow

The FACT framework uses a three-stage workflow of pattern discovery, pattern implementation, and pattern composition, leveraging LLM agents to automatically convert PyTorch modules into optimized CUTLASS kernels, achieving a 2.79x end-to-end speedup on MiniGPT blocks.

Tags: CUDA kernel synthesis, CUTLASS, LLM agent, GPU optimization, kernel fusion, PyTorch, auto-tuning, deep learning compiler
Published 2026-04-29 21:29 · Recent activity 2026-04-30 10:53 · Estimated read 5 min

Section 01

Introduction to the FACT Framework

FACT (Framework for Agentic CUTLASS Transpilation) uses a three-stage agent workflow of pattern discovery, pattern implementation, and pattern composition to guide LLMs in reusing existing CUTLASS components for compositional optimization. It automatically converts PyTorch modules into optimized CUTLASS kernels and achieves a 2.79x end-to-end speedup on MiniGPT blocks. The framework targets two gaps: the limited coverage of deep learning compiler optimizations and the redundant reinvention that plagues pure LLM code generation.


Section 02

Background and Challenges of Deep Learning Optimization

Modern deep learning frameworks rely on low-level libraries such as cuBLAS and cuDNN, but their optimization coverage is bounded by the catalogs that library engineers write by hand. When an operator combination or an unusual shape falls outside that catalog, developers must either accept suboptimal performance or hand-write CUDA/CUTLASS code, which demands deep GPU expertise. Recent approaches that generate CUDA kernels purely with LLMs tend to repeatedly "rediscover" optimization techniques that mature libraries already implement, which is inefficient and produces insufficiently robust code.


Section 03

Detailed Explanation of FACT's Three-Stage Agent Workflow

FACT's three-stage workflow is as follows:

  1. Pattern Discovery: trace the computation graph of the PyTorch module; LLM agents then match it against predefined optimization rules, query an architecture-specific index library, and output a prioritized list of optimization patterns (illustrated in the first sketch after this list).
  2. Pattern Implementation: generate CUTLASS kernels and wrap them as PyTorch custom operators, covering template instantiation, parameter inference, auto-tuning (searching for the best configuration; see the second sketch below), and correctness verification.
  3. Pattern Composition: combine the independently optimized kernels into a complete module, preserve the data-flow connections, and run end-to-end benchmarks (the first sketch below also shows this splicing step).
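
To make stages 1 and 3 concrete, here is a minimal torch.fx sketch under stated assumptions: Block, FusedLinearGELU, and the Linear-then-GELU matching rule are illustrative stand-ins, not FACT's actual rule library or kernels; a real stage 2 would compile a CUTLASS GEMM with a GELU epilogue rather than the eager stand-in used here.

```python
import torch
import torch.fx as fx
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a traced PyTorch block (illustrative, not from the paper).
class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return F.gelu(self.fc(x))

# Hypothetical fused operator; in FACT this would be a PyTorch custom op
# backed by one CUTLASS GEMM whose epilogue applies GELU.
class FusedLinearGELU(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        return F.gelu(self.linear(x))

block = Block()
gm = fx.symbolic_trace(block)

# Stage 1: scan the traced graph for Linear -> GELU chains, a minimal
# stand-in for matching against FACT's rule library.
matches = []
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is F.gelu:
        (src,) = node.args
        if src.op == "call_module" and isinstance(gm.get_submodule(src.target), nn.Linear):
            matches.append((src, node))  # (linear_node, gelu_node)

# Stage 3: splice the fused operator in and erase the original nodes,
# preserving the data-flow connections.
for i, (linear_node, gelu_node) in enumerate(matches):
    name = f"fused_{i}"
    gm.add_submodule(name, FusedLinearGELU(gm.get_submodule(linear_node.target)))
    with gm.graph.inserting_after(gelu_node):
        fused = gm.graph.call_module(name, args=linear_node.args)
    gelu_node.replace_all_uses_with(fused)
    gm.graph.erase_node(gelu_node)
    gm.graph.erase_node(linear_node)
gm.recompile()

x = torch.randn(2, 64)
assert torch.allclose(gm(x), block(x))  # rewritten graph matches the original
```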
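
Stage 2's auto-tuning step can be pictured as a benchmark-and-verify loop like the sketch below. It assumes the candidate kernels are already compiled Python callables (in FACT, each would be one CUTLASS template instantiation, which this sketch does not build) and that a CUDA device is present; autotune and its parameters are hypothetical names, not the framework's API.

```python
import torch

def autotune(candidates, ref_fn, inputs, atol=1e-2, iters=50):
    """Return the fastest candidate whose output matches the reference.

    candidates: dict mapping a config label to a compiled kernel callable.
    """
    ref = ref_fn(*inputs)
    best_name, best_ms = None, float("inf")
    for name, kernel in candidates.items():
        # Correctness gate: reject misconfigured kernels outright.
        if not torch.allclose(kernel(*inputs), ref, atol=atol):
            continue
        for _ in range(5):  # warm-up
            kernel(*inputs)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            kernel(*inputs)
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        if ms < best_ms:
            best_name, best_ms = name, ms
    return best_name, best_ms

if torch.cuda.is_available():
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    # With a single trivial candidate this just times torch.matmul.
    print(autotune({"matmul": torch.matmul}, torch.matmul, (a, b)))
```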

Section 04

Performance Evaluation and Comparison of FACT

Experimental Results:

  • On an NVIDIA A100, basic GEMM workloads (square matrices, batched GEMMs, and large-K multiplications) achieved a 1.06-1.18x speedup over the cuBLAS baseline.
  • The MiniGPT Transformer block achieved a 2.79x end-to-end speedup by fusing multi-head attention and the MLP's GEMM+GELU (a sketch of this sub-block follows).

Comparison with pure LLM generation: because FACT builds on mature CUTLASS components, correctness rests on verified library code, performance is tuned automatically, the resulting kernels are maintainable, and the barrier to entry for developers is low.
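
For concreteness, below is a plain PyTorch rendering of the kind of Transformer MLP sub-block whose GEMM+GELU the 2.79x result involves fusing; the class name and dimensions are illustrative defaults, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative Transformer MLP sub-block (names and sizes assumed).
class MLP(nn.Module):
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        # Eager PyTorch launches two GEMM kernels plus a separate GELU
        # kernel; a CUTLASS epilogue can fold the GELU into the first
        # GEMM, removing one kernel launch and a round trip to HBM.
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(8, 128, 768)
print(MLP()(x).shape)  # torch.Size([8, 128, 768])
```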

Section 05

Significance, Limitations, and Future Directions of FACT

Significance: FACT lowers the barrier to custom kernel development, accelerates the deployment of new model architectures, and forms a complementary strategy to deep learning compilers rather than replacing them.

Limitations: it depends on CUTLASS and therefore supports only NVIDIA GPUs; the auto-tuning search space for complex fusion patterns is large; and compilation times are long.

Future directions: extend to platforms such as AMD ROCm and Intel oneAPI; introduce ML-guided search strategies to accelerate tuning; and explore online learning for continuous optimization.