AI-Generated CUDA Kernels Outperform PyTorch?

2025-05-30

Researchers used large language models and a novel branching search strategy to automatically generate pure CUDA-C kernels without relying on libraries like CUTLASS or Triton. Surprisingly, some of these AI-generated kernels outperform even the expert-optimized production kernels in PyTorch, achieving a nearly 2x speedup on Conv2D.

The method has the model reason about optimization strategies in natural language, then uses a branching search to pursue multiple optimization hypotheses in parallel rather than greedily refining a single candidate, which helps it avoid local optima. While FP16 matrix multiplication and Flash Attention performance still lag, this work opens a new frontier in automatic high-performance kernel generation and hints at the potential of AI in compiler optimization.
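The branching search can be pictured as a beam-search-style loop over kernel candidates. The sketch below is a minimal illustration, not the researchers' actual system: `propose_ideas` and `benchmark` are hypothetical stand-ins for the real steps (asking an LLM for a natural-language optimization idea and applying it to produce a new kernel variant, then compiling and timing that variant), here stubbed with strings and a deterministic toy score so the loop structure is runnable.

```python
import random

# Hypothetical stand-in (assumption): in the real system this would query an
# LLM for natural-language optimization ideas (e.g. "use shared-memory tiling",
# "vectorize global loads") and apply each one to yield a new kernel variant.
def propose_ideas(kernel: str, n: int) -> list[str]:
    return [f"{kernel}+idea{i}" for i in range(n)]

# Hypothetical stand-in (assumption): in the real system this would compile
# the CUDA kernel and measure its runtime. Lower score = faster kernel.
def benchmark(kernel: str) -> float:
    rng = random.Random(kernel)  # deterministic toy score per variant
    return rng.random()

def branching_search(seed_kernel: str, rounds: int = 3,
                     beam: int = 4, branch: int = 3) -> str:
    """Keep several candidate kernels alive each round instead of greedily
    refining a single one, so one bad hypothesis cannot trap the search."""
    frontier = [seed_kernel]
    for _ in range(rounds):
        # Branch: expand every surviving candidate with several new ideas.
        children = [c for k in frontier for c in propose_ideas(k, branch)]
        # Prune: benchmark all variants, keep the `beam` fastest ones.
        frontier = sorted(children, key=benchmark)[:beam]
    return min(frontier, key=benchmark)

best = branching_search("naive_conv2d")
```

The key design choice mirrored here is that pruning happens only after all hypotheses in a round are scored, so an idea that looks unpromising at first can still survive if its refinements benchmark well relative to the rest of the frontier.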