CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Abstract
CUDA-L2, a system combining large language models and reinforcement learning, optimizes Half-precision General Matrix Multiply CUDA kernels, achieving significant speedups over existing baselines in both offline and server modes.
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 MxNxK configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to NVIDIA's state-of-the-art closed-source libraries cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm suggested by its heuristic; and +11.4% over the most competitive baseline, cuBLASLt-AutoTuning, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels such as HGEMM can be improved through LLM-guided RL automation, by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
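The offline/server distinction above can be made concrete with a small benchmarking sketch. The Python snippet below is an illustrative approximation of the two evaluation modes, not the paper's actual harness; the matrix shape, iteration counts, and 1 ms maximum idle gap are placeholder assumptions, and torch.matmul stands in for whichever HGEMM kernel is being measured.

```python
# Illustrative sketch (assumed setup, not the authors' evaluation code) of the
# two timing modes described in the abstract:
#   offline mode: kernels launched back-to-back with no gaps between them
#   server mode:  random idle intervals between launches, mimicking real-time
#                 inference traffic
import random
import time

import torch


def bench_hgemm(M, N, K, mode="offline", iters=200, warmup=20, max_gap_ms=1.0):
    """Return the average FP16 matmul time in milliseconds for one M x N x K shape."""
    a = torch.randn(M, K, dtype=torch.half, device="cuda")
    b = torch.randn(K, N, dtype=torch.half, device="cuda")

    # Warm up so one-time algorithm selection and caching do not skew the result.
    for _ in range(warmup):
        torch.matmul(a, b)
    torch.cuda.synchronize()

    if mode == "offline":
        # Offline mode: time a batch of consecutive launches with one event pair.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            torch.matmul(a, b)
        end.record()
        end.synchronize()
        return start.elapsed_time(end) / iters

    # Server mode: sleep for a random interval before each launch and time
    # every launch individually.
    total_ms = 0.0
    for _ in range(iters):
        time.sleep(random.uniform(0.0, max_gap_ms) / 1000.0)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.matmul(a, b)
        end.record()
        end.synchronize()
        total_ms += start.elapsed_time(end)
    return total_ms / iters


if __name__ == "__main__":
    for mode in ("offline", "server"):
        print(mode, f"{bench_hgemm(4096, 4096, 4096, mode=mode):.3f} ms")
```

In this sketch, the reported speedups would correspond to comparing the per-launch time of a CUDA-L2 kernel against the same measurement for each baseline under the same mode and shape.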
Community
Introducing CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning.
🔹Using RL, CUDA-L2 optimizes HGEMM kernels across 1,000 MxNxK configurations. It outperforms major matmul baselines to date.
🔹In offline mode, it yields +22.0% over torch.matmul, +19.2% over cuBLAS, +16.8% over cuBLASLt-heuristic, and +11.4% over the most competitive cuBLASLt-AutoTuning.
🔹In server mode, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively.
🧐CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit (2025)
- Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch (2025)
- ConCuR: Conciseness Makes State-of-the-Art Kernel Generation (2025)
- PerfDojo: Automated ML Library Generation for Heterogeneous Architectures (2025)
- TritonRL: Training LLMs to Think and Code Triton Without Cheating (2025)
- STARK: Strategic Team of Agents for Refining Kernels (2025)
- From Projection to Prediction: Beyond Logits for Scalable Language Models (2025)