arxiv:2512.02551

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Published on Dec 2 · Submitted by Xiaoya Li on Dec 3
AI-generated summary

CUDA-L2, a system combining large language models and reinforcement learning, optimizes Half-precision General Matrix Multiply CUDA kernels, achieving significant speedups over existing baselines in both offline and server modes.

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to NVIDIA's state-of-the-art closed-source libraries, cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN or transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm suggested by its heuristic; and +11.4% over the most competitive baseline, cuBLASLt-AutoTuning, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
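
As a concrete illustration of the offline-mode protocol described above (kernels launched back-to-back with no idle time between calls), here is a minimal timing sketch using PyTorch CUDA events. It is not from the paper: the (M, N, K) shape is just one example configuration, and my_hgemm is a hypothetical placeholder for a CUDA-L2-generated kernel.

```python
# Minimal sketch of "offline mode" timing: back-to-back launches, no gaps,
# averaged over many iterations with CUDA events.
import torch

def bench_offline(fn, a, b, warmup=10, iters=100):
    """Average per-call time in ms for back-to-back launches."""
    for _ in range(warmup):
        fn(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# One (M, N, K) shape out of the ~1,000 configurations the paper sweeps over.
M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

t_ref = bench_offline(torch.matmul, a, b)
print(f"torch.matmul: {t_ref:.3f} ms per call")
# t_new = bench_offline(my_hgemm, a, b)   # my_hgemm: hypothetical optimized kernel
# print(f"speedup vs torch.matmul: {t_ref / t_new - 1:+.1%}")
```

Per the abstract, the reported offline numbers are averages of such per-configuration speedups across roughly 1,000 shapes, computed against each of the four baselines.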

Community

Paper submitter

Introducing CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning.

🔹Using RL, CUDA-L2 optimizes HGEMM kernels across 1,000 MxNxK configurations. It outperforms major matmul baselines to date.
🔹In offline mode, it yields +22.0% over torch.matmul, +19.2% over cuBLAS, +16.8% over cuBLASLt-heuristic, and +11.4% over the most competitive cuBLASLt-AutoTuning.
🔹In server mode, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% for torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning respectively (see the timing sketch after this post).

🧐CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation.
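
The server-mode numbers quoted above come from a different timing protocol: kernels launched at random intervals to simulate real-time inference. A minimal sketch of what such a measurement could look like follows; the idle-gap distribution and bounds are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of "server mode" timing: each matmul arrives after a random
# idle gap and its latency is measured individually with CUDA events.
import random
import time
import torch

def bench_server(fn, a, b, calls=200, max_gap_ms=2.0):
    """Mean per-call latency in ms when launches are separated by random idle gaps."""
    latencies = []
    for _ in range(calls):
        time.sleep(random.uniform(0.0, max_gap_ms) / 1000.0)  # assumed gap distribution
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(a, b)
        end.record()
        torch.cuda.synchronize()
        latencies.append(start.elapsed_time(end))
    return sum(latencies) / len(latencies)

M, N, K = 4096, 4096, 4096
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
print(f"torch.matmul server-mode latency: {bench_server(torch.matmul, a, b):.3f} ms")
```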

