RTX 5000 Series-Ready llama-cpp-python Wheel (Python 3.12, Windows)
Status: CONFIRMED WORKING - no more "invalid resource handle" errors
Wheel: llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
License: MIT (same as upstream llama-cpp-python)
Platform: Windows 10/11 x64
Python: 3.12
CUDA: 12.8 (optimized for Blackwell)
Performance (Verified on RTX 5090)
- ~64 tokens/sec on Mistral Small 24B (5-bit quant)
- Full GPU offload (n_gpu_layers = -1) working as expected
- ~1.83× faster than RTX 3090 in the same setup (35 tok/s → 64 tok/s)
- 32 GB VRAM fully utilized (no kernel crashes)
Notes: numbers vary with quant, context, and params; these are representative. A minimal timing sketch to reproduce a rough figure follows.
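If you want to measure tokens/sec yourself, the snippet below is a minimal sketch: the model path and prompt are placeholders, and the end-to-end time includes prompt processing, so treat the result as approximate.

# Minimal timing sketch (approximate); model path and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="your_model.gguf",  # placeholder GGUF model
    n_gpu_layers=-1,               # full GPU offload
    n_ctx=2048,
    verbose=False,
)

start = time.perf_counter()
out = llm("Write a short paragraph about GPUs.", max_tokens=256)
elapsed = time.perf_counter() - start

completion_tokens = out["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"(~{completion_tokens / elapsed:.1f} tok/s, including prompt eval)")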
Why This Works
The wheel forces cuBLAS instead of ggml's custom CUDA kernels.
On RTX 5090 (Blackwell, sm_120), ggml's custom kernels can trigger:
"CUDA error: invalid resource handle".
cuBLAS is stable on 5090 and avoids those kernel issues.
Key CMake flags used:
-DGGML_CUDA=ON
-DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
-DGGML_CUDA_NO_PINNED=1               # Avoid pinned memory issues with GDDR7
-DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
-DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included
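As a quick sanity check that the installed wheel was really built with CUDA support (and not a CPU-only fallback), you can query the low-level binding. This assumes llama_cpp.llama_supports_gpu_offload() is exposed by your installed version, as it is in recent releases.

# Sanity check: confirm this build was compiled with GPU offload support.
# Assumes llama_cpp.llama_supports_gpu_offload() is exposed (recent releases).
import llama_cpp

print("GPU offload supported:", bool(llama_cpp.llama_supports_gpu_offload()))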
Requirements
- NVIDIA RTX 5090 (or other Blackwell GPU)
- NVIDIA drivers 570.86.10+
- CUDA Toolkit 12.8
- Python 3.12
- Windows 10/11 x64
- Microsoft Visual C++ Redistributable 2015โ2022
Installation
Download the wheel:
llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl

Install:
pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
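A quick post-install smoke test (it only confirms the wheel imports and reports the expected version; it does not load a model yet):

# Post-install smoke test: the import should succeed and report 0.3.16.
import llama_cpp
print(llama_cpp.__version__)  # expected: 0.3.16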
Quick Verification
from llama_cpp import Llama
# Full GPU offload on 5090
llm = Llama(
model_path="your_model.gguf",
n_gpu_layers=-1, # full GPU
n_ctx=2048,
verbose=True
)
out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])
What to look for in the console output:
- CUDA device assignment lines (e.g., using CUDA:0)
- VRAM allocations without any โinvalid resource handleโ errors
Build It Yourself (Advanced)
Prereqs: CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12
mkdir C:\wheels
cd C:\wheels
set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major
pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose
Build time: ~10 minutes on a modern CPU
Wheel size: ~231 MB (larger due to cuBLAS inclusion)
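If you prefer to script the build, a rough Python equivalent of the commands above is sketched below; it assumes pip and the CUDA/Visual Studio Build Tools prerequisites are already installed, and C:\wheels is just an example output directory.

# Scripted version of the build steps above (a sketch; C:\wheels is an example).
import os
import subprocess
import sys

env = dict(os.environ)
env["FORCE_CMAKE"] = "1"
env["CMAKE_BUILD_PARALLEL_LEVEL"] = "15"
env["CMAKE_ARGS"] = (
    "-DGGML_CUDA=ON "
    "-DGGML_CUDA_FORCE_CUBLAS=1 "
    "-DGGML_CUDA_NO_PINNED=1 "
    "-DGGML_CUDA_F16=0 "
    "-DCMAKE_CUDA_ARCHITECTURES=all-major"
)

subprocess.run(
    [sys.executable, "-m", "pip", "wheel", "llama-cpp-python",
     "--no-cache-dir", "--wheel-dir", r"C:\wheels", "--verbose"],
    env=env,
    check=True,
)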
Troubleshooting
"Invalid resource handle" errors
- This wheel specifically fixes that error. If you still see it, verify:
- CUDA 12.8 is installed
- Latest NVIDIA drivers are installed
- No other CUDA apps are interfering
CPU fallback
- If the GPU isn't detected, check nvidia-smi and ensure CUDA_VISIBLE_DEVICES isn't set (a quick check for both is sketched below).
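A small helper for that check; it assumes nvidia-smi is on PATH, which the NVIDIA driver installer normally handles.

# Environment check for the CPU-fallback case.
# Assumes nvidia-smi is on PATH (installed with the NVIDIA driver).
import os
import subprocess

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
subprocess.run(["nvidia-smi", "-L"], check=False)  # should list the RTX 5090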
Credits
Built using the open-source llama-cpp-python project by abetlen and the llama.cpp project by ggml-org.
This wheel provides RTX 5090 compatibility by configuring cuBLAS fallback; it is not an official upstream release.
- For issues with this specific wheel: open an issue here (this repo/thread).
- For general llama-cpp-python issues: use the official repository.
Finally, RTX 5000 series owners can use their flagship GPU for local LLM inference without crashes!