RTX 5000 Series-Ready llama-cpp-python Wheel (Python 3.12, Windows)

Status: ✅ CONFIRMED WORKING – no more "invalid resource handle" errors
Wheel: llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
License: MIT (same as upstream llama-cpp-python)

Platform: Windows 10/11 x64
Python: 3.12
CUDA: 12.8 (optimized for Blackwell)


🚀 Performance (Verified on RTX 5090)

  • ~64 tokens/sec on Mistral Small 24B (5-bit quant)
  • Full GPU offload (n_gpu_layers = -1) working as expected
  • ~1.83× faster than an RTX 3090 in the same setup (35 tok/s → 64 tok/s)
  • 32 GB VRAM fully utilized (no kernel crashes)

Note: numbers vary with quantization, context size, and generation parameters; these figures are representative.
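
If you want to reproduce a rough tokens/sec figure on your own hardware, the sketch below times a single completion and divides by the number of generated tokens (the model path and prompt are placeholders; the result includes prompt processing, so it will read slightly below pure generation speed):

import time
from llama_cpp import Llama

# Placeholder path: point this at your own GGUF model.
llm = Llama(model_path="your_model.gguf", n_gpu_layers=-1, n_ctx=2048, verbose=False)

start = time.perf_counter()
out = llm("Write a short story about a GPU.", max_tokens=256)
elapsed = time.perf_counter() - start

# The completion response reports how many tokens were actually generated.
generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")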


🔧 Why This Works

The wheel forces cuBLAS instead of ggml's custom CUDA kernels.
On an RTX 5090 (Blackwell, sm_120), ggml's custom kernels can trigger "CUDA error: invalid resource handle".

cuBLAS is stable on 5090 and avoids those kernel issues.

Key CMake flags used:

  -DGGML_CUDA=ON
  -DGGML_CUDA_FORCE_CUBLAS=1            # Use cuBLAS instead of custom kernels
  -DGGML_CUDA_NO_PINNED=1               # Avoid pinned memory issues with GDDR7
  -DGGML_CUDA_F16=0                     # Disable problematic FP16 code paths
  -DCMAKE_CUDA_ARCHITECTURES=all-major  # Ensure sm_120 is included


📋 Requirements

  • NVIDIA RTX 5090 (or other Blackwell GPU)
  • NVIDIA drivers 570.86.10+
  • CUDA Toolkit 12.8
  • Python 3.12
  • Windows 10/11 x64
  • Microsoft Visual C++ Redistributable 2015–2022
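
To confirm the toolchain is in place, these standard commands report the CUDA toolkit, Python, and driver/GPU versions (run from a command prompt):

nvcc --version
python --version
nvidia-smi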

๐Ÿ› ๏ธ Installation

  1. Download the wheel:
    llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl

  2. Install:
    pip install llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl
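
  3. (Optional) Confirm the wheel landed in the right environment; a minimal check is to print the package version, which should match the wheel filename:

    import llama_cpp
    print(llama_cpp.__version__)   # expect "0.3.16"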


✅ Quick Verification

from llama_cpp import Llama

# Full GPU offload on 5090
llm = Llama(
    model_path="your_model.gguf",
    n_gpu_layers=-1,   # full GPU
    n_ctx=2048,
    verbose=True
)

out = llm("Hello, how are you?", max_tokens=20)
print(out["choices"][0]["text"])

What to look for in the verbose load output:

  • CUDA device assignment lines (e.g., using CUDA:0)
  • VRAM allocations without any "invalid resource handle" errors

๐Ÿ—๏ธ Build It Yourself (Advanced)

Prereqs: CUDA 12.8, Visual Studio Build Tools 2022 (with C++), Python 3.12

mkdir C:\wheels
cd C:\wheels

set FORCE_CMAKE=1
set CMAKE_BUILD_PARALLEL_LEVEL=15
set CMAKE_ARGS=-DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=1 -DGGML_CUDA_NO_PINNED=1 -DGGML_CUDA_F16=0 -DCMAKE_CUDA_ARCHITECTURES=all-major

pip wheel llama-cpp-python --no-cache-dir --wheel-dir C:\wheels --verbose

Build time: ~10 minutes on a modern CPU
Wheel size: ~231 MB (larger due to cuBLAS inclusion)
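
Once the build finishes, install the freshly built wheel over any existing copy (the filename below assumes the 0.3.16 / Python 3.12 build described here; adjust it to whatever pip actually produced in C:\wheels):

pip install C:\wheels\llama_cpp_python-0.3.16-cp312-cp312-win_amd64.whl --force-reinstall --no-deps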


๐Ÿ› Troubleshooting

"Invalid resource handle" errors

  • This wheel was built specifically to avoid these errors. If you still see them, verify:
    • CUDA 12.8 is installed
    • Latest NVIDIA drivers are installed
    • No other CUDA apps are interfering

CPU fallback

  • If the GPU isn't detected, check nvidia-smi and make sure CUDA_VISIBLE_DEVICES isn't set.
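
As a quick Python-side check, the sketch below asks the library whether it was compiled with GPU offload support and shows whether CUDA_VISIBLE_DEVICES is hiding the card (it relies on the low-level llama_supports_gpu_offload binding, which recent llama-cpp-python releases expose from upstream llama.cpp):

import os
import llama_cpp

# True only if the library was built with a GPU backend (CUDA for this wheel).
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
# If this prints anything other than "<not set>", make sure your 5090 is included.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))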

๐Ÿ™ Credits

Built using the open-source llama-cpp-python project by abetlen and the llama.cpp project by ggml-org.
This wheel provides RTX 5090 compatibility by configuring cuBLAS fallback; it is not an official upstream release.

  • For issues with this specific wheel: open an issue here (this repo/thread).
  • For general llama-cpp-python issues: use the official repository.

Finally, RTX 5000 series owners can use their flagship GPUs for local LLM inference without crashes! 🎉
