Problem with quantization conversion. "SOLVED"
For anyone who has this problem too!
I haven't loaded them yet, but I was having problems converting them to q8 (or anything else) via llama.cpp. It would give me this error:
"INFO:hf-to-gguf:Loading model: gemma-3-27b-it ERROR:hf-to-gguf:Model Gemma3ForConditionalGeneration is not supported"
I updated them around 15 hours ago, but I found I had to do this too:
pip install git+https://github.com/huggingface/[email protected]
After that, I just updated everything else too, including llama.cpp:
pip install --upgrade huggingface-hub
pip install --upgrade datasets huggingface-hub
pip install numpy pandas
pip install --upgrade datasets transformers huggingface-hub
python -m venv venv
venv\Scripts\activate
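If you want to double-check that the right transformers build is active inside the venv before running the converter, a quick sanity check like this works (just a sketch; it only imports the class, it doesn't load the model):

import transformers
from transformers import Gemma3ForConditionalGeneration  # present in the Gemma-3 preview tag of transformers

print(transformers.__version__)  # prints the installed version
# if the import above raises ImportError, the pip install from the preview tag didn't take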
Then it starts the conversion (quantization!), like this, for q8_0:
python convert_hf_to_gguf.py "D:\AI\gemma-3-27b-it" --outfile "C:\ai\llama.cpp\new_model\new.gguf" --outtype q8_0
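If you'd like to smoke-test the resulting GGUF directly from Python, something like this should work, assuming you have the llama-cpp-python package installed (the path is my output file from above; the prompt is just an example):

from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\ai\llama.cpp\new_model\new.gguf",  # the --outfile from the conversion step
    n_ctx=4096,        # context size for the test
    n_gpu_layers=-1,   # offload everything to the GPU if a CUDA build is installed
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short Arduino sketch that blinks the built-in LED."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])

The same check can be done with the llama-cli binary from llama.cpp; this is just a quick way to confirm the file loads and generates.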
Hope it was helpful.
@Renu11, thank you.
Also, I can now confirm that it worked! Right now I'm working with it to write Arduino code. I'm using the model converted to 8-bit (q8), and if I give it correct and clear instructions it does a good job at coding, at least so far in my testing. It does hallucinate a bit on my specific code, I think because its training data for that code is more than a month old: the output wasn't written properly for Arduino but used the Espressif libraries instead.
Also, a big thanks to Google and everybody who was and is involved in this project, and to whoever else is helping everyone for free.
I'm running into issues using BitsAndBytes for quantization. I keep getting this cryptic CUDA error:
import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, quantization_config=nf4_config
).eval()
..... (rest of code)
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
When I don't use BitsAndBytes (I can't actually run the model practically this way, since I only have a single RTX 3090; I just did it for debugging), I get this (presumably also CUDA-related) error:
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, device_map='auto', torch_dtype=torch.bfloat16
).eval()
....
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cutlassF: no kernel found to launch!
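For reference, the elided input preparation in both cases follows the standard Gemma 3 processor flow, roughly like this (a sketch; the prompt here is just a placeholder):

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_dir)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain what this code does."}]},
]

# apply_chat_template tokenizes and returns a dict of tensors ready for generate()
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)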
Libraries:
torch 2.6.0+cu124
transformers 4.55.2
triton-windows 3.4.0.post20
accelerate 1.10.0
bitsandbytes 0.47.0
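In case it's useful, here's a quick way to dump the GPU/CUDA details that seem relevant (just a sketch):

import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # RTX 3090 reports (8, 6)
print("bf16 supported:", torch.cuda.is_bf16_supported())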
Is there a minimum torch/CUDA requirement to use this model? I'm running CUDA 12.4.
Thanks in advance!