Problem with quantization conversion. "SOLVED"
For anyone who has this problem too!
I haven't loaded them yet, but I was having problems converting them to q8 (or anything else) via llama.cpp. It would give me this error:
"INFO:hf-to-gguf:Loading model: gemma-3-27b-it ERROR:hf-to-gguf:Model Gemma3ForConditionalGeneration is not supported"
I updated them around 15 hours ago, but I found I had to do this too:
pip install git+https://github.com/huggingface/[email protected]
After that, I just updated everything else too, including llama.cpp:
pip install --upgrade huggingface-hub
pip install --upgrade datasets huggingface-hub
pip install numpy pandas
pip install --upgrade datasets transformers huggingface-hub
python -m venv venv
venv\Scripts\activate
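If you want to double-check that the right transformers build is active inside the venv before running the converter, a quick sanity check like this works (just a sketch; it only imports the class, it doesn't load the model):

import transformers
from transformers import Gemma3ForConditionalGeneration  # present in the Gemma-3 preview tag of transformers

print(transformers.__version__)  # prints the installed version
# if the import above raises ImportError, the pip install from the preview tag didn't take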
Then it starts the conversion (quantization!), like this, for q8_0:
python convert_hf_to_gguf.py "D:\AI\gemma-3-27b-it" --outfile "C:\ai\llama.cpp\new_model\new.gguf" --outtype q8_0
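If you'd like to smoke-test the resulting GGUF directly from Python, something like this should work, assuming you have the llama-cpp-python package installed (the path is my output file from above; the prompt is just an example):

from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\ai\llama.cpp\new_model\new.gguf",  # the --outfile from the conversion step
    n_ctx=4096,        # context size for the test
    n_gpu_layers=-1,   # offload everything to the GPU if a CUDA build is installed
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short Arduino sketch that blinks the built-in LED."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])

The same check can be done with the llama-cli binary from llama.cpp; this is just a quick way to confirm the file loads and generates.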
Hope it was helpful.
@Renu11, thank you.
Also, I can now confirm that it worked! Right now I'm working with it to write Arduino code. I'm using the model converted to 8-bit (q8), and if I give it correct and clear instructions it does a good job at coding, at least so far in my testing. It does hallucinate a bit on my specific code, I think because its training data for that code is more than a month old: the output wasn't written properly for Arduino but used the Espressif libraries instead.
Also, a big thanks to Google and everybody who was and is involved in this project, and to whoever else is helping everyone for free.
I'm running into issues using BitsAndBytes for quantization. I keep getting this cryptic CUDA error:
import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, quantization_config=nf4_config
).eval()
..... (rest of code)
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
When I don't use BitsAndBytes (I can't actually run the model practically this way, since I only have a single RTX 3090; I just did it for debugging), I get this (presumably also CUDA-related) error:
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_dir, device_map='auto', torch_dtype=torch.bfloat16
).eval()
....
output = model.generate(**inputs, max_new_tokens=100)
Error:
output = model.generate(**inputs, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: cutlassF: no kernel found to launch!
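For reference, the elided input preparation in both cases follows the standard Gemma 3 processor flow, roughly like this (a sketch; the prompt here is just a placeholder):

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_dir)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain what this code does."}]},
]

# apply_chat_template tokenizes and returns a dict of tensors ready for generate()
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)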
Libraries:
torch 2.6.0+cu124
transformers 4.55.2
triton-windows 3.4.0.post20
accelerate 1.10.0
bitsandbytes 0.47.0
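In case it's useful, here's a quick way to dump the GPU/CUDA details that seem relevant (just a sketch):

import torch

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # RTX 3090 reports (8, 6)
print("bf16 supported:", torch.cuda.is_bf16_supported())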
Is there a minimum torch/CUDA requirement to use this model? I'm running CUDA 12.4.
Thanks in advance!