Chat Template Issue with llama.cpp
I'm getting an error in llama.cpp when it parses the chat template; I'll try to work around it by passing in a manual jinja file. See the details below for the logs. Thanks!
Expected closing block tag at row 216, column 3:
Details:
I managed to convert the bf16 safetensors to bf16 GGUF like so:
numactl -N 1 -m 1 \
python \
convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile /mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF \
/mnt/data/models/ai-sage/GigaChat3-702B-A36B-preview-bf16/
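To sanity-check what actually got written into the GGUF header (including tokenizer.chat_template), here is a rough sketch using the gguf_dump.py script from llama.cpp's gguf-py package; the exact script path and the --no-tensors flag may differ between checkouts, so treat this as an assumption rather than exact syntax:
# dump only the KV metadata so the embedded chat template is easy to eyeball
python gguf-py/gguf/scripts/gguf_dump.py \
--no-tensors \
/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-BF16-00001-of-00031.gguf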
I then managed to quantize it to Q8_0 pure using ik_llama.cpp like so:
./build/bin/llama-quantize \
--pure \
/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-BF16-00001-of-00031.gguf \
/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf \
Q8_0 \
128
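A quick one-shot generation is a cheap smoke test before standing up the server. A minimal sketch, assuming llama-cli is built in the same build/bin directory and takes the usual flags (the prompt and token count here are arbitrary):
numactl -N 1 -m 1 \
./build/bin/llama-cli \
--model /mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf \
--threads 96 \
--numa numactl \
-p "Hello" \
-n 32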
It starts up in llama.cpp okay, but throws an error with the chat template:
$ export model=/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf
$ export SOCKET=1
numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/GigaChat3-702B-A36B-preview-GGUF \
--ctx-size 65536 \
-ctk q8_0 \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 96 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = GigaChat3 702B A36B Preview Bf16
llama_model_loader: - kv 3: general.finetune str = preview
llama_model_loader: - kv 4: general.basename str = GigaChat3
llama_model_loader: - kv 5: general.size_label str = 702B-A36B
.
.
.
llama_model_loader: - kv 49: tokenizer.chat_template str = {#--------TOOL RENDERING FUNCTIONS---...
llama_model_loader: - type f32: 379 tensors
llama_model_loader: - type q8_0: 761 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 695.00 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('</s>')
load: special tokens cache size = 14
load: token to piece cache size = 1.0295 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 7168
print_info: n_embd_inp = 7168
.
.
.
load_tensors: CPU model buffer size = 711682.93 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 65536
llama_context: n_ctx_seq = 65536
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 100000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.96 MiB
llama_kv_cache: CPU KV buffer size = 6544.00 MiB
llama_kv_cache: size = 6544.00 MiB ( 65536 cells, 64 layers, 4/1 seqs), K (q8_0): 2448.00 MiB, V (f16): 4096.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 2616.11 MiB
llama_context: graph nodes = 5940
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected closing block tag at row 216, column 3:
{%- set DEVSYSTEM =
"""<role_description>
^
Description of the roles available in the dialog.
I just deleted the entire {%- set DEVSYSTEM = ... %} block, which seems to fix it. Maybe it's because you mention <role_description> twice and it isn't escaped towards the bottom of that section?
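As an alternative to re-converting, the fixed template can be supplied at runtime. A rough sketch, assuming the template lives under the chat_template key of the original repo's tokenizer_config.json (newer repos sometimes ship it as a separate chat_template.jinja instead), that this llama-server build supports --chat-template-file, and with gigachat3-fixed.jinja as a placeholder filename:
# extract the template so the DEVSYSTEM block can be removed or escaped by hand
jq -r '.chat_template' \
/mnt/data/models/ai-sage/GigaChat3-702B-A36B-preview-bf16/tokenizer_config.json \
> gigachat3-fixed.jinja
# then point the server at the edited file (other flags as in the command above)
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/GigaChat3-702B-A36B-preview-GGUF \
--jinja \
--chat-template-file gigachat3-fixed.jinja \
--host 127.0.0.1 \
--port 8080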
So the Q8_0 seems to be working with llama.cpp... doing more testing...
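For that testing, a minimal request against the OpenAI-compatible chat endpoint exercises the template end to end; this assumes the server is still on 127.0.0.1:8080 with the alias from the command above:
# single chat turn through /v1/chat/completions so the jinja template actually gets applied
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ubergarm/GigaChat3-702B-A36B-preview-GGUF",
"messages": [{"role": "user", "content": "Say hello in one sentence."}],
"max_tokens": 64
}'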
Got the little one going with a small PR: https://huggingface.co/ubergarm/GigaChat3-10B-A1.8B-GGUF/tree/main