Chat Template Issue with llama.cpp

#1
by ubergarm

I'm getting an error from llama.cpp's chat template parsing; I'll try playing with it by passing in a manual Jinja file. See the details below for logs. Thanks!

Expected closing block tag at row 216, column 3:

👈 Details

I managed to convert the bf16 safetensors to bf16 GGUF like so:

numactl -N 1 -m 1 \
python \
    convert_hf_to_gguf.py \
    --outtype bf16 \
    --split-max-size 50G \
    --outfile /mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF \
    /mnt/data/models/ai-sage/GigaChat3-702B-A36B-preview-bf16/
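
Before quantizing, a quick way to confirm the chat template (and the rest of the tokenizer metadata) actually made it into the GGUF is to list the KV keys with the gguf-py package from the llama.cpp repo. An untested sketch, assuming pip install gguf and the first shard path from above:

from gguf import GGUFReader

reader = GGUFReader("/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-BF16-00001-of-00031.gguf")
# tokenizer.chat_template should show up among the KV keys
for name in reader.fields:
    print(name)

I believe the same package also installs a gguf-dump CLI that prints the keys along with their values.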

I then managed to quantize it to a pure Q8_0 using ik_llama.cpp like so:

./build/bin/llama-quantize \
    --pure \
    /mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-BF16-00001-of-00031.gguf \
    /mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf \
    Q8_0 \
    128  # nthreads (last positional arg)
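
As a sanity check that --pure really produced a uniform quant (everything Q8_0 apart from the f32 norm tensors), the tensor types can be tallied with the same gguf-py package as above. Another untested sketch:

from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf")
# tally quantization types across all tensors
print(Counter(t.tensor_type.name for t in reader.tensors))

The counts should line up with the loader output below (379 f32 and 761 q8_0 tensors).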

It starts up in llama.cpp okay, but throws an error with the chat template:

$ export model=/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf
$ export SOCKET=1

numactl -N "$SOCKET" -m "$SOCKET" \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/GigaChat3-702B-A36B-preview-GGUF \
    --ctx-size 65536 \
    -ctk q8_0 \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = GigaChat3 702B A36B Preview Bf16
llama_model_loader: - kv   3:                           general.finetune str              = preview
llama_model_loader: - kv   4:                           general.basename str              = GigaChat3
llama_model_loader: - kv   5:                         general.size_label str              = 702B-A36B

...

llama_model_loader: - kv  49:                    tokenizer.chat_template str              = {#--------TOOL RENDERING FUNCTIONS---...
llama_model_loader: - type  f32:  379 tensors
llama_model_loader: - type q8_0:  761 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 695.00 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 14
load: token to piece cache size = 1.0295 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 7168
print_info: n_embd_inp       = 7168

...

load_tensors:          CPU model buffer size = 711682.93 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 65536
llama_context: n_ctx_seq     = 65536
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 4096
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = true
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.96 MiB
llama_kv_cache:        CPU KV buffer size =  6544.00 MiB
llama_kv_cache: size = 6544.00 MiB ( 65536 cells,  64 layers,  4/1 seqs), K (q8_0): 2448.00 MiB, V (f16): 4096.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =  2616.11 MiB
llama_context: graph nodes  = 5940
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_chat_templates_init: failed to parse chat template (defaulting to chatml): Expected closing block tag at row 216, column 3:
{%- set DEVSYSTEM =
"""<role_description>
  ^
Description of the roles available in the dialog.

I just deleted the entire {%- set DEVSYSTEM =... block, which seems to fix it. Maybe that's because <role_description> is mentioned twice and isn't escaped towards the bottom of that section?
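
For the manual Jinja file approach mentioned at the top: instead of editing the GGUF itself, the embedded template can be pulled out, patched by hand (e.g. deleting or escaping that DEVSYSTEM block), and passed back in with llama-server's --chat-template-file flag. A rough sketch with the same gguf-py assumption as above; the output filename is just an example:

from gguf import GGUFReader

reader = GGUFReader("/mnt/data/models/ubergarm/GigaChat3-702B-A36B-preview-GGUF/GigaChat3-702B-A36B-preview-Q8_0.gguf")
field = reader.fields["tokenizer.chat_template"]
# string KV values live in the part indexed by field.data[0]
template = bytes(field.parts[field.data[0]]).decode("utf-8")
with open("gigachat3.jinja", "w") as f:  # arbitrary output name
    f.write(template)

After hand-editing gigachat3.jinja, add --chat-template-file gigachat3.jinja next to --jinja in the llama-server command above.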

So the Q8_0 seems to be working with llama.cpp... doing more testing...
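
In the meantime, here's a quick smoke test against the server's OpenAI-compatible endpoint, which exercises the chat template server-side. A sketch assuming the requests package; the alias and port match the llama-server command above:

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "ubergarm/GigaChat3-702B-A36B-preview-GGUF",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])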
