turboderp committed on
Commit 7933858 · verified · 1 Parent(s): feea792

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ quantization_config.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,27 +1,265 @@
1
  ---
2
  license: mit
3
- base_model: zai-org/GLM-4.5V
4
- base_model_relation: quantized
5
- quantized_by: turboderp
6
- tags:
7
- - exl3
8
  ---
9
 
10
- EXL3 quants of [GLM-4.5V](https://huggingface.co/zai-org/GLM-4.5V)
11
 
12
- ⚠️ Requires ExLlamaV3 v0.0.15 (or v0.0.14 `dev` branch)
 
 
13
 
14
- Base bitrates:
15
 
16
- [2.00 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/2.00bpw)
17
- [3.00 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/3.00bpw)
18
- [4.00 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/4.00bpw)
 
 
 
19
 
20
- Optimized:
21
 
22
- [2.13 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/2.13bpw)
23
- [2.32 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/2.32bpw)
24
- [2.55 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/2.55bpw)
25
- [2.80 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/2.80bpw)
26
- [3.07 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/3.07bpw)
27
- [3.49 bits per weight](https://huggingface.co/turboderp/GLM-4.5V-exl3/tree/3.49bpw)
 
1
  ---
2
+ base_model:
3
+ - zai-org/GLM-4.5-Air-Base
4
+ language:
5
+ - zh
6
+ - en
7
+ library_name: transformers
8
  license: mit
9
+ pipeline_tag: image-text-to-text
10
  ---
11
 
12
+ # GLM-4.5V
13
 
14
+ <div align="center">
15
+ <img src=https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/logo.svg width="40%"/>
16
+ </div>
17
 
18
+ This model is part of the GLM-V family of models, introduced in the paper [GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning](https://huggingface.co/papers/2507.01006).
19
 
20
+ - **Paper**: [https://huggingface.co/papers/2507.01006](https://huggingface.co/papers/2507.01006)
21
+ - **GitHub Repository**: [https://github.com/zai-org/GLM-V/](https://github.com/zai-org/GLM-V/)
22
+ - **Online Demo**: [https://chat.z.ai/](https://chat.z.ai/)
23
+ - **API Access**: [ZhipuAI Open Platform](https://docs.z.ai/guides/vlm/glm-4.5v)
24
+ - **Desktop Assistant App**: [https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App](https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App)
25
+ - **Discord Community**: [https://discord.com/invite/8cnQKdAprg](https://discord.com/invite/8cnQKdAprg)
26
 
27
+ ## Introduction & Model Overview
28
 
29
+ Vision-language models (VLMs) have become a key cornerstone of intelligent systems. As real-world AI tasks grow increasingly complex, VLMs urgently need to enhance reasoning capabilities beyond basic multimodal perception — improving accuracy, comprehensiveness, and intelligence — to enable complex problem solving, long-context understanding, and multimodal agents.
30
+
31
+ Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.
32
+
33
+ **This Hugging Face repository hosts the `GLM-4.5V` model, part of the `GLM-V` series.**
34
+
35
+ ### GLM-4.5V
36
+
37
+ GLM-4.5V is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks. It covers common tasks such as image, video, and document understanding, as well as GUI agent operations.
38
+
39
+ ![GLM-4.5V Benchmarks](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench_45v.jpeg)
40
+
41
+ Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:
42
+ - **Image reasoning** (scene understanding, complex multi-image analysis, spatial recognition)
43
+ - **Video understanding** (long video segmentation and event recognition)
44
+ - **GUI tasks** (screen reading, icon recognition, desktop operation assistance)
45
+ - **Complex chart & long document parsing** (research report analysis, information extraction)
46
+ - **Grounding** (precise visual element localization)
47
+
48
+ The model also introduces a **Thinking Mode** switch, allowing users to balance between quick responses and deep reasoning. This switch works the same as in the `GLM-4.5` language model.
49
+
50
+ ### GLM-4.1V-9B
51
+
52
+ *Contextual information about GLM-4.1V-9B is provided for completeness, as it is part of the GLM-V series and foundational to GLM-4.5V's development.*
53
+
54
+ Built on the [GLM-4-9B-0414](https://github.com/zai-org/GLM-4) foundation model, the **GLM-4.1V-9B-Thinking** model introduces a reasoning paradigm and uses RLCS (Reinforcement Learning with Curriculum Sampling) to comprehensively enhance model capabilities. It achieves the strongest performance among 10B-level VLMs and matches or surpasses the much larger Qwen-2.5-VL-72B in 18 benchmark tasks.
55
+
56
+ We also open-sourced the base model **GLM-4.1V-9B-Base** to support researchers in exploring the limits of vision-language model capabilities.
57
+
58
+ ![Reinforcement Learning with Curriculum Sampling (RLCS)](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/rl.jpeg)
59
+
60
+ Compared with the previous generation CogVLM2 and GLM-4V series, **GLM-4.1V-Thinking** brings:
61
+ 1. The series’ first reasoning-focused model, excelling in multiple domains beyond mathematics.
62
+ 2. **64k** context length support.
63
+ 3. Support for **any aspect ratio** and up to **4k** image resolution.
64
+ 4. A bilingual (Chinese/English) open-source version.
65
+
66
+ GLM-4.1V-9B-Thinking integrates the **Chain-of-Thought** reasoning mechanism, improving accuracy, richness, and interpretability. It leads on 23 out of 28 benchmark tasks at the 10B parameter scale, and outperforms Qwen-2.5-VL-72B on 18 tasks despite its smaller size.
67
+
68
+ ![GLM-4.1V-9B Benchmarks](https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/resources/bench.jpeg)
69
+
70
+ ## Project Updates
71
+
72
+ - 🔥 **News**: `2025/08/11`: We released **GLM-4.5V** with significant improvements across multiple benchmarks. We also open-sourced our handcrafted **desktop assistant app** for debugging. Once connected to GLM-4.5V, it can capture visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it into your own multimodal assistant. Click [here](https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App) to download the installer or [build from source](https://github.com/zai-org/GLM-V/blob/main/examples/vllm-chat-helper/README.md)!
73
+ - **News**: `2025/07/16`: We have open-sourced the **VLM Reward System** used to train GLM-4.1V-Thinking. View the [code repository](https://github.com/zai-org/GLM-V/tree/main/glmv_reward) and run locally: `python examples/reward_system_demo.py`.
74
+ - **News**: `2025/07/01`: We released **GLM-4.1V-9B-Thinking** and its [technical report](https://arxiv.org/abs/2507.01006).
75
+
76
+ ## Model Implementation Code
77
+
78
+ * GLM-4.5V model algorithm: see the full implementation in [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4v_moe).
79
+ * GLM-4.1V-9B-Thinking model algorithm: see the full implementation in [transformers](https://github.com/huggingface/transformers/tree/main/src/transformers/models/glm4v).
80
+ * Both models share identical multimodal preprocessing but use different conversation templates; be careful to use the correct one.
81
+
82
+ ## Usage
83
+
84
+ ### Environment Installation
85
+
86
+ For `SGLang` and `transformers`:
87
+
88
+ ```bash
89
+ pip install "transformers>=4.57.1"
90
+ pip install "sglang>=0.5.3"
91
+ ```
92
+
93
+ For `vLLM`:
94
+
95
+ ```bash
96
+ pip install "vllm>=0.10.2"
97
+ ```
98
+
99
+ ### Quick Start with Transformers
100
+
101
+ ```python
102
+ from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
103
+ import torch
104
+
105
+ MODEL_PATH = "zai-org/GLM-4.5V"
106
+ messages = [
107
+ {
108
+ "role": "user",
109
+ "content": [
110
+ {
111
+ "type": "image",
112
+ "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
113
+ },
114
+ {
115
+ "type": "text",
116
+ "text": "describe this image"
117
+ }
118
+ ],
119
+ }
120
+ ]
121
+ processor = AutoProcessor.from_pretrained(MODEL_PATH)
122
+ model = Glm4vMoeForConditionalGeneration.from_pretrained(
123
+ pretrained_model_name_or_path=MODEL_PATH,
124
+ torch_dtype="auto",
125
+ device_map="auto",
126
+ )
127
+ inputs = processor.apply_chat_template(
128
+ messages,
129
+ tokenize=True,
130
+ add_generation_prompt=True,
131
+ return_dict=True,
132
+ return_tensors="pt"
133
+ ).to(model.device)
134
+ inputs.pop("token_type_ids", None)
135
+ generated_ids = model.generate(**inputs, max_new_tokens=8192)
136
+ output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
137
+ print(output_text)
138
+ ```
139
+
140
+ The special tokens `<|begin_of_box|>` and `<|end_of_box|>` in the response mark the answer’s bounding box in the image. The bounding box is given as four numbers, for example `[x1, y1, x2, y2]`, where `(x1, y1)` is the top-left corner and `(x2, y2)` is the bottom-right corner. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: it encloses the coordinates of the box. These coordinates are relative values between 0 and 1000, normalized to the image size.
141
+
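+ As a minimal illustration (not part of the upstream example), the boxes can be pulled out of `output_text` with a regular expression and rescaled from the normalized 0–1000 range to pixels; the image size below is an assumption.
+
+ ```python
+ import re
+
+ def extract_boxes(output_text: str, image_width: int, image_height: int):
+     """Collect [x1, y1, x2, y2] boxes found between <|begin_of_box|> and
+     <|end_of_box|> and rescale them from the 0-1000 range to pixel coordinates."""
+     boxes = []
+     for span in re.findall(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", output_text, re.S):
+         nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", span)]
+         if len(nums) == 4:  # bracket style may vary, so only the numbers are used
+             x1, y1, x2, y2 = nums
+             boxes.append([x1 / 1000 * image_width, y1 / 1000 * image_height,
+                           x2 / 1000 * image_width, y2 / 1000 * image_height])
+     return boxes
+
+ # e.g. extract_boxes(output_text, image_width=1024, image_height=768)
+ ```
+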
142
+ For more code information, please visit our [GitHub](https://github.com/zai-org/GLM-V/).
143
+
144
+ ### Grounding Example
145
+
146
+ GLM-4.5V has precise grounding capabilities. Given a prompt that requests the location of a specific object, GLM-4.5V can reason step by step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats, for example:
147
+
148
+ > - Help me to locate <expr> in the image and give me its bounding boxes.
149
+ > - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description. <expr>
150
+
151
+ Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $$[x_1,y_1,x_2,y_2]$$ composed of the coordinates of the top-left and bottom-right corners, where each value is normalized by the image width (for x) or height (for y) and scaled by 1000.
152
+
153
+ In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` are used to mark the image bounding box in the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: to enclose the coordinates of the box.
154
+
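+ For instance, a grounding request can reuse the `processor`/`model` objects from the Quick Start above; the image URL and object description here are placeholders, not part of the upstream examples.
+
+ ```python
+ # Hypothetical grounding query: substitute your own description for <expr>.
+ expr = "the red backpack on the left"
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "https://example.com/scene.jpg"},
+             {"type": "text",
+              "text": f"Help me to locate {expr} in the image and give me its bounding boxes."},
+         ],
+     }
+ ]
+ # Apply the chat template and generate exactly as in the Quick Start, then parse the
+ # <|begin_of_box|> ... <|end_of_box|> span from the decoded output.
+ ```
+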
155
+ ### GUI Agent Example
156
+
157
+ - `examples/gui-agent`: Demonstrates prompt construction and output handling for GUI Agents, including strategies for mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.
158
+
159
+ ### Quick Demo Application
160
+
161
+ - `examples/vlm-helper`: A desktop assistant for GLM multimodal models (mainly GLM-4.5V, compatible with GLM-4.1V), supporting text, images, videos, PDFs, PPTs, and more. Connects to the GLM multimodal API for intelligent services across scenarios. Download the [installer](https://huggingface.co/spaces/zai-org/GLM-4.5V-Demo-App) or [build from source](https://github.com/zai-org/GLM-V/blob/main/examples/vlm-helper/README.md).
162
+
163
+ ### vLLM
164
+
165
+ ```bash
166
+ vllm serve zai-org/GLM-4.5V \
167
+ --tensor-parallel-size 4 \
168
+ --tool-call-parser glm45 \
169
+ --reasoning-parser glm45 \
170
+ --enable-auto-tool-choice \
171
+ --served-model-name glm-4.5v \
172
+ --allowed-local-media-path / \
173
+ --media-io-kwargs '{"video": {"num_frames": -1}}'
174
+ ```
175
+
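+ Once the server is running, any OpenAI-compatible client can query it. The sketch below assumes the default vLLM port (8000) and the `glm-4.5v` name set via `--served-model-name`; the image is the same sample used in the Quick Start.
+
+ ```python
+ from openai import OpenAI
+
+ # No API key is configured on the server above, so any placeholder value works.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="glm-4.5v",  # matches --served-model-name
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "image_url",
+              "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"}},
+             {"type": "text", "text": "describe this image"},
+         ],
+     }],
+ )
+ print(response.choices[0].message.content)
+ ```
+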
176
+ ### SGLang
177
+
178
+ ```shell
179
+ python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
180
+ --tp-size 4 \
181
+ --tool-call-parser glm45 \
182
+ --reasoning-parser glm45 \
183
+ --served-model-name glm-4.5v \
184
+ --port 8000 \
185
+ --host 0.0.0.0
186
+ ```
187
+
188
+ Notes:
189
+ - We recommend using the `FA3` attention backend in SGLang for higher inference performance and lower memory usage:
190
+ `--attention-backend fa3 --mm-attention-backend fa3 --enable-torch-compile`
191
+ Without `FA3`, large video inference may cause out-of-memory (OOM) errors.
192
+ We also recommend increasing `SGLANG_VLM_CACHE_SIZE_MB` (e.g., `1024`) to provide sufficient cache space for video understanding.
193
+ - When using `vLLM` and `SGLang`, thinking mode is enabled by default. To disable the thinking switch, add the following to your request (see the sketch after these notes):
194
+ `extra_body={"chat_template_kwargs": {"enable_thinking": False}}`
195
+
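+ A minimal sketch of passing that switch through an OpenAI-compatible client (the address and model name follow the serving commands above):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ # Thinking mode is on by default; extra_body forwards the chat-template switch.
+ response = client.chat.completions.create(
+     model="glm-4.5v",
+     messages=[{"role": "user", "content": "Give a one-sentence introduction of GLM-4.5V."}],
+     extra_body={"chat_template_kwargs": {"enable_thinking": False}},
+ )
+ print(response.choices[0].message.content)
+ ```
+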
196
+ ## Model Fine-tuning
197
+
198
+ [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) already supports fine-tuning of the GLM-4.5V and GLM-4.1V-9B-Thinking models. Below is an example of dataset construction using two images. Organize your dataset into `finetune.json` in the following format; this particular example is for fine-tuning GLM-4.1V-9B.
199
+
200
+ ```json
201
+ [
202
+ {
203
+ "messages": [
204
+ {
205
+ "content": "<image>Who are they?",
206
+ "role": "user"
207
+ },
208
+ {
209
+ "content": "<think>
210
+ User asked me to observe the image and find the answer. I know they are Kane and Goretzka from Bayern Munich.</think>
211
+ <answer>They're Kane and Goretzka from Bayern Munich.</answer>",
212
+ "role": "assistant"
213
+ },
214
+ {
215
+ "content": "<image>What are they doing?",
216
+ "role": "user"
217
+ },
218
+ {
219
+ "content": "<think>
220
+ I need to observe what these people are doing. Oh, they are celebrating on the soccer field.</think>
221
+ <answer>They are celebrating on the soccer field.</answer>",
222
+ "role": "assistant"
223
+ }
224
+ ],
225
+ "images": [
226
+ "mllm_demo_data/1.jpg",
227
+ "mllm_demo_data/2.jpg"
228
+ ]
229
+ }
230
+ ]
231
+ ```
232
+
233
+ 1. The content inside `<think> ... </think>` will **not** be stored as conversation history or in fine-tuning data.
234
+ 2. The `<image>` tag will be replaced with the corresponding image information.
235
+ 3. For the GLM-4.5V model, the `<answer>` and `</answer>` tags should be removed (a small conversion helper is sketched below).
236
+
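+ The conversion in point 3 can be scripted; the helper below is a sketch (not part of LLaMA-Factory), and the file names are placeholders.
+
+ ```python
+ import json
+ import re
+
+ def strip_answer_tags(path_in: str, path_out: str) -> None:
+     """Remove <answer> ... </answer> wrappers from assistant turns so a
+     GLM-4.1V-style finetune.json can be reused for GLM-4.5V."""
+     with open(path_in, encoding="utf-8") as f:
+         data = json.load(f)
+     for sample in data:
+         for message in sample["messages"]:
+             if message["role"] == "assistant":
+                 message["content"] = re.sub(r"</?answer>", "", message["content"])
+     with open(path_out, "w", encoding="utf-8") as f:
+         json.dump(data, f, ensure_ascii=False, indent=2)
+
+ # e.g. strip_answer_tags("finetune.json", "finetune_glm45v.json")
+ ```
+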
237
+ Then, you can fine-tune following the standard LLaMA-Factory procedure.
238
+
239
+ ## Fixed and Remaining Issues
240
+
241
+ Since the release of GLM-4.1V, we have addressed many community-reported issues. In GLM-4.5V, common issues such as repetitive thinking and incorrect output formatting are alleviated. However, some limitations remain:
242
+
243
+ 1. In frontend code reproduction cases, the model may output raw HTML without proper markdown wrapping. There may also be character escaping issues, potentially causing rendering errors. We provide a [patch](https://github.com/zai-org/GLM-V/blob/main/inference/html_detector.py) to fix most cases.
244
+ 2. Pure text Q&A capabilities still have room for improvement, as this release focused primarily on multimodal scenarios.
245
+ 3. In some cases, the model may overthink or repeat content, especially for complex prompts.
246
+ 4. Occasionally, the model may restate the answer at the end.
247
+ 5. There are some perception issues, with room for improvement in tasks such as counting and identifying specific individuals.
248
+
249
+ We welcome feedback in the issue section and will address problems as quickly as possible.
250
+
251
+ ## Citation
252
+
253
+ If you use this model, please cite the following paper:
254
+
255
+ ```bibtex
256
+ @misc{vteam2025glm45vglm41vthinkingversatilemultimodal,
257
+ title={GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
258
+ author={V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Bin Chen and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiale Zhu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingdao Liu and Mingde Xu and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Tianyu Tong and Wenkai Li and Wei Jia and Xiao Liu and Xiaohan Zhang and Xin Lyu and Xinyue Fan and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yanzi Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuting Wang and Yu Wang and Yuxuan Zhang and Zhao Xue and Zhenyu Hou and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},
259
+ year={2025},
260
+ eprint={2507.01006},
261
+ archivePrefix={arXiv},
262
+ primaryClass={cs.CV},
263
+ url={https://arxiv.org/abs/2507.01006},
264
+ }
265
+ ```
chat_template.jinja ADDED
@@ -0,0 +1,118 @@
1
+ [gMASK]<sop>
2
+ {%- if tools -%}
3
+ <|system|>
4
+ # Tools
5
+
6
+ You may call one or more functions to assist with the user query.
7
+
8
+ You are provided with function signatures within <tools></tools> XML tags:
9
+ <tools>
10
+ {% for tool in tools %}
11
+ {{ tool | tojson(ensure_ascii=False) }}
12
+ {% endfor %}
13
+ </tools>
14
+
15
+ For each function call, output the function name and arguments within the following XML format:
16
+ <tool_call>{function-name}
17
+ <arg_key>{arg-key-1}</arg_key>
18
+ <arg_value>{arg-value-1}</arg_value>
19
+ <arg_key>{arg-key-2}</arg_key>
20
+ <arg_value>{arg-value-2}</arg_value>
21
+ ...
22
+ </tool_call>{%- endif -%}
23
+ {%- macro visible_text(content) -%}
24
+ {%- if content is string -%}
25
+ {{- content }}
26
+ {%- elif content is iterable and content is not mapping -%}
27
+ {%- for item in content -%}
28
+ {%- if item is mapping and item.type == 'text' -%}
29
+ {{- item.text }}
30
+ {%- elif item is mapping and (item.type == 'image' or 'image' in item) -%}
31
+ <|begin_of_image|><|image|><|end_of_image|>
32
+ {%- elif item is mapping and (item.type == 'video' or 'video' in item) -%}
33
+ <|begin_of_video|><|video|><|end_of_video|>
34
+ {%- elif item is string -%}
35
+ {{- item }}
36
+ {%- endif -%}
37
+ {%- endfor -%}
38
+ {%- else -%}
39
+ {{- content }}
40
+ {%- endif -%}
41
+ {%- endmacro -%}
42
+ {%- set ns = namespace(last_user_index=-1) %}
43
+ {%- for m in messages %}
44
+ {%- if m.role == 'user' %}
45
+ {% set ns.last_user_index = loop.index0 -%}
46
+ {%- endif %}
47
+ {%- endfor %}
48
+ {% for m in messages %}
49
+ {%- if m.role == 'user' -%}<|user|>
50
+ {% if m.content is string %}
51
+ {{ m.content }}
52
+ {%- else %}
53
+ {%- for item in m.content %}
54
+ {% if item.type == 'video' or 'video' in item %}
55
+ <|begin_of_video|><|video|><|end_of_video|>{% elif item.type == 'image' or 'image' in item %}
56
+ <|begin_of_image|><|image|><|end_of_image|>{% elif item.type == 'text' %}
57
+ {{ item.text }}
58
+ {%- endif %}
59
+ {%- endfor %}
60
+ {%- endif %}
61
+ {{- '/nothink' if (enable_thinking is defined and not enable_thinking and not visible_text(m.content).endswith("/nothink")) else '' -}}
62
+ {%- elif m.role == 'assistant' -%}
63
+ <|assistant|>
64
+ {%- set reasoning_content = '' %}
65
+ {%- set content = visible_text(m.content) %}
66
+ {%- if m.reasoning_content is string %}
67
+ {%- set reasoning_content = m.reasoning_content %}
68
+ {%- else %}
69
+ {%- if '</think>' in content %}
70
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
71
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
72
+ {%- endif %}
73
+ {%- endif %}
74
+ {%- if loop.index0 > ns.last_user_index and reasoning_content -%}
75
+ {{ '\n<think>' + reasoning_content.strip() + '</think>'}}
76
+ {%- else -%}
77
+ {{ '\n<think></think>' }}
78
+ {%- endif -%}
79
+ {%- if content.strip() -%}
80
+ {{ '\n' + content.strip() }}
81
+ {%- endif -%}
82
+ {% if m.tool_calls %}
83
+ {% for tc in m.tool_calls %}
84
+ {%- if tc.function %}
85
+ {%- set tc = tc.function %}
86
+ {%- endif %}
87
+ {{ '\n<tool_call>' + tc.name }}
88
+ {% set _args = tc.arguments %}
89
+ {% for k, v in _args.items() %}
90
+ <arg_key>{{ k }}</arg_key>
91
+ <arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>
92
+ {% endfor %}
93
+ </tool_call>{% endfor %}
94
+ {% endif %}
95
+ {%- elif m.role == 'tool' -%}
96
+ {%- if m.content is string -%}
97
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
98
+ {{- '<|observation|>' }}
99
+ {%- endif %}
100
+ {{- '\n<tool_response>\n' }}
101
+ {{- m.content }}
102
+ {{- '\n</tool_response>' }}
103
+ {%- else -%}
104
+ <|observation|>{% for tr in m.content %}
105
+
106
+ <tool_response>
107
+ {{ tr.output if tr.output is defined else tr }}
108
+ </tool_response>{% endfor -%}
109
+ {% endif -%}
110
+ {%- elif m.role == 'system' -%}
111
+ <|system|>
112
+ {{ visible_text(m.content) }}
113
+ {%- endif -%}
114
+ {%- endfor -%}
115
+ {%- if add_generation_prompt -%}
116
+ <|assistant|>
117
+ {{'<think></think>\n' if (enable_thinking is defined and not enable_thinking) else ''}}
118
+ {%- endif -%}
config.json ADDED
@@ -0,0 +1,88 @@
1
+ {
2
+ "architectures": [
3
+ "Glm4vMoeForConditionalGeneration"
4
+ ],
5
+ "model_type": "glm4v_moe",
6
+ "image_start_token_id": 151339,
7
+ "image_end_token_id": 151340,
8
+ "video_start_token_id": 151341,
9
+ "video_end_token_id": 151342,
10
+ "image_token_id": 151363,
11
+ "video_token_id": 151364,
12
+ "tie_word_embeddings": false,
13
+ "transformers_version": "4.57.1",
14
+ "text_config": {
15
+ "model_type": "glm4v_moe_text",
16
+ "pad_token_id": 151329,
17
+ "vocab_size": 151552,
18
+ "eos_token_id": [
19
+ 151329,
20
+ 151336,
21
+ 151338
22
+ ],
23
+ "head_dim": 128,
24
+ "attention_bias": true,
25
+ "attention_dropout": 0.0,
26
+ "first_k_dense_replace": 1,
27
+ "hidden_act": "silu",
28
+ "hidden_size": 4096,
29
+ "initializer_range": 0.02,
30
+ "intermediate_size": 10944,
31
+ "max_position_embeddings": 65536,
32
+ "moe_intermediate_size": 1408,
33
+ "n_group": 1,
34
+ "n_routed_experts": 128,
35
+ "n_shared_experts": 1,
36
+ "norm_topk_prob": true,
37
+ "num_attention_heads": 96,
38
+ "num_experts_per_tok": 8,
39
+ "num_hidden_layers": 46,
40
+ "num_key_value_heads": 8,
41
+ "partial_rotary_factor": 0.5,
42
+ "rms_norm_eps": 1e-05,
43
+ "dtype": "bfloat16",
44
+ "rope_scaling": {
45
+ "rope_type": "default",
46
+ "mrope_section": [
47
+ 8,
48
+ 12,
49
+ 12
50
+ ]
51
+ },
52
+ "rope_theta": 10000.0,
53
+ "routed_scaling_factor": 1.0,
54
+ "topk_group": 1,
55
+ "use_cache": true,
56
+ "use_qk_norm": false
57
+ },
58
+ "vision_config": {
59
+ "model_type": "glm4v_moe",
60
+ "attention_bias": false,
61
+ "attention_dropout": 0.0,
62
+ "depth": 24,
63
+ "hidden_act": "silu",
64
+ "hidden_size": 1536,
65
+ "image_size": 336,
66
+ "in_channels": 3,
67
+ "initializer_range": 0.02,
68
+ "intermediate_size": 10944,
69
+ "num_heads": 12,
70
+ "out_hidden_size": 4096,
71
+ "patch_size": 14,
72
+ "rms_norm_eps": 1e-05,
73
+ "spatial_merge_size": 2,
74
+ "temporal_patch_size": 2
75
+ },
76
+ "quantization_config": {
77
+ "quant_method": "exl3",
78
+ "version": "0.0.14",
79
+ "bits": 3.49,
80
+ "head_bits": 6,
81
+ "calibration": {
82
+ "rows": 250,
83
+ "cols": 2048
84
+ },
85
+ "out_scales": "auto",
86
+ "codebook": "mcg"
87
+ }
88
+ }
generation_config.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "do_sample": true,
4
+ "eos_token_id": [
5
+ 151329,
6
+ 151336,
7
+ 151338
8
+ ],
9
+ "pad_token_id": 151329,
10
+ "temperature": 1.0,
11
+ "top_k": 1,
12
+ "top_p": 0.0001,
13
+ "transformers_version": "4.57.1"
14
+ }
model-00001-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59f7581f0551104c7a7aac2a548ed342f32c2e114a28a939e8fee0fb1c619161
3
+ size 8420598380
model-00002-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bf5d4690c1f6019537e4d3594859b107aa7d5ee14d66a31032b97f1ea6dc2a43
3
+ size 8232796580
model-00003-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:69c1aa8b59c9c0a30517d675f35196e9350458462757ae6a03fb4e692704637e
3
+ size 7955977228
model-00004-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f004017af0778d4df8082a2ab58803af2513ebbcdec78a527ac87238970a9214
3
+ size 8370440524
model-00005-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:48acaa0f1813ccd1c67c7f33701bb21a28ae4814e04ac8912a535fcaaaea7b2a
3
+ size 8093616268
model-00006-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15486180f21ae04bb2884046d660551238faea2e2fbd8556c43998b4c5629ca6
3
+ size 8478331296
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "size": {"shortest_edge": 12544, "longest_edge": 9633792},
3
+ "do_rescale": true,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [0.48145466, 0.4578275, 0.40821073],
8
+ "image_std": [0.26862954, 0.26130258, 0.27577711],
9
+ "image_processor_type": "Glm4vImageProcessor",
10
+ "processor_class": "Glm4vProcessor"
11
+ }
quantization_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cef78289986207362a699c6fe84cef67630c5d9620522b565c054f259ba0ecca
3
+ size 22952478
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9340665016419c825c4bdabbcc9acc43b7ca2c68ce142724afa829abb1be5efd
3
+ size 19970699
tokenizer_config.json ADDED
@@ -0,0 +1,327 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "151329": {
4
+ "content": "<|endoftext|>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "151330": {
12
+ "content": "[MASK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "151331": {
20
+ "content": "[gMASK]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "151332": {
28
+ "content": "[sMASK]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "151333": {
36
+ "content": "<sop>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "151334": {
44
+ "content": "<eop>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "151335": {
52
+ "content": "<|system|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "151336": {
60
+ "content": "<|user|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "151337": {
68
+ "content": "<|assistant|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "151338": {
76
+ "content": "<|observation|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "151339": {
84
+ "content": "<|begin_of_image|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "151340": {
92
+ "content": "<|end_of_image|>",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "151341": {
100
+ "content": "<|begin_of_video|>",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ },
107
+ "151342": {
108
+ "content": "<|end_of_video|>",
109
+ "lstrip": false,
110
+ "normalized": false,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": true
114
+ },
115
+ "151343": {
116
+ "content": "<|begin_of_audio|>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": true
122
+ },
123
+ "151344": {
124
+ "content": "<|end_of_audio|>",
125
+ "lstrip": false,
126
+ "normalized": false,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": true
130
+ },
131
+ "151345": {
132
+ "content": "<|begin_of_transcription|>",
133
+ "lstrip": false,
134
+ "normalized": false,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": true
138
+ },
139
+ "151346": {
140
+ "content": "<|end_of_transcription|>",
141
+ "lstrip": false,
142
+ "normalized": false,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": true
146
+ },
147
+ "151347": {
148
+ "content": "<|code_prefix|>",
149
+ "lstrip": false,
150
+ "normalized": false,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": true
154
+ },
155
+ "151348": {
156
+ "content": "<|code_middle|>",
157
+ "lstrip": false,
158
+ "normalized": false,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": true
162
+ },
163
+ "151349": {
164
+ "content": "<|code_suffix|>",
165
+ "lstrip": false,
166
+ "normalized": false,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": true
170
+ },
171
+ "151350": {
172
+ "content": "<think>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "151351": {
180
+ "content": "</think>",
181
+ "lstrip": false,
182
+ "normalized": false,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "151352": {
188
+ "content": "<tool_call>",
189
+ "lstrip": false,
190
+ "normalized": false,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "151353": {
196
+ "content": "</tool_call>",
197
+ "lstrip": false,
198
+ "normalized": false,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "151354": {
204
+ "content": "<tool_response>",
205
+ "lstrip": false,
206
+ "normalized": false,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "151355": {
212
+ "content": "</tool_response>",
213
+ "lstrip": false,
214
+ "normalized": false,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "151356": {
220
+ "content": "<arg_key>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": false
226
+ },
227
+ "151357": {
228
+ "content": "</arg_key>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": false
234
+ },
235
+ "151358": {
236
+ "content": "<arg_value>",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": false
242
+ },
243
+ "151359": {
244
+ "content": "</arg_value>",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": false
250
+ },
251
+ "151360": {
252
+ "content": "/nothink",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "151361": {
260
+ "content": "<|begin_of_box|>",
261
+ "lstrip": false,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": false
266
+ },
267
+ "151362": {
268
+ "content": "<|end_of_box|>",
269
+ "lstrip": false,
270
+ "normalized": false,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "151363": {
276
+ "content": "<|image|>",
277
+ "lstrip": false,
278
+ "normalized": false,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "151364": {
284
+ "content": "<|video|>",
285
+ "lstrip": false,
286
+ "normalized": false,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ }
291
+ },
292
+ "additional_special_tokens": [
293
+ "<|endoftext|>",
294
+ "[MASK]",
295
+ "[gMASK]",
296
+ "[sMASK]",
297
+ "<sop>",
298
+ "<eop>",
299
+ "<|system|>",
300
+ "<|user|>",
301
+ "<|assistant|>",
302
+ "<|observation|>",
303
+ "<|begin_of_image|>",
304
+ "<|end_of_image|>",
305
+ "<|begin_of_video|>",
306
+ "<|end_of_video|>",
307
+ "<|begin_of_audio|>",
308
+ "<|end_of_audio|>",
309
+ "<|image|>",
310
+ "<|video|>",
311
+ "<|begin_of_transcription|>",
312
+ "<|end_of_transcription|>",
313
+ "<|code_prefix|>",
314
+ "<|code_middle|>",
315
+ "<|code_suffix|>",
316
+ "/nothink"
317
+ ],
318
+ "clean_up_tokenization_spaces": false,
319
+ "do_lower_case": false,
320
+ "eos_token": "<|endoftext|>",
321
+ "extra_special_tokens": {},
322
+ "model_max_length": 128000,
323
+ "pad_token": "<|endoftext|>",
324
+ "padding_side": "left",
325
+ "remove_space": false,
326
+ "tokenizer_class": "PreTrainedTokenizer"
327
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "size": {"shortest_edge": 12544, "longest_edge": 47040000},
3
+ "do_rescale": true,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [0.48145466, 0.4578275, 0.40821073],
8
+ "image_std": [0.26862954, 0.26130258, 0.27577711],
9
+ "video_processor_type": "Glm4vVideoProcessor",
10
+ "processor_class": "Glm4vProcessor"
11
+ }