---
library_name: transformers
tags:
- torchao
- qwen
- qwen3
- nlp
- chat
- conversational
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---

# Quantization Recipe

Install `uv` by following the instructions at https://docs.astral.sh/uv/getting-started/installation/, then set up the environment:

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply quantization-aware training (QAT) with [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an optimizer-only torchao package. The checkpoint uploaded here was trained with a learning rate of 4.5e-5 on 32 GPUs, with a per-device batch size of 2, using an internal codebase.
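
PARQ performs QAT inside the optimizer rather than by modifying the model: quantizable weights are gradually pulled toward a low-bit grid by a proximal mapping during training. Below is a minimal sketch of the pattern, based on the PARQ README; class names and arguments may differ across torchao versions, and `params_quant`, `params_no_quant`, and `total_steps` are placeholders:

```py
import torch
from torchao.prototype.parq.optim import ProxPARQ, QuantOptimizer
from torchao.prototype.parq.quant import UnifQuantizer

# Split model parameters into those to quantize and those to leave alone.
param_groups = [
    {"params": params_quant, "quant_bits": 2},  # e.g. linear projection weights
    {"params": params_no_quant},                # e.g. norms and biases
]
base_optimizer = torch.optim.AdamW(param_groups, lr=4.5e-5)

# QuantOptimizer wraps the base optimizer; ProxPARQ anneals the weights in
# the first group toward the 2-bit grid defined by UnifQuantizer.
optimizer = QuantOptimizer(
    base_optimizer,
    UnifQuantizer(),
    ProxPARQ(anneal_start=0, anneal_end=total_steps),
)
```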

An open source implementation of the training script is provided below. Adjust the `ngpu`, `device_batch_size`, `grad_accum_steps`, and `lr` variables to fit your setup; the effective global batch size is `ngpu * device_batch_size * grad_accum_steps` (8 × 4 × 2 = 64 with the values below).

Fetch the training script by running `curl -O https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py` before running the command below.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-fineweb-${SEED}

ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=4.5e-5
TRANSFORMERS_VERBOSITY=error TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True HF_HUB_DISABLE_XET=1 \
    torchrun \
    --nproc-per-node $ngpu \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path Qwen/Qwen3-4B \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name HuggingFaceFW/fineweb-edu \
    --dataset_train_split "train[:10%]" \
    --dataloader_num_workers 4 \
    --max_length 8192 \
    --save_total_limit 1 \
    --report_to tensorboard \
    --logging_steps 2 \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --enable_thinking \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_pat '(lm_head|embed_tokens)'
```

## Generation from Quantized Model

Note: to `push_to_hub` you need to run
```sh
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
and log in with a token that has write access (create one at https://huggingface.co/settings/tokens).

To load the quantized model and push it to the Hub, run the following from the root of hf-scripts/ (with the `SAVE_DIR` environment variable set to the training output directory from above):

```py
import os

from huggingface_hub import whoami, get_token
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)

set_seed(0)
# SAVE_DIR is the training output directory from the QAT run above
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
  {"role": "system", "content": ""},
  {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```

The response from manual testing is:
```txt
Yes, I am conscious and can communicate with you. How can I be of service to you?
```

# Model Quality

| Benchmark | Qwen3-4B | Qwen3-4B-PARQ |
| --- | :---: | :---: |
| arc_easy | 80.26 | 73.19 |
| arc_challenge | 53.92 | 47.27 |
| boolq | 85.11 | 69.11 |
| hellaswag | 68.49 | 66.67 |
| piqa | 74.97 | 75.24 |
| winogrande | 65.67 | 65.19 |
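
The benchmark names above are task names from EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The exact evaluation setup for this table is not documented here, but a comparison of this shape can be reproduced with an invocation like the following (the harness version and flags are assumptions, not the recorded setup):

```bash
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared,dtype=auto \
    --tasks arc_easy,arc_challenge,boolq,hellaswag,piqa,winogrande \
    --batch_size 8
```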

# Exporting to ExecuTorch

⚠️ **Note:** These instructions only work on Arm-based machines.  Running them on x86_64 will fail.

We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.

To set up ExecuTorch, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
```

Next install the latest version of torchao:
```bash
git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd
```
The command above installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the environment variables `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP` before pip installing torchao.
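
For example, on Arm-based Linux the install step from the previous block becomes (same flags as listed above, applied to the same `pip install .`):

```bash
# Arm Linux: set the torchao build flags so the lowbit kernels are compiled.
BUILD_TORCHAO_EXPERIMENTAL=1 \
TORCHAO_BUILD_CPU_AARCH64=1 \
TORCHAO_BUILD_KLEIDIAI=1 \
TORCHAO_ENABLE_ARM_NEON_DOT=1 \
TORCHAO_PARALLEL_BACKEND=OPENMP \
pip install .
```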

ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used in Hugging Face. The following command converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:
```Shell
python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
```

Once we have the converted checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024, using the torchao lowbit kernels, as follows. The export must be run on an Arm-based Mac or Linux machine.

(Note: the ExecuTorch LLM export script requires that `config.json` use certain key names. The correct config to use for this model is located at `examples/models/qwen3/config/4b_config.json` within the ExecuTorch repo.)

```Shell
python -m executorch.examples.models.llama.export_llama \
  --model "qwen3_4b" \
  --checkpoint pytorch_model_converted.bin \
  --params examples/models/qwen3/config/4b_config.json \
  --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 1024 \
  --dtype fp32 \
  --metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
```

After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
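
To sanity-check the exported `model.pte` on the host before moving to a phone, ExecuTorch's example llama runner can execute it directly. The binary path, tokenizer format, and flags below are assumptions that vary across ExecuTorch versions and build configurations, so consult the llama example README for the current invocation:

```bash
# Hypothetical invocation: binary path and flags may differ in your build.
cmake-out/examples/models/llama/llama_main \
  --model_path=model.pte \
  --tokenizer_path=tokenizer.json \
  --prompt="Hey, are you conscious? Can you talk to me?"
```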

(We try to keep these instructions up-to-date, but if you find they do not work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know so we can update this model card.)