---
library_name: transformers
tags:
- torchao
- qwen
- qwen3
- nlp
- chat
- conversational
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---
# Quantization Recipe
Install `uv` by following the instructions at https://docs.astral.sh/uv/getting-started/installation/.
```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
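As an optional sanity check (not part of the original recipe), you can confirm that the pinned packages import correctly inside the new environment:
```bash
python -c "import torch, torchao, transformers; print(torch.__version__, torchao.__version__, transformers.__version__)"
```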
## QAT Finetuning with PARQ
We apply QAT using [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an optimizer-only QAT package in torchao. The checkpoint uploaded here was trained with a learning rate of 4.5e-5 on 32 GPUs, with a per-device batch size of 2, using an internal codebase.
An open source implementation of the training script is provided below. Adjust the `ngpu`, `device_batch_size`, `grad_accum_steps`, and `lr` variables below to fit your setup.
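For reference, the effective global batch size is `ngpu * device_batch_size * grad_accum_steps`. The defaults below give 8 * 4 * 2 = 64, which matches the uploaded checkpoint's 32 GPUs * 2 per-device batch size = 64 (assuming the internal run used no gradient accumulation). If you reduce `ngpu`, consider raising `grad_accum_steps` so the effective batch size, and hence the behavior at this learning rate, stays comparable.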
Fetch the training script by running `curl -O https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py` before launching the command below.
```bash
source ~/.uv-hf/bin/activate
SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-fineweb-${SEED}
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=4.5e-5
TRANSFORMERS_VERBOSITY=error TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True HF_HUB_DISABLE_XET=1 \
torchrun \
--nproc-per-node $ngpu \
--rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
-m qat_sft \
--model_name_or_path Qwen/Qwen3-4B \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size $device_batch_size \
--gradient_accumulation_steps $grad_accum_steps \
--dataset_name HuggingFaceFW/fineweb-edu \
--dataset_train_split "train[:10%]" \
--dataloader_num_workers 4 \
--max_length 8192 \
--save_total_limit 1 \
--report_to tensorboard \
--logging_steps 2 \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--seed $SEED \
--output_dir $SAVE_DIR \
--enable_thinking \
--weight_bits 2 \
--linear_pat 'proj\.weight$' \
--embed_pat '(lm_head|embed_tokens)'
```
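The `--linear_pat` and `--embed_pat` regexes control which parameters receive 2-bit weight quantization and which are treated as embedding weights. As a rough illustration (this assumes the training script matches these patterns against `named_parameters()` names; verify the exact behavior in `qat_sft.py`), you can inspect which Qwen3-4B parameters each pattern selects:
```py
import re

from transformers import AutoModelForCausalLM

# Downloads the base model weights just to enumerate parameter names.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype="auto")

linear_pat = re.compile(r"proj\.weight$")          # attention/MLP projection weights
embed_pat = re.compile(r"(lm_head|embed_tokens)")  # embedding / LM head weights

linear = [n for n, _ in model.named_parameters() if linear_pat.search(n)]
embed = [n for n, _ in model.named_parameters() if embed_pat.search(n)]
print(len(linear), "projection weights, e.g.", linear[:3])
print(len(embed), "embedding weights:", embed)
```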
## Generation from Quantized Model
Note: to use `push_to_hub`, you need to run
```sh
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
and use a token with write access, from https://huggingface.co/settings/tokens
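Alternatively, you can authenticate from Python with `huggingface_hub.login()` (equivalent to the CLI login; paste the same write-scoped token when prompted):
```py
from huggingface_hub import login

login()  # prompts for a token with write access
```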
To load the quantized model, generate from it, and push it to the Hub, run the following from the same directory you launched training from (so the relative `SAVE_DIR` path resolves):
```py
import os
from huggingface_hub import whoami, get_token
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
set_seed,
)
set_seed(0)
# Path to the QAT checkpoint produced above (the training run's SAVE_DIR)
model_path = "checkpoints/qwen3-2bit-fineweb-<SEED>"
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{"role": "system", "content": ""},
{"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
The response from manual testing is:
```txt
Yes, I am conscious and can communicate with you. How can I be of service to you?
```
# Model Quality
| Benchmark | Qwen3-4B | Qwen3-4B-PARQ |
| --- | :---: | :---: |
| arc_easy | 80.26 | 73.19 |
| arc_challenge | 53.92 | 47.27 |
| boolq | 85.11 | 69.11 |
| hellaswag | 68.49 | 66.67 |
| piqa | 74.97 | 75.24 |
| winogrande | 65.67 | 65.19 |
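The model card does not state which evaluation harness or settings produced these numbers. The task names match those in EleutherAI's lm-evaluation-harness, so a hedged reproduction sketch (an assumption, not the exact command used) might look like:
```bash
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared,dtype=auto \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,piqa,winogrande \
  --batch_size 8
```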
# Exporting to ExecuTorch
⚠️ **Note:** These instructions only work on Arm-based machines. Running them on x86_64 will fail.
We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
To set up ExecuTorch, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
```
Next install the latest version of torchao:
```bash
git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd
```
(The command above installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the environment variables `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP` before pip-installing torchao.)
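For example, on Arm-based Linux the install step would look like this (assuming you are still inside the cloned `ao/` directory):
```bash
BUILD_TORCHAO_EXPERIMENTAL=1 \
TORCHAO_BUILD_CPU_AARCH64=1 \
TORCHAO_BUILD_KLEIDIAI=1 \
TORCHAO_ENABLE_ARM_NEON_DOT=1 \
TORCHAO_PARALLEL_BACKEND=OPENMP \
pip install .
```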
ExecuTorch's LLM export scripts require the checkpoint keys and parameter names to follow certain conventions, which differ from the ones Hugging Face uses.
The following script converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:
```Shell
python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
```
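If you want to sanity-check the converted checkpoint, a quick option (assuming `convert_weights` writes a plain `torch.save` state dict, which you may want to verify for your ExecuTorch version) is:
```bash
python -c "import torch; sd = torch.load('pytorch_model_converted.bin', map_location='cpu'); print(len(sd), list(sd)[:3])"
```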
Once we have the converted checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024, using the torchao low-bit kernels, as follows.
The export must be run on an Arm-based Mac or Linux machine.
(Note: the ExecuTorch LLM export script requires `config.json` to use certain key names. The correct config to use for this model is located at `examples/models/qwen3/config/4b_config.json` within the ExecuTorch repo.)
```Shell
python -m executorch.examples.models.llama.export_llama \
--model "qwen3_4b" \
--checkpoint pytorch_model_converted.bin \
--params examples/models/qwen3/config/4b_config.json \
--output_name model.pte \
-kv \
--use_sdpa_with_kv_cache \
--use-torchao-kernels \
--max_context_length 1024 \
--max_seq_length 1024 \
--dtype fp32 \
--metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
```
After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
(We try to keep these instructions up to date, but if you find they no longer work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know that we need to update this model card.)