---
library_name: transformers
tags:
- torchao
- qwen
- qwen3
- nlp
- chat
- conversational
language:
- en
base_model:
- Qwen/Qwen3-4B
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
---
# Quantization Recipe
Install `uv` by following the instructions at https://docs.astral.sh/uv/getting-started/installation/.
```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
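As an optional sanity check (not part of the original recipe), you can confirm that the pinned packages import correctly inside the new environment:
```bash
python -c "import torch, torchao, transformers; print(torch.__version__, torchao.__version__, transformers.__version__)"
```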
## QAT Finetuning with PARQ
We apply QAT using [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an optimizer-only QAT package in torchao. The checkpoint uploaded here was trained with a learning rate of 4.5e-5 on 32 GPUs, with a per-device batch size of 2, using an internal codebase.
An open source implementation of the training script is provided below. Adjust the `ngpu`, `device_batch_size`, `grad_accum_steps`, and `lr` variables below to fit your setup.
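For reference, the effective global batch size is `ngpu * device_batch_size * grad_accum_steps`. The defaults below give 8 * 4 * 2 = 64, which matches the uploaded checkpoint's 32 GPUs * 2 per-device batch size = 64 (assuming the internal run used no gradient accumulation). If you reduce `ngpu`, consider raising `grad_accum_steps` so the effective batch size, and hence the behavior at this learning rate, stays comparable.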
Fetch the training script by running `curl -O https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py` before launching the command below.
```bash
source ~/.uv-hf/bin/activate
SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-fineweb-${SEED}
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=4.5e-5
TRANSFORMERS_VERBOSITY=error TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True HF_HUB_DISABLE_XET=1 \
torchrun \
--nproc-per-node $ngpu \
--rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
-m qat_sft \
--model_name_or_path Qwen/Qwen3-4B \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size $device_batch_size \
--gradient_accumulation_steps $grad_accum_steps \
--dataset_name HuggingFaceFW/fineweb-edu \
--dataset_train_split "train[:10%]" \
--dataloader_num_workers 4 \
--max_length 8192 \
--save_total_limit 1 \
--report_to tensorboard \
--logging_steps 2 \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--seed $SEED \
--output_dir $SAVE_DIR \
--enable_thinking \
--weight_bits 2 \
--linear_pat 'proj\.weight$' \
--embed_pat '(lm_head|embed_tokens)'
```
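The `--linear_pat` and `--embed_pat` regexes control which parameters receive 2-bit weight quantization and which are treated as embedding weights. As a rough illustration (this assumes the training script matches these patterns against `named_parameters()` names; verify the exact behavior in `qat_sft.py`), you can inspect which Qwen3-4B parameters each pattern selects:
```py
import re

from transformers import AutoModelForCausalLM

# Downloads the base model weights just to enumerate parameter names.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype="auto")

linear_pat = re.compile(r"proj\.weight$")          # attention/MLP projection weights
embed_pat = re.compile(r"(lm_head|embed_tokens)")  # embedding / LM head weights

linear = [n for n, _ in model.named_parameters() if linear_pat.search(n)]
embed = [n for n, _ in model.named_parameters() if embed_pat.search(n)]
print(len(linear), "projection weights, e.g.", linear[:3])
print(len(embed), "embedding weights:", embed)
```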
## Generation from Quantized Model
Note: to use `push_to_hub`, you need to run
```sh
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
and use a token with write access, from https://huggingface.co/settings/tokens
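Alternatively, you can authenticate from Python with `huggingface_hub.login()` (equivalent to the CLI login; paste the same write-scoped token when prompted):
```py
from huggingface_hub import login

login()  # prompts for a token with write access
```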
To load the quantized model, generate from it, and push it to the Hub, run the following from the same directory you launched training from (so the relative `SAVE_DIR` path resolves):
```py
import os
from huggingface_hub import whoami, get_token
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
set_seed,
)
set_seed(0)
# Path to the QAT checkpoint produced above (the training run's SAVE_DIR)
model_path = "checkpoints/qwen3-2bit-fineweb-<SEED>"
model = AutoModelForCausalLM.from_pretrained(
model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{"role": "system", "content": ""},
{"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
The response from manual testing is:
```txt
Yes, I am conscious and can communicate with you. How can I be of service to you?
```
# Model Quality
| Benchmark | Qwen3-4B | Qwen3-4B-PARQ |
| --- | :---: | :---: |
| arc_easy | 80.26 | 73.19 |
| arc_challenge | 53.92 | 47.27 |
| boolq | 85.11 | 69.11 |
| hellaswag | 68.49 | 66.67 |
| piqa | 74.97 | 75.24 |
| winogrande | 65.67 | 65.19 |
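The model card does not state which evaluation harness or settings produced these numbers. The task names match those in EleutherAI's lm-evaluation-harness, so a hedged reproduction sketch (an assumption, not the exact command used) might look like:
```bash
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared,dtype=auto \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,piqa,winogrande \
  --batch_size 8
```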
# Exporting to ExecuTorch
⚠️ **Note:** These instructions only work on Arm-based machines. Running them on x86_64 will fail.
We can run the quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch).
Once ExecuTorch is [set up](https://pytorch.org/executorch/main/getting-started.html), exporting and running the model on device is a breeze.
To set up ExecuTorch, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
```
Next install the latest version of torchao:
```bash
git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd
```
(The command above installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the environment variables `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP` before pip-installing torchao.)
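For example, on Arm-based Linux the install step would look like this (assuming you are still inside the cloned `ao/` directory):
```bash
BUILD_TORCHAO_EXPERIMENTAL=1 \
TORCHAO_BUILD_CPU_AARCH64=1 \
TORCHAO_BUILD_KLEIDIAI=1 \
TORCHAO_ENABLE_ARM_NEON_DOT=1 \
TORCHAO_PARALLEL_BACKEND=OPENMP \
pip install .
```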
ExecuTorch's LLM export scripts require the checkpoint keys and parameter names to follow certain conventions, which differ from the ones Hugging Face uses.
The following script converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:
```Shell
python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
```
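If you want to sanity-check the converted checkpoint, a quick option (assuming `convert_weights` writes a plain `torch.save` state dict, which you may want to verify for your ExecuTorch version) is:
```bash
python -c "import torch; sd = torch.load('pytorch_model_converted.bin', map_location='cpu'); print(len(sd), list(sd)[:3])"
```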
Once we have the converted checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024, using the torchao low-bit kernels, as follows.
The export must be run on an Arm-based Mac or Linux machine.
(Note: the ExecuTorch LLM export script requires `config.json` to use certain key names. The correct config to use for this model is located at `examples/models/qwen3/config/4b_config.json` within the ExecuTorch repo.)
```Shell
python -m executorch.examples.models.llama.export_llama \
--model "qwen3_4b" \
--checkpoint pytorch_model_converted.bin \
--params examples/models/qwen3/config/4b_config.json \
--output_name model.pte \
-kv \
--use_sdpa_with_kv_cache \
--use-torchao-kernels \
--max_context_length 1024 \
--max_seq_length 1024 \
--dtype fp32 \
--metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
```
After that you can run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
(We try to keep these instructions up to date, but if you find they no longer work, check out our [CI test in ExecuTorch](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_torchao_huggingface_checkpoints.sh) for the latest source of truth, and let us know that we need to update this model card.)