How to use rtj1/Qwen2.5-0.5B-AWQ-FP8-Block with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
model = AutoModelForCausalLM.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

How to use rtj1/Qwen2.5-0.5B-AWQ-FP8-Block with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

# Alternatively, run the model with Docker:
docker model run hf.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Block

How to use rtj1/Qwen2.5-0.5B-AWQ-FP8-Block with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

# Alternatively, start the SGLang server with Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

How to use rtj1/Qwen2.5-0.5B-AWQ-FP8-Block with Docker Model Runner:
docker model run hf.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Block
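
The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes a server started as shown above is listening on the given base URL (port 8000 for vLLM, port 30000 for SGLang) and returns a response in the usual OpenAI-compatible shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build the same POST request that the curl examples send."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes the server answers with a `choices[0].message.content` field.
    """
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a vLLM server running locally, `chat("http://localhost:8000", "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block", "What is the capital of France?")` would send the same request as the curl example; for SGLang, swap in `http://localhost:30000`.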
This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct, produced with the AWQ + FP8_BLOCK quantization scheme.
Evaluated on the GSM8K benchmark:
| Metric | Score |
|---|---|
| Strict Match | 17.97% |
| Flexible Extract | 29.80% |
For better accuracy, use the FP8_DYNAMIC variant, which achieves 22.67% strict match on GSM8K.
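
To put these numbers side by side, the short sketch below works out the gaps between them. The scores are copied from the table and sentence above; the variable names are only illustrative.

```python
# GSM8K scores in percent, copied from the model card text above.
fp8_block_strict = 17.97
fp8_block_flexible = 29.80
fp8_dynamic_strict = 22.67

# Absolute gap between the two variants, in percentage points.
abs_gap = fp8_dynamic_strict - fp8_block_strict

# The same gap relative to the FP8_BLOCK strict-match score.
rel_gap = abs_gap / fp8_block_strict * 100

# How much flexible extraction recovers over strict matching for FP8_BLOCK.
extract_gap = fp8_block_flexible - fp8_block_strict

print(f"FP8_DYNAMIC vs FP8_BLOCK (strict): +{abs_gap:.2f} points ({rel_gap:.1f}% relative)")
print(f"Flexible vs strict (FP8_BLOCK):    +{extract_gap:.2f} points")
```

This gives a 4.70-point (about 26.2% relative) strict-match advantage for FP8_DYNAMIC, and an 11.83-point gap between flexible extraction and strict match for this model.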
Usage with Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model, placing it on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Usage with vLLM:
from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
# temperature=0.0 selects greedy decoding
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Created using llm-compressor with the FP8_BLOCK scheme:
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-block",
)
Quantization time: ~4 minutes on an L4 GPU
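
To illustrate what block-wise FP8 scaling means, here is a minimal NumPy sketch that computes one scale per weight block. It is not llm-compressor's implementation: the 128x128 block size, the E4M3 target format, and the helper names `block_scales` and `scale_to_fp8_range` are assumptions made for illustration, and the sketch only rescales values into the FP8 range without rounding them to FP8.

```python
import numpy as np

# Largest finite magnitude in the FP8 E4M3 format (assumed target format).
FP8_E4M3_MAX = 448.0


def block_scales(weight, block=(128, 128)):
    """Return one scale per block so each block's max maps to FP8_E4M3_MAX.

    The block size is an assumption for this sketch.
    """
    rows, cols = weight.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "shape must divide into blocks"
    # View the matrix as a grid of (br x bc) tiles.
    tiles = weight.reshape(rows // br, br, cols // bc, bc)
    absmax = np.abs(tiles).max(axis=(1, 3))
    # Guard against all-zero blocks to avoid division by zero later.
    return np.maximum(absmax, 1e-12) / FP8_E4M3_MAX


def scale_to_fp8_range(weight, scales, block=(128, 128)):
    """Divide each block by its scale (no FP8 rounding is simulated)."""
    br, bc = block
    expanded = np.repeat(np.repeat(scales, br, axis=0), bc, axis=1)
    return weight / expanded


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 384)).astype(np.float32)
scales = block_scales(w)
scaled = scale_to_fp8_range(w, scales)
print(scales.shape)  # one scale per 128x128 block
```

Because every block has its own scale, an outlier in one block does not reduce the usable precision of the others, which is the usual motivation for block-wise over per-tensor scaling.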
Benchmarked using lm-evaluation-harness:
lm_eval \
    --model hf \
    --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Block,dtype=auto \
    --tasks gsm8k \
    --batch_size 16
Evaluation time: ~82 minutes on an L4 GPU
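
The gap between the strict-match and flexible-extract scores comes from how the final answer is pulled out of the generated text. The sketch below is a simplified approximation of that idea, not the harness's own code: the function names and regular expressions are illustrative assumptions. Strict matching only accepts an answer written after a `####` marker (the GSM8K reference format), while flexible extraction falls back to the last number found anywhere in the output.

```python
import re


def strict_answer(text):
    """Return the number following a '#### ' marker, or None if absent."""
    match = re.search(r"#### (-?[0-9.,]+)", text)
    return match.group(1).replace(",", "") if match else None


def flexible_answer(text):
    """Return the last number that appears anywhere in the text, or None."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None


formatted = "25 * 4 = 100\n#### 100"
unformatted = "25 * 4 = 100. The answer is 100"

print(strict_answer(formatted), flexible_answer(formatted))
print(strict_answer(unformatted), flexible_answer(unformatted))
```

A correct answer that is not written in the expected format counts under flexible extraction but not under strict match, which is consistent with the flexible score being the higher of the two.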
Citation:
@misc{qwen2.5-awq-fp8-block,
  author = {Tharun Jagarlamudi},
  title = {Qwen2.5-0.5B-Instruct AWQ + FP8_BLOCK},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Block}
}
License: same as the base model (Qwen License).