# 🧮 Qwen3-1.7B-GRPO-Countdown
Qwen3-1.7B-GRPO-Countdown is a fine-tuned version of Qwen/Qwen3-1.7B using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning and step-by-step accuracy in solving Countdown-style arithmetic problems.
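GRPO works by sampling a group of completions for each prompt and normalizing their rewards within that group, so no separate value network (critic) is needed. A minimal sketch of the group-relative advantage computation, for context only, not the actual training code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Core idea of GRPO: rewards for a group of completions sampled
    from the same prompt are normalized within the group, replacing
    the learned critic used in classic PPO."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 completions for one prompt, rewarded 1.0 / 0.1 / 0.1 / 0.0
advantages = group_relative_advantages(torch.tensor([1.0, 0.1, 0.1, 0.0]))
```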
## Source Code
My source code for fine-tuning and evaluation can be found at github.com/Tuprott991/GRPO_LLM. Note that I used flash_attention_2 on an A100 40GB GPU for fine-tuning.
## 🧠 Model Overview
| Property | Description |
|---|---|
| Base Model | Qwen3-1.7B |
| Fine-tuning Method | GRPO (Reinforcement Learning) |
| Task | Countdown Math Problem Solving |
| Dataset | justinphan3110/Countdown-Tasks-3to4 |
| Language | English |
| Objective | Improve reasoning chain and final accuracy on multi-step arithmetic tasks |
## 🏋️ Training Details
- Algorithm: GRPO (Group Relative Policy Optimization)
- Reward Function (see the sketch after this list):
  - +1.0: Perfect - the equation is valid, uses exactly the given numbers, and equals the target
  - +0.1: Partial - an answer tag is present but the equation is incorrect
  - 0.0: Failed - no answer tag found
- Environment: Simulated Countdown math environment
- Batch Size: 16
- Learning Rate: 1e-6
- Training Steps: 180
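A minimal sketch of the tiered reward described above, assuming the model wraps its final equation in an `<answer>...</answer>` tag; the tag format and helper name are illustrative, not taken verbatim from the training code:

```python
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    # 0.0: no answer tag found at all
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Only digits, whitespace, and basic arithmetic are allowed
    if not re.fullmatch(r"[\d\s+\-*/().]+", equation):
        return 0.1
    # The equation must use exactly the provided numbers, each once
    if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(numbers):
        return 0.1
    try:
        # +1.0: valid equation that evaluates to the target
        if abs(eval(equation) - target) < 1e-6:
            return 1.0
    except (SyntaxError, ZeroDivisionError):
        pass
    return 0.1  # answer tag present, but the equation is wrong

# Example: returns 1.0 (valid equation, correct numbers, equals target)
assert countdown_reward("<answer>4*8</answer>", [4, 8], 32) == 1.0
```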
## 📊 Evaluation
| Metric | Score |
|---|---|
| Accuracy | 54% |
## 💡 Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Vantuk/Qwen3-1.7B-Countdown"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ask the model to reach a target using the given numbers
prompt = "Use 3, 4, 7, 8, 25, and 50 to make 952."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
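Since training rewards equations wrapped in an answer tag, you can check a generation programmatically. A minimal sketch, assuming the `<answer>...</answer>` tag convention (an assumption here; see the repo above for the exact format):

```python
import re

# Extract and check the final equation from the decoded output
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
if match:
    equation = match.group(1).strip()
    # Validate before eval(): only digits and arithmetic operators allowed
    if re.fullmatch(r"[\d\s+\-*/().]+", equation):
        print(equation, "=", eval(equation))
```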