# 🧮 Qwen3-1.7B-GRPO-Countdown
Qwen3-1.7B-GRPO-Countdown is a fine-tuned version of Qwen/Qwen3-1.7B using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning and step-by-step accuracy in solving Countdown-style arithmetic problems.
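GRPO works by sampling a group of completions for each prompt and normalizing their rewards within that group, so no separate value network (critic) is needed. A minimal sketch of the group-relative advantage computation, for context only, not the actual training code:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Core idea of GRPO: rewards for a group of completions sampled
    from the same prompt are normalized within the group, replacing
    the learned critic used in classic PPO."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 4 completions for one prompt, rewarded 1.0 / 0.1 / 0.1 / 0.0
advantages = group_relative_advantages(torch.tensor([1.0, 0.1, 0.1, 0.0]))
```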
## Source Code
My source code for fine-tuning and evaluation can be found at github.com/Tuprott991/GRPO_LLM. Note that I used flash_attention_2 on an A100 40GB GPU for fine-tuning.
## 🧠 Model Overview
| Property | Description |
|---|---|
| Base Model | Qwen3-1.7B |
| Fine-tuning Method | GRPO (Reinforcement Learning) |
| Task | Countdown Math Problem Solving |
| Dataset | justinphan3110/Countdown-Tasks-3to4 |
| Language | English |
| Objective | Improve reasoning chain and final accuracy on multi-step arithmetic tasks |
## 🏋️ Training Details
- Algorithm: GRPO (Group Relative Policy Optimization)
- Reward Function (see the sketch after this list):
  - +1.0: Perfect - the equation is valid, uses exactly the given numbers, and equals the target
  - +0.1: Partial - an answer tag is present but the equation is incorrect
  - 0.0: Failed - no answer tag found
- Environment: Simulated Countdown math environment
- Batch Size: 16
- Learning Rate: 1e-6
- Training Steps: 180
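A minimal sketch of the tiered reward described above, assuming the model wraps its final equation in an `<answer>...</answer>` tag; the tag format and helper name are illustrative, not taken verbatim from the training code:

```python
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    # 0.0: no answer tag found at all
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Only digits, whitespace, and basic arithmetic are allowed
    if not re.fullmatch(r"[\d\s+\-*/().]+", equation):
        return 0.1
    # The equation must use exactly the provided numbers, each once
    if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(numbers):
        return 0.1
    try:
        # +1.0: valid equation that evaluates to the target
        if abs(eval(equation) - target) < 1e-6:
            return 1.0
    except (SyntaxError, ZeroDivisionError):
        pass
    return 0.1  # answer tag present, but the equation is wrong

# Example: returns 1.0 (valid equation, correct numbers, equals target)
assert countdown_reward("<answer>4*8</answer>", [4, 8], 32) == 1.0
```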
## 📊 Evaluation
| Metric | Score |
|---|---|
| Accuracy | 54% |
## 💡 Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Vantuk/Qwen3-1.7B-Countdown"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ask the model to reach a target using the given numbers
prompt = "Use 3, 4, 7, 8, 25, and 50 to make 952."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
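Since training rewards equations wrapped in an answer tag, you can check a generation programmatically. A minimal sketch, assuming the `<answer>...</answer>` tag convention (an assumption here; see the repo above for the exact format):

```python
import re

# Extract and check the final equation from the decoded output
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
if match:
    equation = match.group(1).strip()
    # Validate before eval(): only digits and arithmetic operators allowed
    if re.fullmatch(r"[\d\s+\-*/().]+", equation):
        print(equation, "=", eval(equation))
```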