🧮 Qwen3-1.7B-GRPO-Countdown

Qwen3-1.7B-GRPO-Countdown is a fine-tuned version of Qwen/Qwen3-1.7B using GRPO (Group Relative Policy Optimization) to improve mathematical reasoning and step-by-step accuracy in solving Countdown-style arithmetic problems.

Source Code

My source code for fine-tuning and evaluation can be found at github.com/Tuprott991/GRPO_LLM. Note that I used flash_attention_2 and an A100 40GB GPU for fine-tuning.

🧠 Model Overview

| Property | Description |
|---|---|
| Base Model | Qwen3-1.7B |
| Fine-tuning Method | GRPO (Reinforcement Learning) |
| Task | Countdown Math Problem Solving |
| Dataset | justinphan3110/Countdown-Tasks-3to4 |
| Language | English |
| Objective | Improve reasoning chain and final accuracy on multi-step arithmetic tasks |
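For context, a Countdown problem asks for an arithmetic expression that combines the given numbers (each used once) to reach the target. The sketch below is an illustrative brute-force reference solver, not part of the training code; it only searches left-to-right combinations, so it misses some bracketings, but it conveys the task:

```python
from itertools import permutations, product

def solve_countdown(numbers, target):
    """Brute-force search over number orderings and operator choices.

    Illustrative only: expressions are evaluated left to right, which
    covers a subset of all valid bracketings. Division is restricted to
    exact integer division, as in standard Countdown rules.
    """
    ops = {
        '+': lambda a, b: a + b,
        '-': lambda a, b: a - b,
        '*': lambda a, b: a * b,
        '/': lambda a, b: a // b if b and a % b == 0 else None,
    }
    for perm in permutations(numbers):
        for opset in product(ops, repeat=len(numbers) - 1):
            acc = perm[0]
            expr = str(perm[0])
            for op, n in zip(opset, perm[1:]):
                acc = ops[op](acc, n)
                if acc is None:  # non-exact division: abandon this branch
                    break
                expr = f"({expr} {op} {n})"
            if acc == target:
                return expr
    return None  # no left-to-right solution found
```

For example, `solve_countdown([3, 4, 25], 88)` finds an expression such as `((25 - 3) * 4)`.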

๐Ÿ—๏ธ Training Details

  • Algorithm: GRPO (Group Relative Policy Optimization)
  • Reward Function:
    • +1.0: Perfect - the equation is valid, uses exactly the given numbers, and equals the target
    • +0.1: Partial - the response contains an answer tag, but the equation is incorrect
    • 0.0: Failed - no answer tag found
  • Environment: Simulated Countdown math environment
  • Batch Size: 16
  • Learning Rate: 1e-6
  • Training Steps: 180
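The three-tier reward above can be sketched as follows. The `<answer>...</answer>` tag format is an assumption for illustration; the card does not specify the exact tag used during training:

```python
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    """Sketch of the three-tier Countdown reward.

    Assumes the model wraps its equation in a hypothetical
    <answer>...</answer> tag; adjust the pattern to the actual format.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # Failed: no answer tag found
    equation = match.group(1).strip()
    # Restrict to digits, whitespace, and arithmetic operators before eval
    if not re.fullmatch(r"[\d\s+\-*/().]+", equation):
        return 0.1  # Partial: tag present, but not a clean equation
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.1  # Partial: wrong numbers used
    try:
        if eval(equation) == target:  # charset was restricted above
            return 1.0  # Perfect
    except (SyntaxError, ZeroDivisionError):
        pass
    return 0.1  # Partial: valid tag, equation does not reach the target
```

During GRPO training, this scalar reward is computed per sampled completion and advantages are normalized within each group of samples for the same prompt.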

📊 Evaluation

| Metric | Score |
|---|---|
| Accuracy | 54% |

💡 Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Vantuk/Qwen3-1.7B-Countdown"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Use 3, 4, 7, 8, 25, and 50 to make 952."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
