Qwen2.5-1.5B-Instruct Fine-tuned for Mathematical Reasoning

A fine-tuned version of Qwen2.5-1.5B-Instruct trained to solve mathematical word problems with explicit step-by-step reasoning chains.

Model Details

Model Description

This model is a QLoRA fine-tuned version of Qwen2.5-1.5B-Instruct, specifically trained to solve mathematical word problems from the GSM8K dataset. The model learns to break down complex problems into numbered reasoning steps, show intermediate calculations, and provide clear final answers.

The fine-tuning uses synthetic data generated by prompting the base model to produce detailed reasoning chains, then training on these structured examples to reinforce both mathematical accuracy and explanation quality.

  • Developed by: Nishitha
  • Model type: Causal Language Model (Fine-tuned with QLoRA)
  • Language: English
  • License: Same as base model (Qwen2.5-1.5B-Instruct)
  • Finetuned from model: Qwen/Qwen2.5-1.5B-Instruct
  • Fine-tuning method: QLoRA (4-bit quantization + LoRA adapters)

Model Sources

Uses

Direct Use

This model is designed to solve grade school math word problems with step-by-step explanations. It excels at:

  • Breaking down complex math problems into manageable steps
  • Showing intermediate calculations and reasoning
  • Providing structured, educational responses
  • Teaching mathematical problem-solving approaches

Example usage:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "your-username/model-name")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

prompt = "Question: Janet has 5 apples. She buys 3 more. How many does she have now?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=512)
print(tokenizer.decode(outputs[0]))

Downstream Use

Potential applications include:

  • Educational tutoring systems requiring step-by-step explanations
  • Math homework assistance tools
  • Reasoning capability enhancement for small language models
  • Foundation for further fine-tuning on domain-specific math problems

Out-of-Scope Use

This model is NOT suitable for:

  • Advanced mathematics (calculus, linear algebra, etc.) - trained only on grade school math
  • High-stakes decision making or professional calculations
  • Problems requiring external tools, calculators, or symbolic computation
  • Non-mathematical reasoning tasks

Bias, Risks, and Limitations

Known Limitations:

  • 70% accuracy rate - the model makes errors on 30% of test problems
  • Occasional arithmetic mistakes in multi-step calculations
  • Training data generated by same-sized base model, limiting maximum achievable accuracy
  • Small model size (1.5B parameters) constrains mathematical reasoning capability
  • May confidently present incorrect answers with plausible-looking reasoning steps

Risks:

  • Users may trust incorrect mathematical solutions if they appear well-reasoned
  • Not suitable for any application where calculation accuracy is critical
  • May inherit biases from the GSM8K dataset and base model

Recommendations

  • Always verify answers for important calculations
  • Use as an educational aid, not a calculator replacement
  • Best suited for learning and demonstration rather than production applications
  • Consider ensemble methods or verification steps for critical use cases
  • Be aware that structured reasoning doesn't guarantee correctness

Training Details

Training Data

Dataset: GSM8K (Grade School Math 8K)

Synthetic Data Generation Process:

  • Prompted Qwen2.5-1.5B-Instruct to generate detailed reasoning chains for GSM8K problems
  • Created structured dataset with numbered steps, mathematical formulations, and clear final answers
  • Format: Question/Answer pairs with explicit step-by-step reasoning
  • Dataset uploaded to Hugging Face for reproducibility

The training data emphasizes teaching the model to show its work through:

  • Numbered reasoning steps
  • Intermediate calculations
  • Clear problem decomposition
  • Explicit final answers

Training Procedure

Hardware:

  • Google Colab with T4 GPU (free tier)
  • Training completed in reasonable time on consumer-grade hardware

Technique: QLoRA (Quantized Low-Rank Adaptation)

  • 4-bit quantization of base model
  • LoRA adapters for efficient fine-tuning

Training Hyperparameters

  • LoRA Configuration:
    • Rank (r): 8
    • Alpha: 16
    • Target modules: Attention layers
  • Training regime: 4-bit quantization with LoRA adapters (QLoRA)

Evaluation

Testing Data & Metrics

Test Set: 10 sample problems from GSM8K

Evaluation Metrics:

  1. Answer Accuracy: Percentage of problems with correct final answers
  2. Reasoning Structure: Percentage of responses following step-by-step format

Results

Performance Summary:

Metric Score
Answer Accuracy 7/10 (70%)
Reasoning Structure 10/10 (100%)

Key Findings:

Strengths:

  • 100% adoption of structured reasoning format
  • All responses include intermediate calculations and explanations
  • Successfully breaks down complex problems into manageable steps
  • Significant improvement over base model in both structure and correctness

Weaknesses:

  • 30% error rate on mathematical accuracy
  • Some arithmetic errors in multi-step calculations
  • Incorrect answers despite showing reasoning steps

Analysis:

The model successfully learned both formatting and mathematical reasoning. The 70% accuracy with 100% structured output demonstrates effective fine-tuning. The self-teaching approach (using the same 1.5B model to generate training data) proved viable for teaching structure and improving accuracy, though there's room for improvement.

Environmental Impact

Training was conducted on Google Colab's free T4 GPU tier, minimizing environmental impact through:

  • Efficient QLoRA training (4-bit quantization)
  • Short training time on consumer-grade hardware
  • Parameter-efficient fine-tuning (only LoRA adapters trained)

Estimated carbon footprint is minimal due to use of shared, optimized infrastructure and efficient training methods.

Technical Specifications

Model Architecture and Objective

  • Base Architecture: Qwen2.5 transformer architecture (1.5B parameters)
  • Fine-tuning Method: QLoRA (4-bit quantized base model + trainable LoRA adapters)
  • Objective: Causal language modeling with focus on mathematical reasoning chains

Compute Infrastructure

Hardware

  • GPU: NVIDIA T4 (Google Colab free tier)
  • Memory: ~15GB GPU RAM (enabled by 4-bit quantization)

Software

  • PEFT 0.17.1
  • Transformers library (Hugging Face)
  • PyTorch
  • bitsandbytes (for quantization)

Citation

If you use this model, please cite:

BibTeX:

@misc{qwen25-math-reasoning,
  author = {Nishitha},
  title = {Qwen2.5-1.5B Fine-tuned for Mathematical Reasoning},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/model-name}}
}

Related Work:

  • Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs"
  • Cobbe et al. (2021). "Training Verifiers to Solve Math Word Problems" (GSM8K dataset)

Future Improvements

Potential Enhancements:

  • Use larger teacher models (Llama 70B, GPT-4) for higher-quality training data generation
  • Increase LoRA rank (16-32) for greater model capacity
  • Expand training dataset to 5,000-10,000 examples
  • Implement mathematical validation of reasoning chains
  • Fine-tune larger base models (Qwen 7B/14B) for improved baseline capability

Model Card Contact

For questions or feedback about this model, please reach out through the Hugging Face model repository.

Downloads last month
1
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Nishitha03/Qwen2.5-1.5b-Reasoning-Updated

Base model

Qwen/Qwen2.5-1.5B
Adapter
(524)
this model

Dataset used to train Nishitha03/Qwen2.5-1.5b-Reasoning-Updated

Evaluation results