Qwen2.5-1.5B-Instruct Fine-tuned for Mathematical Reasoning
A fine-tuned version of Qwen2.5-1.5B-Instruct trained to solve mathematical word problems with explicit step-by-step reasoning chains.
Model Details
Model Description
This model is a QLoRA fine-tuned version of Qwen2.5-1.5B-Instruct, specifically trained to solve mathematical word problems from the GSM8K dataset. The model learns to break down complex problems into numbered reasoning steps, show intermediate calculations, and provide clear final answers.
The fine-tuning uses synthetic data generated by prompting the base model to produce detailed reasoning chains, then training on these structured examples to reinforce both mathematical accuracy and explanation quality.
- Developed by: Nishitha
- Model type: Causal Language Model (Fine-tuned with QLoRA)
- Language: English
- License: Same as base model (Qwen2.5-1.5B-Instruct)
- Finetuned from model: Qwen/Qwen2.5-1.5B-Instruct
- Fine-tuning method: QLoRA (4-bit quantization + LoRA adapters)
Model Sources
- Base Model Repository: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- Training Dataset: GSM8K with synthetic reasoning chains
Uses
Direct Use
This model is designed to solve grade school math word problems with step-by-step explanations. It excels at:
- Breaking down complex math problems into manageable steps
- Showing intermediate calculations and reasoning
- Providing structured, educational responses
- Teaching mathematical problem-solving approaches
Example usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the fine-tuned LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "Nishitha03/Qwen2.5-1.5b-Reasoning-Updated")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

prompt = "Question: Janet has 5 apples. She buys 3 more. How many does she have now?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
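The adapter is expected to respond in the trained format: numbered reasoning steps with intermediate calculations, followed by an explicit final answer.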
Downstream Use
Potential applications include:
- Educational tutoring systems requiring step-by-step explanations
- Math homework assistance tools
- Reasoning capability enhancement for small language models
- Foundation for further fine-tuning on domain-specific math problems
Out-of-Scope Use
This model is NOT suitable for:
- Advanced mathematics (calculus, linear algebra, etc.) - trained only on grade school math
- High-stakes decision making or professional calculations
- Problems requiring external tools, calculators, or symbolic computation
- Non-mathematical reasoning tasks
Bias, Risks, and Limitations
Known Limitations:
- 70% answer accuracy on a small 10-problem test set; the model answered 3 of the 10 problems incorrectly
- Occasional arithmetic mistakes in multi-step calculations
- Training data was generated by the same-sized base model, which limits the maximum achievable accuracy
- Small model size (1.5B parameters) constrains mathematical reasoning capability
- May confidently present incorrect answers with plausible-looking reasoning steps
Risks:
- Users may trust incorrect mathematical solutions if they appear well-reasoned
- Not suitable for any application where calculation accuracy is critical
- May inherit biases from the GSM8K dataset and base model
Recommendations
- Always verify answers for important calculations
- Use as an educational aid, not a calculator replacement
- Best suited for learning and demonstration rather than production applications
- Consider ensemble methods or verification steps for critical use cases (a minimal self-consistency sketch follows this list)
- Be aware that structured reasoning doesn't guarantee correctness
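One simple verification step is self-consistency: sample several reasoning chains and keep the majority final answer. The sketch below is not part of this model's pipeline; it assumes the model's outputs end with a final numeric answer and that a regex over the last number is a reasonable extraction heuristic.

```python
# Sketch: self-consistency check — sample several reasoning chains and keep the
# majority final answer. The answer-extraction regex and sampling settings are
# assumptions about this model's output style, not part of the released pipeline.
import re
from collections import Counter

def extract_final_answer(text: str):
    # Take the last number in the response as the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def majority_answer(model, tokenizer, prompt: str, n_samples: int = 5):
    answers = []
    for _ in range(n_samples):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
        answer = extract_final_answer(tokenizer.decode(out[0], skip_special_tokens=True))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None, 0.0
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)  # answer plus agreement ratio
```

A low agreement ratio is a useful signal that the answer should be checked by hand.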
Training Details
Training Data
Dataset: GSM8K (Grade School Math 8K)
Synthetic Data Generation Process:
- Prompted Qwen2.5-1.5B-Instruct to generate detailed reasoning chains for GSM8K problems (a minimal generation sketch is shown after this list)
- Created structured dataset with numbered steps, mathematical formulations, and clear final answers
- Format: Question/Answer pairs with explicit step-by-step reasoning
- Dataset uploaded to Hugging Face for reproducibility
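The generation script itself is not included in this card; a minimal sketch of the approach, assuming the public `openai/gsm8k` dataset and an illustrative instruction prompt (the exact prompt wording, decoding settings, and dataset slice are assumptions), could look like this:

```python
# Sketch: generate step-by-step reasoning chains for GSM8K questions with the
# base Qwen2.5-1.5B-Instruct model. Prompt text and generation settings are
# assumptions, not the exact recipe used to build the published dataset.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

gsm8k = load_dataset("openai/gsm8k", "main", split="train")

def generate_reasoning(question: str) -> str:
    # Ask the instruct model for numbered steps and an explicit final answer.
    messages = [{
        "role": "user",
        "content": f"Solve this problem with numbered steps and end with "
                   f"'Final Answer: <number>'.\n\n{question}",
    }]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Small slice for illustration only
synthetic = [{"question": ex["question"], "answer": generate_reasoning(ex["question"])}
             for ex in gsm8k.select(range(100))]
```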
The training data emphasizes teaching the model to show its work through the elements below; an illustrative record follows the list:
- Numbered reasoning steps
- Intermediate calculations
- Clear problem decomposition
- Explicit final answers
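The published dataset's exact schema is not reproduced here; a hypothetical record in the format described above might look like the following (field names and wording are illustrative):

```python
# Hypothetical training record illustrating the Question/Answer format
# (field names and exact wording are assumptions, not the published dataset schema).
example_record = {
    "question": "Janet has 5 apples. She buys 3 more. How many does she have now?",
    "answer": (
        "Step 1: Janet starts with 5 apples.\n"
        "Step 2: She buys 3 more apples, so 5 + 3 = 8.\n"
        "Final Answer: 8"
    ),
}
```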
Training Procedure
Hardware:
- Google Colab with T4 GPU (free tier)
- Training completed in reasonable time on consumer-grade hardware
Technique: QLoRA (Quantized Low-Rank Adaptation)
- 4-bit quantization of base model
- LoRA adapters for efficient fine-tuning
Training Hyperparameters
- LoRA Configuration:
  - Rank (r): 8
  - Alpha: 16
  - Target modules: Attention layers
- Training regime: 4-bit quantization with LoRA adapters (QLoRA); a configuration sketch is shown after this list
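The full training script is not part of this card; a minimal QLoRA setup consistent with the hyperparameters above might look like the sketch below (the specific attention module names, dropout, and quantization details beyond "4-bit" are assumptions):

```python
# Sketch of a QLoRA setup matching the listed hyperparameters
# (target module names, dropout, and compute dtype are assumptions).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=8,                                    # rank from this card
    lora_alpha=16,                          # alpha from this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers (assumed names)
    lora_dropout=0.05,                      # assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```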
Evaluation
Testing Data & Metrics
Test Set: 10 sample problems from GSM8K
Evaluation Metrics:
- Answer Accuracy: Percentage of problems with correct final answers
- Reasoning Structure: Percentage of responses following the step-by-step format (a rough scoring sketch follows this list)
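The grading script is not included in this card; a rough way to compute both metrics, assuming the model's responses contain numbered "Step N:" lines and end with a final numeric answer, is sketched below.

```python
# Sketch: score answer accuracy and reasoning structure for a batch of responses.
# The "Step N:" marker and final-number extraction are assumptions about this
# model's output format; GSM8K gold answers end with "#### <number>".
import re

def final_number(text: str):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def score(responses, gold_answers):
    correct = 0
    structured = 0
    for response, gold in zip(responses, gold_answers):
        if final_number(response) == final_number(gold):
            correct += 1
        if re.search(r"Step\s*\d+", response):       # numbered reasoning steps present
            structured += 1
    n = len(responses)
    return {"answer_accuracy": correct / n, "reasoning_structure": structured / n}
```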
Results
Performance Summary:
| Metric | Score |
|---|---|
| Answer Accuracy | 7/10 (70%) |
| Reasoning Structure | 10/10 (100%) |
Key Findings:
✅ Strengths:
- 100% adoption of structured reasoning format
- All responses include intermediate calculations and explanations
- Successfully breaks down complex problems into manageable steps
- Significant improvement over base model in both structure and correctness
❌ Weaknesses:
- 30% error rate on mathematical accuracy
- Some arithmetic errors in multi-step calculations
- Incorrect answers despite showing reasoning steps
Analysis:
The model learned both the target format and stronger mathematical reasoning. Reaching 70% answer accuracy with 100% structured output suggests the fine-tuning was effective, though the 10-problem evaluation is too small to draw firm conclusions. The self-teaching approach (using the same 1.5B model to generate its own training data) proved viable for instilling structure and improving accuracy, with clear room for improvement.
Environmental Impact
Training was conducted on Google Colab's free T4 GPU tier, minimizing environmental impact through:
- Efficient QLoRA training (4-bit quantization)
- Short training time on consumer-grade hardware
- Parameter-efficient fine-tuning (only LoRA adapters trained)
Estimated carbon footprint is minimal due to use of shared, optimized infrastructure and efficient training methods.
Technical Specifications
Model Architecture and Objective
- Base Architecture: Qwen2.5 transformer architecture (1.5B parameters)
- Fine-tuning Method: QLoRA (4-bit quantized base model + trainable LoRA adapters)
- Objective: Causal language modeling with a focus on mathematical reasoning chains (the standard next-token loss is written out below)
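For reference, the objective is the standard next-token cross-entropy over the tokenized question-and-reasoning sequences, optimized only with respect to the LoRA adapter parameters $\theta$ (whether prompt tokens are masked out of the loss is not specified in this card):

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
$$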
Compute Infrastructure
Hardware
- GPU: NVIDIA T4 (Google Colab free tier)
- Memory: ~15GB GPU RAM (enabled by 4-bit quantization)
Software
- PEFT 0.17.1
- Transformers library (Hugging Face)
- PyTorch
- bitsandbytes (for quantization)
Citation
If you use this model, please cite:
BibTeX:
```bibtex
@misc{qwen25-math-reasoning,
  author       = {Nishitha},
  title        = {Qwen2.5-1.5B Fine-tuned for Mathematical Reasoning},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Nishitha03/Qwen2.5-1.5b-Reasoning-Updated}}
}
```
Related Work:
- Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs"
- Cobbe et al. (2021). "Training Verifiers to Solve Math Word Problems" (GSM8K dataset)
Future Improvements
Potential Enhancements:
- Use larger teacher models (Llama 70B, GPT-4) for higher-quality training data generation
- Increase LoRA rank (16-32) for greater model capacity
- Expand training dataset to 5,000-10,000 examples
- Implement mathematical validation of reasoning chains
- Fine-tune larger base models (Qwen 7B/14B) for improved baseline capability
Model Card Contact
For questions or feedback about this model, please reach out through the Hugging Face model repository.