Fine-tuned Physics VLM: Qwen2-VL-7B with LoRA & QLoRA
Model Details
Model Description
A specialized vision-language model fine-tuned for physics problem solving, combining OCR capabilities with mathematical reasoning through an innovative multi-adapter training approach.
| Attribute | Details |
|---|---|
| Developed by | Paranidharan |
| Model type | Vision-Language Model (VLM) with LoRA/QLoRA adapters |
| Language(s) (NLP) | English |
| License | Apache 2.0 (inherited from base model) |
| Finetuned from model | Qwen/Qwen2-VL-7B-Instruct |
| Fine-tuning Framework | PEFT 0.17.1 + bitsandbytes (QLoRA) |
| Distributed Training & Monitoring | DeepSpeed ZeRO-3 + Weights & Biases |
| Library | transformers, peft |
| Tags | lora, qlora, vision-language, physics, education, deepspeed |
Model Sources
- Repository: parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA
- Base Model: Qwen/Qwen2-VL-7B-Instruct
Model Overview
This model is a fine-tuned version of Qwen2-VL-7B-Instruct specifically optimized for physics education and problem-solving tasks. The model demonstrates excellent OCR capabilities with math-friendly reasoning, making it ideal for interpreting and solving physics problems from images.
Key Capabilities
- OCR Excellence: Accurate text extraction from physics diagrams and equations
- Mathematical Reasoning: Solid problem-solving capabilities for physics concepts
- Multi-modal Understanding: Seamless integration of visual and textual information
- Structured Output: JSON-formatted responses for auto-grading compatibility
Technical Architecture
Training Strategy
Our approach uses a hybrid multi-adapter fine-tuning strategy that optimizes different model components with specialized techniques (a configuration sketch follows the list):
- QLoRA (4-bit quantization) on LLM blocks using bitsandbytes + peft
- Tiny LoRA (r=4/8) on the vision-language projector linear layers
- Frozen vision encoder to preserve pre-trained visual representations
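A minimal sketch of how such a hybrid setup can be expressed with peft and bitsandbytes. The ranks, dropout, and target_modules below are illustrative assumptions, not the exact values used in training; a tiny adapter on the vision-language projector would be added the same way by targeting that module's linear layers.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA: load the base model with 4-bit NF4 quantization on the LLM weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the LLM attention/MLP projections; the vision encoder gets no
# adapters and stays frozen because get_peft_model freezes all base weights.
lora_config = LoraConfig(
    r=8,                      # illustrative rank (this card reports r=4-8)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```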
Infrastructure & Scaling
Distributed Training Setup:
- Hardware: 4×NVIDIA A6000 (24GB VRAM each) with H200 migration support
- Framework: DeepSpeed ZeRO-3 + Hugging Face Accelerate for distributed computing
- Precision: Mixed precision (bf16) with gradient accumulation
- Effective Batch Size: 32-64 sequences across the distributed setup
- Total GPU Hours: 16 GPU hours (4 hours × 4 GPUs)
- Distributed Strategy: DeepSpeed ZeRO-3 for memory optimization and model sharding (an illustrative config sketch follows this list)
- Monitoring: Weights & Biases integration with tqdm progress tracking
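An illustrative DeepSpeed ZeRO-3 configuration for this kind of setup, written as a Python dict so it can be passed directly to the Hugging Face Trainer. The specific values are assumptions, not the exact config used for this run.

```python
# Illustrative ZeRO-3 + bf16 config; "auto" values are filled in from TrainingArguments
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard optimizer state, gradients, and parameters across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
}
```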
Training Configuration
# Key Training Parameters
- Base Model: Qwen2-VL-7B-Instruct
- LoRA Rank: 4-8 (vision-language projector)
- QLoRA: 4-bit quantization (LLM blocks)
- Batch Size: 32-64 effective
- Precision: bf16 mixed precision
- Optimizer: AdamW with gradient accumulation
- Epochs: 3
- Training Duration: 4 hours wall-clock (16 GPU hours total)
- GPU Type: 4×NVIDIA A6000 (24GB each)
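The parameters above map onto TrainingArguments roughly as follows. The learning rate and per-device batch size are assumptions chosen to match the stated 32-64 effective batch size, and ds_config refers to the ZeRO-3 dict sketched earlier.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2vl-physics-lora",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=2,      # 2 x 4 GPUs x 4 accumulation steps = 32 effective
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                 # assumed; a common LoRA/QLoRA starting point
    bf16=True,
    optim="adamw_torch",
    deepspeed=ds_config,                # ZeRO-3 config from the sketch above
    logging_steps=10,
    report_to="wandb",                  # Weights & Biases monitoring
)
```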
Training Details
Dataset
Trained on ScienceQA - a comprehensive multi-subject dataset with strong physics representation.
Data Format:
{
  "image": "path/to/physics_problem.jpg",
  "question": "Calculate the acceleration of the object...",
  "answer": "The acceleration is 9.8 m/s² because..."
}
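A short sketch of how a record in this format can be turned into a Qwen2-VL chat sample; the JSONL file name is a placeholder.

```python
import json
from PIL import Image

def record_to_sample(record):
    """Convert one {image, question, answer} record into chat messages plus the image."""
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": record["question"]},
        ]},
        {"role": "assistant", "content": record["answer"]},
    ]
    return messages, Image.open(record["image"])

# Hypothetical JSONL file with one record per line in the format shown above
with open("scienceqa_physics.jsonl") as f:
    records = [json.loads(line) for line in f]

messages, image = record_to_sample(records[0])
```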
Training Metrics
Training Efficiency:
- Total Training Time: 4 hours wall-clock time
- Total GPU Hours: 16 GPU hours (4×A6000 for 4 hours)
- GPU Type: NVIDIA A6000 (24GB VRAM each)
- Distributed Computing: DeepSpeed ZeRO-3 for efficient multi-GPU training
- GPU Utilization: Optimized with gradient accumulation and mixed precision
- Memory Efficiency: QLoRA reduces memory footprint by ~60%
- Convergence: Stable training with consistent loss reduction across 3 epochs
Prompt Engineering
The model uses a specialized Physics-tutor system prompt with structured JSON output formatting for consistent response generation and auto-grading compatibility.
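The exact system prompt is not published with this card, so the wording and JSON field names below are assumptions that illustrate the pattern.

```python
# Hypothetical physics-tutor system prompt with a JSON output contract
SYSTEM_PROMPT = (
    "You are a physics tutor. Read the problem from the image, reason step by step, "
    "and answer ONLY with JSON of the form "
    '{"extracted_text": "...", "reasoning": "...", "final_answer": "..."}.'
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Calculate the acceleration of the object."},
    ]},
]
```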
Quick Start
Installation
pip install transformers torch peft bitsandbytes accelerate deepspeed wandb
Usage
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch
# Load base model and adapter (Qwen2-VL loads via its conditional-generation class,
# not AutoModelForCausalLM)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Process image and question (minimal single-image example)
def solve_physics_problem(image_path, question):
    image = Image.open(image_path)
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": question}]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    return processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)[0]
Performance & Results
Evaluation Highlights
- OCR Accuracy: Significant improvement on mathematical expressions
- Reasoning Quality: Enhanced step-by-step problem solving
- Response Structure: Consistent JSON formatting for automated evaluation
- Multi-modal Coherence: Better integration of visual and textual information
Technical Specifications
Model Architecture
- Base: Qwen2-VL-7B-Instruct (7B parameters)
- Vision Encoder: Frozen (preserves pre-trained representations)
- Language Model: QLoRA fine-tuned (4-bit quantization)
- Projector: Tiny LoRA adapted (rank 4-8)
Compute Requirements
- Training Hardware: 4×NVIDIA A6000 (24GB VRAM each)
- Training Time: 4 hours wall-clock (16 total GPU hours)
- Distributed Framework: DeepSpeed ZeRO-3 for memory-efficient multi-GPU training
- Inference: Single A6000/RTX 4090 compatible
- Memory: ~12GB VRAM for inference with quantization (see the loading sketch after this list)
- Scalability: H200 support planned for future training iterations
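A minimal sketch of loading the adapter on a single 24GB-class GPU with 4-bit quantization, consistent with the ~12GB figure above; the quantization settings are assumptions.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # fits on a single A6000 / RTX 4090
)
model = PeftModel.from_pretrained(base, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```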
Framework Versions
- PEFT: 0.17.1
- Transformers: Latest compatible version
- DeepSpeed: ZeRO-3 configuration
- PyTorch: 2.0+ with CUDA support
- Weights & Biases: For training monitoring and logging
Use Cases
Direct Use
- Educational Physics Tutoring: Step-by-step problem solving assistance
- OCR + Reasoning: Extract and solve physics problems from images
- Auto-grading Systems: JSON structured outputs for automated evaluation (a parsing sketch follows this list)
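For auto-grading, the structured response can be parsed directly; the field names below mirror the hypothetical JSON schema from the prompt-engineering sketch above, not a published schema.

```python
import json

def grade_response(response_text, expected_answer):
    """Parse the model's JSON reply and compare the final answer with the key."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return {"valid_json": False, "correct": False}
    predicted = str(parsed.get("final_answer", "")).strip().lower()
    return {"valid_json": True, "correct": predicted == expected_answer.strip().lower()}

print(grade_response(
    '{"extracted_text": "F = ma", "reasoning": "...", "final_answer": "9.8 m/s^2"}',
    "9.8 m/s^2",
))
```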
Downstream Use
- Physics Problem Analysis: Large-scale problem dataset processing
- Educational AI Research: Benchmark for vision-language physics understanding
- Homework Assistance Tools: Integration into educational platforms
Out-of-Scope Use
- Non-physics domains: Model is specifically optimized for physics problems
- Non-English languages: Training focused on English-language content
- Production without GPU: Requires GPU acceleration for optimal performance
Bias, Risks, and Limitations
Limitations
- Domain Specificity: Optimized primarily for physics problems
- Language Support: English-focused training data
- Computational Requirements: Requires GPU for optimal performance
- Dataset Bias: Limited to ScienceQA dataset characteristics
Recommendations
Users should be aware of the physics-focused training and may need additional fine-tuning for other scientific domains. GPU acceleration is recommended for production use.
How to Get Started with the Model
Basic Usage
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
# Load the fine-tuned model (base weights plus the LoRA/QLoRA adapter)
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# See solve_physics_problem in the Quick Start section above for a full inference example
Training Details
Training Data
- Dataset: ScienceQA multi-subject dataset
- Format: JSONL with image, question, answer triplets
- Focus: Physics problems with visual components
- Size: Optimized subset for physics reasoning
Training Procedure
Training Hyperparameters
- Training regime: Mixed precision bf16 with gradient accumulation
- Batch Size: 32-64 effective batch size across 4 GPUs
- Learning Rate: Optimized for LoRA/QLoRA setup
- Epochs: 3 epochs
- Distributed: DeepSpeed ZeRO-3 across 4×A6000
Speeds, Sizes, Times
- Training Time: 4 hours wall-clock
- Total GPU Hours: 16 hours (4×A6000)
- Model Size: Base 7B + LoRA adapters
- Memory Usage: ~60% reduction with QLoRA quantization
Training Carbon Footprint
- Hardware Type: 4×NVIDIA A6000 (24GB each)
- Hours used: 16 total GPU hours
- Training Duration: 4 hours wall-clock time
- Efficiency: QLoRA quantization reduces computational overhead
- Optimization: DeepSpeed ZeRO-3 for memory-efficient distributed training
Carbon emissions can be estimated using the Machine Learning Impact calculator.
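A back-of-the-envelope estimate in the spirit of that calculator; the 300 W board power and the grid carbon intensity below are assumptions, not measured values.

```python
# Rough CO2 estimate: energy = GPU-hours x per-GPU power; emissions = energy x grid intensity
gpu_hours = 16            # 4 x A6000 for 4 hours (from this card)
gpu_power_kw = 0.300      # assumed A6000 board power
grid_intensity = 0.4      # assumed kg CO2-eq per kWh

energy_kwh = gpu_hours * gpu_power_kw
emissions_kg = energy_kwh * grid_intensity
print(f"~{energy_kwh:.1f} kWh, ~{emissions_kg:.1f} kg CO2-eq")
```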
Citation
If you use this model in your research, please cite:
@misc{physics-vlm-lora-qlora-2025,
  title={Fine-tuned Physics VLM: Multi-Adapter Training with LoRA and QLoRA for Enhanced Mathematical Reasoning},
  author={Paranidharan},
  year={2025},
  url={https://huggingface.co/parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA},
  note={Fine-tuned on ScienceQA using DeepSpeed ZeRO-3 distributed training}
}
Model Card Contact & Authors
Model Card Authors: Paranidharan
Contact: Available through Hugging Face model repository
Related Resources
- Base Model: Qwen2-VL-7B-Instruct
- Training Framework: DeepSpeed
- Monitoring: Weights & Biases
- LoRA Implementation: PEFT
Model Type: Vision-Language Model with Multi-Adapter Fine-tuning
Training Date: September 2025
Languages: English
Domains: Physics, Mathematics, Science Education
Hardware: 4×NVIDIA A6000, DeepSpeed ZeRO-3 Distributed Training