Fine-tuned Physics VLM: Qwen2-VL-7B with LoRA & QLoRA 🧪⚡

Model Details

Model Description

A specialized vision-language model fine-tuned for physics problem solving, combining OCR capabilities with mathematical reasoning through an innovative multi-adapter training approach.

  • Developed by: Paranidharan
  • Model type: Vision-Language Model (VLM) with LoRA/QLoRA adapters
  • Language(s) (NLP): English
  • License: Apache 2.0 (inherited from base model)
  • Finetuned from model: Qwen/Qwen2-VL-7B-Instruct
  • Fine-tuning framework: PEFT 0.17.1 + bitsandbytes (QLoRA)
  • Distributed training & monitoring: DeepSpeed ZeRO-3 + Weights & Biases
  • Libraries: transformers, peft
  • Tags: lora, qlora, vision-language, physics, education, deepspeed

Model Sources

🎯 Model Overview

This model is a fine-tuned version of Qwen2-VL-7B-Instruct specifically optimized for physics education and problem-solving tasks. The model demonstrates excellent OCR capabilities with math-friendly reasoning, making it ideal for interpreting and solving physics problems from images.

Key Capabilities

  • OCR Excellence: Accurate text extraction from physics diagrams and equations
  • Mathematical Reasoning: Solid problem-solving capabilities for physics concepts
  • Multi-modal Understanding: Seamless integration of visual and textual information
  • Structured Output: JSON-formatted responses for auto-grading compatibility

Technical Architecture

Training Strategy

Our approach uses a hybrid multi-adapter fine-tuning strategy that optimizes different model components with specialized techniques (a configuration sketch follows this list):

  • QLoRA (4-bit quantization) on LLM blocks using bitsandbytes + peft
  • Tiny LoRA (r=4/8) on vision-language projector linear layers
  • Frozen vision encoder to preserve pre-trained visual representations
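
A minimal sketch of how this strategy could be set up with transformers + peft is shown below. The target module names (the LLM attention/MLP projections and the Qwen2-VL "merger" projector) and the rank/alpha values are assumptions for illustration, not the exact training script.

import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA part: load the LLM blocks in 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
)

# LoRA on the LLM projections plus a small adapter on the vision-language
# projector ("merger" in Qwen2-VL). The vision encoder is not listed in
# target_modules, so it stays frozen; get_peft_model freezes every
# non-adapter parameter. Module names and ranks here are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
        "merger.mlp.0", "merger.mlp.2",           # vision-language projector
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()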

Infrastructure & Scaling

Distributed Training Setup (a ZeRO-3 configuration sketch follows this list):

  • Hardware: 4×NVIDIA A6000 (24GB VRAM each), with a planned migration to H200
  • Framework: DeepSpeed ZeRO-3 + Hugging Face Accelerate for distributed computing
  • Precision: Mixed precision (bf16) with gradient accumulation
  • Effective Batch Size: 32-64 sequences across the distributed setup
  • Total GPU Hours: 16 GPU hours (4 hours × 4 GPUs)
  • Distributed Strategy: DeepSpeed ZeRO-3 for memory optimization and model sharding
  • Monitoring: Weights & Biases integration with tqdm progress tracking
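
The ZeRO-3 side of this setup can be captured in a config dict and handed to transformers (see the training-arguments sketch further below). The specific flags and "auto" placeholders here are illustrative assumptions, not the exact configuration used.

# Illustrative DeepSpeed ZeRO-3 configuration, expressed as a Python dict
# that transformers' TrainingArguments accepts via its `deepspeed` argument.
ds_zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}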

Training Configuration

# Key Training Parameters
- Base Model: Qwen2-VL-7B-Instruct
- LoRA Rank: 4-8 (vision-language projector)
- QLoRA: 4-bit quantization (LLM blocks)
- Batch Size: 32-64 effective
- Precision: bf16 mixed precision
- Optimizer: AdamW with gradient accumulation
- Epochs: 3
- Training Duration: 4 hours wall-clock (16 GPU hours total)
- GPU Type: 4×NVIDIA A6000 (24GB each)
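
Expressed as transformers TrainingArguments, the parameters above might look roughly like the sketch below. The learning rate and per-device batch size are assumptions chosen to land in the reported effective batch range, and ds_zero3_config refers to the ZeRO-3 dict sketched in the previous section.

from transformers import TrainingArguments

# Illustrative hyperparameters; learning rate and per-device batch size
# are assumptions, not the exact values used in training.
training_args = TrainingArguments(
    output_dir="qwen2vl-physics-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,       # 2 x 4 GPUs x 8 accumulation steps = 64 effective
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    bf16=True,
    optim="adamw_torch",
    deepspeed=ds_zero3_config,           # ZeRO-3 dict from the infrastructure section
    report_to="wandb",                   # Weights & Biases monitoring
    logging_steps=10,
)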

📊 Training Details

Dataset

Trained on ScienceQA, a comprehensive multi-subject dataset with strong physics representation.

Data Format:

{
  "image": "path/to/physics_problem.jpg",
  "question": "Calculate the acceleration of the object...",
  "answer": "The acceleration is 9.8 m/sΒ² because..."
}
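
A sketch of how a record in this format could be mapped to Qwen2-VL chat messages for supervised fine-tuning; the JSONL file name is hypothetical and the message layout follows the Qwen2-VL chat convention.

import json

def record_to_messages(record):
    """Convert one image/question/answer record into chat-style messages."""
    return [
        {"role": "user", "content": [
            {"type": "image", "image": record["image"]},
            {"type": "text", "text": record["question"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": record["answer"]},
        ]},
    ]

# Hypothetical JSONL file with one record per line
with open("physics_train.jsonl") as f:
    samples = [record_to_messages(json.loads(line)) for line in f]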

Training Metrics

Training Efficiency:

  • Total Training Time: 4 hours wall-clock time
  • Total GPU Hours: 16 GPU hours (4×A6000 for 4 hours)
  • GPU Type: NVIDIA A6000 (24GB VRAM each)
  • Distributed Computing: DeepSpeed ZeRO-3 for efficient multi-GPU training
  • GPU Utilization: Optimized with gradient accumulation and mixed precision
  • Memory Efficiency: QLoRA reduces memory footprint by ~60%
  • Convergence: Stable training with consistent loss reduction across 3 epochs

[Figure placeholder: GPU utilization and memory usage charts]

Prompt Engineering

The model uses a specialized Physics-tutor system prompt with structured JSON output formatting for consistent response generation and auto-grading compatibility.
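
The exact prompt used in training is not reproduced here; the sketch below only illustrates the idea of a physics-tutor system prompt with a fixed JSON response schema. The field names (reasoning, final_answer, units) are assumptions for illustration.

# Illustrative physics-tutor system prompt; wording and JSON fields are assumptions.
PHYSICS_TUTOR_SYSTEM_PROMPT = (
    "You are a physics tutor. Read the problem in the image, reason step by step, "
    "and respond ONLY with JSON of the form "
    '{"reasoning": "<step-by-step work>", "final_answer": "<value>", "units": "<SI units>"}.'
)

messages = [
    {"role": "system", "content": PHYSICS_TUTOR_SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Calculate the acceleration of the object shown."},
    ]},
]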

Quick Start

Installation

pip install transformers torch peft bitsandbytes accelerate deepspeed wandb

Usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the base model, attach the LoRA/QLoRA adapter, and load the processor
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Process an image and question, then decode only the newly generated tokens
def solve_physics_problem(image_path, question):
    image = Image.open(image_path)
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

📈 Performance & Results

Evaluation Highlights

  • OCR Accuracy: Significant improvement on mathematical expressions
  • Reasoning Quality: Enhanced step-by-step problem solving
  • Response Structure: Consistent JSON formatting for automated evaluation (see the grading sketch after this list)
  • Multi-modal Coherence: Better integration of visual and textual information
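
Because responses are structured JSON, automated grading can reduce to a parse-and-compare step. A minimal sketch, assuming the hypothetical final_answer field from the prompt-engineering section:

import json

def grade_response(model_output: str, expected_answer: str) -> bool:
    """Parse the model's JSON response and compare its final answer to the key."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False                      # malformed JSON fails auto-grading
    return parsed.get("final_answer", "").strip() == expected_answer.strip()

print(grade_response('{"reasoning": "F = ma ...", "final_answer": "9.8 m/s^2"}', "9.8 m/s^2"))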

Technical Specifications

Model Architecture

  • Base: Qwen2-VL-7B-Instruct (7B parameters)
  • Vision Encoder: Frozen (preserves pre-trained representations)
  • Language Model: QLoRA fine-tuned (4-bit quantization)
  • Projector: Tiny LoRA adapted (rank 4-8)

Compute Requirements

  • Training Hardware: 4×NVIDIA A6000 (24GB VRAM each)
  • Training Time: 4 hours wall-clock (16 total GPU hours)
  • Distributed Framework: DeepSpeed ZeRO-3 for memory-efficient multi-GPU training
  • Inference: Single A6000/RTX 4090 compatible
  • Memory: ~12GB VRAM for inference with quantization (see the loading sketch after this list)
  • Scalability: H200 support planned for future training iterations
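
To stay within the ~12GB inference footprint noted above, the base model can be loaded in 4-bit before attaching the adapter. A loading sketch, with the quantization settings as assumptions:

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

# Load the 7B base model in 4-bit so inference fits on a single 24GB GPU
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")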

Framework Versions

  • PEFT: 0.17.1
  • Transformers: Latest compatible version
  • DeepSpeed: ZeRO-3 configuration
  • PyTorch: 2.0+ with CUDA support
  • Weights & Biases: For training monitoring and logging

🎓 Use Cases

Direct Use

  • Educational Physics Tutoring: Step-by-step problem solving assistance
  • OCR + Reasoning: Extract and solve physics problems from images
  • Auto-grading Systems: JSON structured outputs for automated evaluation

Downstream Use

  • Physics Problem Analysis: Large-scale problem dataset processing
  • Educational AI Research: Benchmark for vision-language physics understanding
  • Homework Assistance Tools: Integration into educational platforms

Out-of-Scope Use

  • Non-physics domains: Model is specifically optimized for physics problems
  • Non-English languages: Training focused on English-language content
  • CPU-only production deployment: GPU acceleration is required for practical inference performance

⚠️ Bias, Risks, and Limitations

Limitations

  • Domain Specificity: Optimized primarily for physics problems
  • Language Support: English-focused training data
  • Computational Requirements: Requires GPU for optimal performance
  • Dataset Bias: Limited to ScienceQA dataset characteristics

Recommendations

Users should be aware of the physics-focused training and may need additional fine-tuning for other scientific domains. GPU acceleration is recommended for production use.

🚀 How to Get Started with the Model

Basic Usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

# Load the fine-tuned model (base weights + LoRA/QLoRA adapter)
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# See the Quick Start section above for a complete solve_physics_problem() example

📊 Training Details

Training Data

  • Dataset: ScienceQA multi-subject dataset
  • Format: JSONL with image, question, answer triplets
  • Focus: Physics problems with visual components
  • Size: Optimized subset for physics reasoning

Training Procedure

Training Hyperparameters

  • Training regime: Mixed precision bf16 with gradient accumulation
  • Batch Size: 32-64 effective batch size across 4 GPUs
  • Learning Rate: Optimized for LoRA/QLoRA setup
  • Epochs: 3 epochs
  • Distributed: DeepSpeed ZeRO-3 across 4×A6000

Speeds, Sizes, Times

  • Training Time: 4 hours wall-clock
  • Total GPU Hours: 16 hours (4×A6000)
  • Model Size: Base 7B + LoRA adapters
  • Memory Usage: ~60% reduction with QLoRA quantization

Training Carbon Footprint

  • Hardware Type: 4×NVIDIA A6000 (24GB each)
  • Hours used: 16 total GPU hours
  • Training Duration: 4 hours wall-clock time
  • Efficiency: QLoRA quantization reduces computational overhead
  • Optimization: DeepSpeed ZeRO-3 for memory-efficient distributed training

Carbon emissions can be estimated using the Machine Learning Impact calculator.
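
As a rough back-of-the-envelope estimate (assuming roughly 300 W board power per GPU under load and an illustrative grid intensity of 0.4 kgCO2e/kWh, both assumptions rather than measurements):

# Rough carbon estimate for 16 GPU-hours; power draw and grid intensity are assumed.
gpu_hours = 16
avg_power_kw = 0.3                               # ~300 W per GPU under load (assumed)
grid_intensity_kg_per_kwh = 0.4                  # illustrative grid average (assumed)

energy_kwh = gpu_hours * avg_power_kw            # ~4.8 kWh
emissions_kg = energy_kwh * grid_intensity_kg_per_kwh   # ~1.9 kg CO2e
print(f"~{energy_kwh:.1f} kWh, ~{emissions_kg:.1f} kg CO2e")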

📚 Citation

If you use this model in your research, please cite:

@misc{physics-vlm-lora-qlora-2025,
  title={Fine-tuned Physics VLM: Multi-Adapter Training with LoRA and QLoRA for Enhanced Mathematical Reasoning},
  author={Paranidharan},
  year={2025},
  url={https://huggingface.co/parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA},
  note={Fine-tuned on ScienceQA using DeepSpeed ZeRO-3 distributed training}
}

🔗 Model Card Contact & Authors

Model Card Authors: Paranidharan
Contact: Available through Hugging Face model repository

Related Resources


Model Type: Vision-Language Model with Multi-Adapter Fine-tuning
Training Date: September 2025
Languages: English
Domains: Physics, Mathematics, Science Education
Hardware: 4×NVIDIA A6000, DeepSpeed ZeRO-3 Distributed Training
