Fine-tuned Physics VLM: Qwen2-VL-7B with LoRA & QLoRA 🧪⚡

Model Details

Model Description

A specialized vision-language model fine-tuned for physics problem solving, combining OCR capabilities with mathematical reasoning through an innovative multi-adapter training approach.

  • Developed by: Paranidharan
  • Model type: Vision-Language Model (VLM) with LoRA/QLoRA adapters
  • Language(s) (NLP): English
  • License: Apache 2.0 (inherited from base model)
  • Finetuned from model: Qwen/Qwen2-VL-7B-Instruct
  • Fine-tuning framework: PEFT 0.17.1 + bitsandbytes (QLoRA)
  • Distributed training & monitoring: DeepSpeed ZeRO-3 + Weights & Biases
  • Libraries: transformers, peft
  • Tags: lora, qlora, vision-language, physics, education, deepspeed

Model Sources

🎯 Model Overview

This model is a fine-tuned version of Qwen2-VL-7B-Instruct specifically optimized for physics education and problem-solving tasks. The model demonstrates excellent OCR capabilities with math-friendly reasoning, making it ideal for interpreting and solving physics problems from images.

Key Capabilities

  • OCR Excellence: Accurate text extraction from physics diagrams and equations
  • Mathematical Reasoning: Solid problem-solving capabilities for physics concepts
  • Multi-modal Understanding: Seamless integration of visual and textual information
  • Structured Output: JSON-formatted responses for auto-grading compatibility

Technical Architecture

Training Strategy

Our approach uses a hybrid multi-adapter fine-tuning strategy that optimizes different model components with specialized techniques (a configuration sketch follows this list):

  • QLoRA (4-bit quantization) on LLM blocks using bitsandbytes + peft
  • Tiny LoRA (r=4/8) on vision-language projector linear layers
  • Frozen vision encoder to preserve pre-trained visual representations
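
A minimal sketch of how this strategy could be set up with transformers + peft is shown below. The target module names (the LLM attention/MLP projections and the Qwen2-VL "merger" projector) and the rank/alpha values are assumptions for illustration, not the exact training script.

import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA part: load the LLM blocks in 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
)

# LoRA on the LLM projections plus a small adapter on the vision-language
# projector ("merger" in Qwen2-VL). The vision encoder is not listed in
# target_modules, so it stays frozen; get_peft_model freezes every
# non-adapter parameter. Module names and ranks here are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
        "merger.mlp.0", "merger.mlp.2",           # vision-language projector
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()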

Infrastructure & Scaling

Distributed Training Setup (a ZeRO-3 configuration sketch follows this list):

  • Hardware: 4×NVIDIA A6000 (24GB VRAM each), with a planned migration to H200
  • Framework: DeepSpeed ZeRO-3 + Hugging Face Accelerate for distributed computing
  • Precision: Mixed precision (bf16) with gradient accumulation
  • Effective Batch Size: 32-64 sequences across the distributed setup
  • Total GPU Hours: 16 GPU hours (4 hours × 4 GPUs)
  • Distributed Strategy: DeepSpeed ZeRO-3 for memory optimization and model sharding
  • Monitoring: Weights & Biases integration with tqdm progress tracking
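
The ZeRO-3 side of this setup can be captured in a config dict and handed to transformers (see the training-arguments sketch further below). The specific flags and "auto" placeholders here are illustrative assumptions, not the exact configuration used.

# Illustrative DeepSpeed ZeRO-3 configuration, expressed as a Python dict
# that transformers' TrainingArguments accepts via its `deepspeed` argument.
ds_zero3_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}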

Training Configuration

# Key Training Parameters
- Base Model: Qwen2-VL-7B-Instruct
- LoRA Rank: 4-8 (vision-language projector)
- QLoRA: 4-bit quantization (LLM blocks)
- Batch Size: 32-64 effective
- Precision: bf16 mixed precision
- Optimizer: AdamW with gradient accumulation
- Epochs: 3
- Training Duration: 4 hours wall-clock (16 GPU hours total)
- GPU Type: 4×NVIDIA A6000 (24GB each)
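
Expressed as transformers TrainingArguments, the parameters above might look roughly like the sketch below. The learning rate and per-device batch size are assumptions chosen to land in the reported effective batch range, and ds_zero3_config refers to the ZeRO-3 dict sketched in the previous section.

from transformers import TrainingArguments

# Illustrative hyperparameters; learning rate and per-device batch size
# are assumptions, not the exact values used in training.
training_args = TrainingArguments(
    output_dir="qwen2vl-physics-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,       # 2 x 4 GPUs x 8 accumulation steps = 64 effective
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    bf16=True,
    optim="adamw_torch",
    deepspeed=ds_zero3_config,           # ZeRO-3 dict from the infrastructure section
    report_to="wandb",                   # Weights & Biases monitoring
    logging_steps=10,
)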

📊 Training Details

Dataset

Trained on ScienceQA, a comprehensive multi-subject dataset with strong physics representation.

Data Format:

{
  "image": "path/to/physics_problem.jpg",
  "question": "Calculate the acceleration of the object...",
  "answer": "The acceleration is 9.8 m/sΒ² because..."
}
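
A sketch of how a record in this format could be mapped to Qwen2-VL chat messages for supervised fine-tuning; the JSONL file name is hypothetical and the message layout follows the Qwen2-VL chat convention.

import json

def record_to_messages(record):
    """Convert one image/question/answer record into chat-style messages."""
    return [
        {"role": "user", "content": [
            {"type": "image", "image": record["image"]},
            {"type": "text", "text": record["question"]},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": record["answer"]},
        ]},
    ]

# Hypothetical JSONL file with one record per line
with open("physics_train.jsonl") as f:
    samples = [record_to_messages(json.loads(line)) for line in f]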

Training Metrics

Training Efficiency:

  • Total Training Time: 4 hours wall-clock time
  • Total GPU Hours: 16 GPU hours (4×A6000 for 4 hours)
  • GPU Type: NVIDIA A6000 (24GB VRAM each)
  • Distributed Computing: DeepSpeed ZeRO-3 for efficient multi-GPU training
  • GPU Utilization: Optimized with gradient accumulation and mixed precision
  • Memory Efficiency: QLoRA reduces memory footprint by ~60%
  • Convergence: Stable training with consistent loss reduction across 3 epochs

[Figure placeholder: GPU utilization and memory usage charts]

Prompt Engineering

The model uses a specialized Physics-tutor system prompt with structured JSON output formatting for consistent response generation and auto-grading compatibility.
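
The exact prompt used in training is not reproduced here; the sketch below only illustrates the idea of a physics-tutor system prompt with a fixed JSON response schema. The field names (reasoning, final_answer, units) are assumptions for illustration.

# Illustrative physics-tutor system prompt; wording and JSON fields are assumptions.
PHYSICS_TUTOR_SYSTEM_PROMPT = (
    "You are a physics tutor. Read the problem in the image, reason step by step, "
    "and respond ONLY with JSON of the form "
    '{"reasoning": "<step-by-step work>", "final_answer": "<value>", "units": "<SI units>"}.'
)

messages = [
    {"role": "system", "content": PHYSICS_TUTOR_SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Calculate the acceleration of the object shown."},
    ]},
]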

Quick Start

Installation

pip install transformers torch peft bitsandbytes accelerate deepspeed wandb

Usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the base model, attach the LoRA/QLoRA adapter, and load the processor
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Process an image and question, then decode only the newly generated tokens
def solve_physics_problem(image_path, question):
    image = Image.open(image_path)
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

📈 Performance & Results

Evaluation Highlights

  • OCR Accuracy: Significant improvement on mathematical expressions
  • Reasoning Quality: Enhanced step-by-step problem solving
  • Response Structure: Consistent JSON formatting for automated evaluation (see the grading sketch after this list)
  • Multi-modal Coherence: Better integration of visual and textual information
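
Because responses are structured JSON, automated grading can reduce to a parse-and-compare step. A minimal sketch, assuming the hypothetical final_answer field from the prompt-engineering section:

import json

def grade_response(model_output: str, expected_answer: str) -> bool:
    """Parse the model's JSON response and compare its final answer to the key."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False                      # malformed JSON fails auto-grading
    return parsed.get("final_answer", "").strip() == expected_answer.strip()

print(grade_response('{"reasoning": "F = ma ...", "final_answer": "9.8 m/s^2"}', "9.8 m/s^2"))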

Technical Specifications

Model Architecture

  • Base: Qwen2-VL-7B-Instruct (7B parameters)
  • Vision Encoder: Frozen (preserves pre-trained representations)
  • Language Model: QLoRA fine-tuned (4-bit quantization)
  • Projector: Tiny LoRA adapted (rank 4-8)

Compute Requirements

  • Training Hardware: 4×NVIDIA A6000 (24GB VRAM each)
  • Training Time: 4 hours wall-clock (16 total GPU hours)
  • Distributed Framework: DeepSpeed ZeRO-3 for memory-efficient multi-GPU training
  • Inference: Single A6000/RTX 4090 compatible
  • Memory: ~12GB VRAM for inference with quantization (see the loading sketch after this list)
  • Scalability: H200 support planned for future training iterations
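
To stay within the ~12GB inference footprint noted above, the base model can be loaded in 4-bit before attaching the adapter. A loading sketch, with the quantization settings as assumptions:

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

# Load the 7B base model in 4-bit so inference fits on a single 24GB GPU
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")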

Framework Versions

  • PEFT: 0.17.1
  • Transformers: Latest compatible version
  • DeepSpeed: ZeRO-3 configuration
  • PyTorch: 2.0+ with CUDA support
  • Weights & Biases: For training monitoring and logging

🎓 Use Cases

Direct Use

  • Educational Physics Tutoring: Step-by-step problem solving assistance
  • OCR + Reasoning: Extract and solve physics problems from images
  • Auto-grading Systems: JSON structured outputs for automated evaluation

Downstream Use

  • Physics Problem Analysis: Large-scale problem dataset processing
  • Educational AI Research: Benchmark for vision-language physics understanding
  • Homework Assistance Tools: Integration into educational platforms

Out-of-Scope Use

  • Non-physics domains: Model is specifically optimized for physics problems
  • Non-English languages: Training focused on English-language content
  • CPU-only production deployment: GPU acceleration is required for practical inference performance

⚠️ Bias, Risks, and Limitations

Limitations

  • Domain Specificity: Optimized primarily for physics problems
  • Language Support: English-focused training data
  • Computational Requirements: Requires GPU for optimal performance
  • Dataset Bias: Limited to ScienceQA dataset characteristics

Recommendations

Users should be aware of the physics-focused training and may need additional fine-tuning for other scientific domains. GPU acceleration is recommended for production use.

🚀 How to Get Started with the Model

Basic Usage

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

# Load the fine-tuned model (base weights + LoRA/QLoRA adapter)
base_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# See the Quick Start section above for a complete solve_physics_problem() example

📊 Training Details

Training Data

  • Dataset: ScienceQA multi-subject dataset
  • Format: JSONL with image, question, answer triplets
  • Focus: Physics problems with visual components
  • Size: Optimized subset for physics reasoning

Training Procedure

Training Hyperparameters

  • Training regime: Mixed precision bf16 with gradient accumulation
  • Batch Size: 32-64 effective batch size across 4 GPUs
  • Learning Rate: Optimized for LoRA/QLoRA setup
  • Epochs: 3 epochs
  • Distributed: DeepSpeed ZeRO-3 across 4×A6000

Speeds, Sizes, Times

  • Training Time: 4 hours wall-clock
  • Total GPU Hours: 16 hours (4×A6000)
  • Model Size: Base 7B + LoRA adapters
  • Memory Usage: ~60% reduction with QLoRA quantization

Training Carbon Footprint

  • Hardware Type: 4×NVIDIA A6000 (24GB each)
  • Hours used: 16 total GPU hours
  • Training Duration: 4 hours wall-clock time
  • Efficiency: QLoRA quantization reduces computational overhead
  • Optimization: DeepSpeed ZeRO-3 for memory-efficient distributed training

Carbon emissions can be estimated using the Machine Learning Impact calculator.
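
As a rough back-of-the-envelope estimate (assuming roughly 300 W board power per GPU under load and an illustrative grid intensity of 0.4 kgCO2e/kWh, both assumptions rather than measurements):

# Rough carbon estimate for 16 GPU-hours; power draw and grid intensity are assumed.
gpu_hours = 16
avg_power_kw = 0.3                               # ~300 W per GPU under load (assumed)
grid_intensity_kg_per_kwh = 0.4                  # illustrative grid average (assumed)

energy_kwh = gpu_hours * avg_power_kw            # ~4.8 kWh
emissions_kg = energy_kwh * grid_intensity_kg_per_kwh   # ~1.9 kg CO2e
print(f"~{energy_kwh:.1f} kWh, ~{emissions_kg:.1f} kg CO2e")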

📚 Citation

If you use this model in your research, please cite:

@misc{physics-vlm-lora-qlora-2025,
  title={Fine-tuned Physics VLM: Multi-Adapter Training with LoRA and QLoRA for Enhanced Mathematical Reasoning},
  author={Paranidharan},
  year={2025},
  url={https://huggingface.co/parani01/Fine-tuned-physics-VLM-on-LoRA-and-QLoRA},
  note={Fine-tuned on ScienceQA using DeepSpeed ZeRO-3 distributed training}
}

🔗 Model Card Contact & Authors

Model Card Authors: Paranidharan
Contact: Available through Hugging Face model repository

Related Resources


Model Type: Vision-Language Model with Multi-Adapter Fine-tuning
Training Date: September 2025
Languages: English
Domains: Physics, Mathematics, Science Education
Hardware: 4×NVIDIA A6000, DeepSpeed ZeRO-3 Distributed Training
