---
base_model: unsloth/qwen2-vl-7b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2_vl
- trl
license: apache-2.0
language:
- en
datasets:
- unsloth/LaTeX_OCR
library_name: unsloth
model_name: Qwen2-VL-7B-Instruct with LoRA (Equation-to-LaTeX)
---
# Qwen2-VL: Equation Image β†’ LaTeX with LoRA + Unsloth
Fine-tune [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), a Vision-Language model, to convert equation images into LaTeX code using the [Unsloth](https://github.com/unslothai/unsloth) framework and LoRA adapters.
## Project Objective
Train an Equation-to-LaTeX transcriber using a pre-trained multimodal model. The model learns to read rendered math equations and generate corresponding LaTeX.
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/666c3d6489e21df7d4a02805/zVB5_lPq5v8EeHRbpSLtE.gif)
---
[Source code on GitHub](https://github.com/Mayankpratapsingh022/Finetuning-LLMs/tree/main/Qwen_2_VL_Multimodel_LLM_Finetuning)
## Dataset
- [`unsloth/LaTeX_OCR`](https://huggingface.co/datasets/unsloth/LaTeX_OCR) – Image-LaTeX pairs of printed mathematical expressions.
- ~68K train / 7K test samples.
- Example:
- Image: ![image](https://github.com/user-attachments/assets/e0d87582-7ba4-4e59-8f00-fd8f6c0f862d)
- Target: `R - { \frac { 1 } { 2 } } ( \nabla \Phi ) ^ { 2 } - { \frac { 1 } { 2 } } \nabla ^ { 2 } \Phi = 0 .`
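Before training, each image-LaTeX pair is wrapped into a chat-style sample so the model sees the image alongside an instruction. A minimal sketch of that conversion, assuming the dataset's `image` and `text` columns and an instruction string of our choosing:

```python
def to_conversation(sample, instruction="Write the LaTeX representation for this image."):
    """Wrap one image/LaTeX pair into the chat format a vision SFT trainer expects."""
    return {"messages": [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": instruction},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["text"]},
        ]},
    ]}

# Stand-in sample for illustration (the real dataset yields a PIL image in "image").
conv = to_conversation({"image": "equation.png", "text": r"\frac{1}{2}"})
```

Mapping this function over the dataset (e.g. with a list comprehension) produces the conversation list passed to the trainer.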
---
## Tech Stack
| Component | Description |
|----------|-------------|
| Qwen2-VL | Multimodal vision-language model (7B) by Alibaba |
| Unsloth | Fast & memory-efficient training |
| LoRA (via PEFT) | Parameter-efficient fine-tuning |
| 4-bit Quantization | Enabled by `bitsandbytes` |
| Datasets, HF Hub | For loading/saving models & datasets |
---
## Setup
```bash
pip install unsloth unsloth_zoo peft trl datasets accelerate bitsandbytes xformers==0.0.29.post3 sentencepiece protobuf hf_transfer triton
```
---
## Training (Jupyter Notebook)
Refer to: `Qwen2__VL_image_to_latext.ipynb`
Steps:
1. Load Qwen2-VL (`load_in_4bit=True`)
2. Load dataset via `datasets.load_dataset("unsloth/LaTeX_OCR")`
3. Apply LoRA adapters
4. Use `SFTTrainer` from Unsloth to fine-tune
5. Save adapters or merged model
LoRA rank used: `r=16`
LoRA alpha: `16`
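The steps above can be sketched as follows. This is a minimal outline, assuming Unsloth's `FastVisionModel` API; the checkpoint name is taken from the card header, and it requires a CUDA GPU to actually run:

```python
from unsloth import FastVisionModel

# Step 1: load the 4-bit base model (checkpoint from the card header).
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/qwen2-vl-7b-instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)

# Step 3: attach LoRA adapters with the rank/alpha listed above.
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    finetune_vision_layers=True,
    finetune_language_layers=True,
)
```

Steps 2, 4, and 5 (loading the dataset, running `SFTTrainer`, and saving) follow the notebook.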
---
## Inference
```python
from unsloth import FastVisionModel
from PIL import Image

model, tokenizer = FastVisionModel.from_pretrained("output")  # saved fine-tuned model
FastVisionModel.for_inference(model)  # switch to inference mode

image = Image.open("equation.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Write the LaTeX representation for this image."},
]}]
text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, text, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
---
## Evaluation
- Exact Match Accuracy: ~90%+
- Strong generalization to complex equations and symbols
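A simple way to compute exact-match accuracy is to compare predictions and references after stripping whitespace, since the dataset spaces out tokens (`{ \frac { 1 } { 2 } }`) while a model may emit compact LaTeX. A hedged sketch (note the simplification: removing all whitespace would conflate e.g. `\alpha x` and `\alphax`, so a token-aware comparison would be stricter):

```python
import re

def normalize_latex(s: str) -> str:
    """Remove all whitespace so spacing differences don't count as errors."""
    return re.sub(r"\s+", "", s)

def exact_match(preds, refs) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    hits = sum(normalize_latex(p) == normalize_latex(r) for p, r in zip(preds, refs))
    return hits / len(refs)

print(exact_match([r"\frac { 1 } { 2 }"], [r"\frac{1}{2}"]))  # 1.0
```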
---
## Results
| Metric | Value |
|------------------|---------------|
| Exact Match | ~90–92% |
| LoRA Params | <1% of model parameters |
| Training Time | ~20–40 mins on A100 |
| Model Size | 7B (4-bit) |
---
## Future Work
- Extend to handwritten formulas (e.g., CROHME dataset)
- Add LaTeX syntax validation or auto-correction
- Build a lightweight Gradio/Streamlit interface for demo
---
## Folder Structure
```
.
β”œβ”€β”€ Qwen2__VL_image_to_latext.ipynb # Training Notebook
β”œβ”€β”€ output/ # Saved fine-tuned model
└── README.md
```
---