Mayank022
/

qwen2-vl-finetuned-Image-to-LaTeX

text-generation-inference

Model card Files Files and versions

qwen2-vl-finetuned-Image-to-LaTeX / README.md

Mayank022's picture

updated the model card

e9fb5a1 verified 8 months ago

|

history blame contribute delete

3.42 kB

	---
	base_model: unsloth/qwen2-vl-7b-instruct-unsloth-bnb-4bit
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- qwen2_vl
	- trl
	license: apache-2.0
	language:
	- en
	datasets:
	- unsloth/LaTeX_OCR
	library_name: unsloth
	model_name: Qwen2-VL-7B-Instruct with LoRA (Equation-to-LaTeX)
	---

	# Qwen2-VL: Equation Image → LaTeX with LoRA + Unsloth




	Fine-tune [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), a Vision-Language model, to convert equation images into LaTeX code using the [Unsloth](https://github.com/unslothai/unsloth) framework and LoRA adapters.


	## Project Objective

	Train an Equation-to-LaTeX transcriber using a pre-trained multimodal model. The model learns to read rendered math equations and generate corresponding LaTeX.


	![image/gif](https://cdn-uploads.huggingface.co/production/uploads/666c3d6489e21df7d4a02805/zVB5_lPq5v8EeHRbpSLtE.gif)


	---
	[Source code on github ](https://github.com/Mayankpratapsingh022/Finetuning-LLMs/tree/main/Qwen_2_VL_Multimodel_LLM_Finetuning)

	## Dataset

	- [`unsloth/LaTeX_OCR`](https://huggingface.co/datasets/unsloth/LaTeX_OCR) – Image-LaTeX pairs of printed mathematical expressions.
	- ~68K train / 7K test samples.
	- Example:
	- Image: ![image](https://github.com/user-attachments/assets/e0d87582-7ba4-4e59-8f00-fd8f6c0f862d)
	- Target: `R - { \frac { 1 } { 2 } } ( \nabla \Phi ) ^ { 2 } - { \frac { 1 } { 2 } } \nabla ^ { 2 } \Phi = 0 .`

	---

	## Tech Stack

	\| Component \| Description \|
	\|----------\|-------------\|
	\| Qwen2-VL \| Multimodal vision-language model (7B) by Alibaba \|
	\| Unsloth \| Fast & memory-efficient training \|
	\| LoRA (via PEFT) \| Parameter-efficient fine-tuning \|
	\| 4-bit Quantization \| Enabled by `bitsandbytes` \|
	\| Datasets, HF Hub \| For loading/saving models & datasets \|

	---

	## Setup

	```bash
	pip install unsloth unsloth_zoo peft trl datasets accelerate bitsandbytes xformers==0.0.29.post3 sentencepiece protobuf hf_transfer triton
	```

	---

	## Training (Jupyter Notebook)

	Refer to: `Qwen2__VL_image_to_latext.ipynb`

	Steps:
	1. Load Qwen2-VL (`load_in_4bit=True`)
	2. Load dataset via `datasets.load_dataset("unsloth/LaTeX_OCR")`
	3. Apply LoRA adapters
	4. Use `SFTTrainer` from Unsloth to fine-tune
	5. Save adapters or merged model

	LoRA rank used: `r=16`
	LoRA alpha: `16`

	---

	## Inference

	```python
	from PIL import Image
	image = Image.open("equation.png")
	prompt = "Write the LaTeX representation for this image."
	inputs = tokenizer(image, tokenizer.apply_chat_template([("user", prompt)], add_generation_prompt=True), return_tensors="pt").to("cuda")
	output = model.generate(**inputs, max_new_tokens=128)
	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	---

	## Evaluation

	- Exact Match Accuracy: ~90%+
	- Strong generalization to complex equations and symbols

	---

	## Results

	\| Metric \| Value \|
	\|------------------\|---------------\|
	\| Exact Match \| ~90–92% \|
	\| LoRA Params \| ~<1% of model \|
	\| Training Time \| ~20–40 mins on A100 \|
	\| Model Size \| 7B (4-bit) \|

	---

	## Future Work

	- Extend to handwritten formulas (e.g., CROHME dataset)
	- Add LaTeX syntax validation or auto-correction
	- Build a lightweight Gradio/Streamlit interface for demo

	---

	## Folder Structure

	```
	.
	├── Qwen2__VL_image_to_latext.ipynb # Training Notebook
	├── output/ # Saved fine-tuned model
	└── README.md
	```

	---