---
base_model: Qwen/Qwen3-VL-2B-Instruct
library_name: transformers
model_name: Qwen3-VL-2B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
datasets:
- CATMuS/medieval
---
# Model Card for Qwen3-VL-2B-catmus-medieval
This model is a fine-tuned version of [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for line-level transcription of medieval manuscripts from images.
It has been trained using [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.
## Model Description
This vision-language model specializes in transcribing text from line-level images of medieval manuscripts. Given an image of a single line of manuscript text, the model generates the corresponding transcription.
## Performance
The model was evaluated on 100 examples from the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset (test split).
### Metrics
| Metric | Base Model | Fine-tuned Model | Relative Improvement |
|--------|------------|------------------|----------------------|
| **Character Error Rate (CER)** | 1.0815 (108.15%) | 0.2779 (27.79%) | **74.30%** |
| **Word Error Rate (WER)** | 1.7386 (173.86%) | 0.7043 (70.43%) | **59.49%** |

Lower is better for both metrics; the last column reports the relative reduction in error rate from the base to the fine-tuned model.
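As an illustration of how such scores can be computed, here is a minimal sketch assuming the `jiwer` library; the example strings are taken from the sample predictions below, and this is not necessarily the exact evaluation script used for the numbers above:

```python
# Minimal CER/WER sketch, assuming the jiwer library. Illustrative
# only; not necessarily the script behind the table above.
import jiwer

references = ["paulꝯ ad thessalonicenses .iii."]  # ground-truth lines
predictions = ["Paulꝰ ad thessalonensis .iii."]   # model outputs

print(f"CER: {jiwer.cer(references, predictions):.4f}")
print(f"WER: {jiwer.wer(references, predictions):.4f}")
```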
### Sample Predictions
Here are some example transcriptions comparing the base model and fine-tuned model:
**Example 1:**

- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** Paulus ad the Malomancis · iii.
- **Fine-tuned Model:** Paulꝰ ad thessalonensis .iii.

**Example 2:**

- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:** acceptad mi humilde servicio, e dissipad, e plantad en el
- **Fine-tuned Model:** acceptad mi humilde seruicio, e dissipad, e plantad en el

**Example 3:**

- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:** p mattheus illam dictoneum proa
- **Fine-tuned Model:** ꝑ mattheus illam dictione in ponat

**Example 4:**

- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** f. ligeq d uonear. eade h q q fama ferebat.
- **Fine-tuned Model:** f liges ꝗd uonear. eadẽ li ꝗq tanta ferebat᷑.

**Example 5:**

- **Reference:** a prima coniugatione ue
- **Base Model:** Grigimacopissagazione-ve
- **Fine-tuned Model:** a ꝑrũt̾tacõnueꝰatione. ne
## Quick start
```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load model and processor
base_model = "Qwen/Qwen3-VL-2B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-2B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcription)
```
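If you'd rather ship a single checkpoint without a `peft` dependency at inference time, the adapter can be merged into the base weights. A minimal sketch, assuming the snippet above has already run; the output directory name is illustrative:

```python
# Merge the LoRA adapter into the base model weights (peft's
# merge_and_unload) and save a standalone checkpoint. The output
# directory "qwen3-vl-2b-catmus-merged" is a hypothetical name.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen3-vl-2b-catmus-merged")
processor.save_pretrained("qwen3-vl-2b-catmus-merged")
```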
## Use Cases
This model is designed for:
- Transcribing individual lines of text from medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical Character Recognition (OCR) for specialized historical texts
## Training procedure
This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-2B-Instruct base model.
### Training Data
The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval), a dataset of line-level images from medieval manuscripts paired with their text transcriptions.
### Training Configuration
- **Base Model**: Qwen/Qwen3-VL-2B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
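As a rough sketch of how the configuration above maps onto `peft` and TRL objects (names and defaults here are assumptions, not the actual training script):

```python
# Illustrative only: mirrors the hyperparameters listed above; the real
# training script (data collation, trainer wiring) is not shown here.
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
)

training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    fp16=True,  # mixed precision per the configuration above
)
# Both configs would then be passed to trl.SFTTrainer along with the
# base model, the processor, and the CATMuS/medieval training split.
```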
### Framework versions
- TRL: 0.23.0
- Transformers: 4.57.1
- Pytorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1
## Limitations
- The model is specialized for line-level transcription of medieval manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model has been trained on a specific dataset and may require fine-tuning for other manuscript collections
## Citations
If you use this model, please cite the base model and training framework:
### Qwen3-VL
```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025}
}
```
### TRL (Transformer Reinforcement Learning)
```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```
---
*README generated automatically on 2025-10-24 10:49:05*