---
base_model: Qwen/Qwen3-VL-2B-Instruct
library_name: transformers
model_name: Qwen3-VL-2B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
datasets:
- CATMuS/medieval
---

# Model Card for Qwen3-VL-2B-catmus-medieval

This model is a fine-tuned version of [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for line-level transcription of medieval manuscripts.
It was trained using [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

## Model Description

This vision-language model specializes in transcribing text from images of individual medieval manuscript lines. Given an image of a line of manuscript text, the model generates the corresponding transcription.

## Performance

The model was evaluated on 100 examples from the test split of the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

### Metrics

Lower is better for both metrics; the last column gives the relative error reduction from the base model to the fine-tuned model.

| Metric | Base Model | Fine-tuned Model | Relative Reduction |
|--------|-----------|------------------|--------------------|
| **Character Error Rate (CER)** | 1.0815 (108.15%) | 0.2779 (27.79%) | 74.30% |
| **Word Error Rate (WER)** | 1.7386 (173.86%) | 0.7043 (70.43%) | 59.49% |
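
For reference, CER and WER can be computed with an error-rate library such as `jiwer` (an assumption; the card does not state which tool produced the numbers above):

```python
# A minimal sketch of the CER/WER computation, assuming the jiwer library;
# the actual evaluation script is not included in this card.
import jiwer

references = ["paulꝯ ad thessalonicenses .iii."]  # ground-truth transcriptions
predictions = ["Paulꝰ ad thessalonensis .iii."]   # model outputs

cer = jiwer.cer(references, predictions)  # character error rate, lower is better
wer = jiwer.wer(references, predictions)  # word error rate, lower is better
print(f"CER: {cer:.4f}  WER: {wer:.4f}")
```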

### Sample Predictions

Here are some example transcriptions comparing the base model and the fine-tuned model:

**Example 1:**
- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** Paulus ad the Malomancis · iii.
- **Fine-tuned Model:** Paulꝰ ad thessalonensis .iii.

**Example 2:**
- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:** acceptad mi humilde servicio, e dissipad, e plantad en el
- **Fine-tuned Model:** acceptad mi humilde seruicio, e dissipad, e plantad en el

**Example 3:**
- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:** p mattheus illam dictoneum proa
- **Fine-tuned Model:** ꝑ mattheus illam dictione in ponat

**Example 4:**
- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** f. ligeq d uonear. eade h q q fama ferebat.
- **Fine-tuned Model:** f liges ꝗd uonear. eadẽ li ꝗq tanta ferebat᷑.

**Example 5:**
- **Reference:** a prima coniugatione ue
- **Base Model:** Grigimacopissagazione-ve
- **Fine-tuned Model:** a ꝑrũt̾tacõnueꝰatione. ne

## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model and attach the LoRA adapter
base_model = "Qwen/Qwen3-VL-2B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-2B-catmus-medieval"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your line image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the chat message with the image and the transcription prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Tokenize the prompt and generate the transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(transcription)
```
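
If you need a standalone checkpoint, the adapter can be merged into the base weights using peft's standard `merge_and_unload`; the output directory below is an illustrative name, not a published artifact:

```python
# Optional: merge the LoRA adapter into the base model for standalone
# deployment (the quick-start above works without merging).
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen3-vl-2b-catmus-medieval-merged")
processor.save_pretrained("qwen3-vl-2b-catmus-medieval-merged")
```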

## Use Cases

This model is designed for:
- Transcribing line-level images of medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical Character Recognition (OCR) for specialized historical texts
## Training procedure

This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-2B-Instruct base model.

### Training Data

The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval), a dataset pairing images of individual medieval manuscript lines with their text transcriptions.
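
The dataset can be loaded directly from the Hub; a minimal sketch (the `split="train"` name is an assumption — check the dataset card for the exact configuration):

```python
# Load the CATMuS/medieval dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("CATMuS/medieval", split="train")  # split name assumed
print(ds)  # inspect the available columns (images and transcriptions)
```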
### Training Configuration

The key hyperparameters are listed below; a configuration sketch in code follows the list.

- **Base Model**: Qwen/Qwen3-VL-2B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
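
The following sketch shows how this configuration maps onto `peft` and TRL objects; argument names follow the current `LoraConfig`/`SFTConfig` APIs, and the `output_dir` is an illustrative assumption rather than the actual training script:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter configuration matching the values listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Trainer arguments matching the values listed above
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    optim="adamw_torch",
    fp16=True,
    output_dir="Qwen3-VL-2B-catmus-medieval",  # illustrative name
)
```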

### Framework versions

- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Limitations

- The model is specialized for line-level medieval manuscripts and may not perform well on other types of text or images
- Performance may vary with image quality, resolution, and handwriting style
- The model was trained on a single dataset and may require further fine-tuning for other manuscript collections

## Citations

If you use this model, please cite the base model and the training framework:

### Qwen3-VL

```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```

---

*README generated automatically on 2025-10-24 10:49:05*