---
base_model: Qwen/Qwen3-VL-2B-Instruct
library_name: transformers
model_name: Qwen3-VL-2B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
datasets:
- CATMuS/medieval
---
# Model Card for Qwen3-VL-2B-catmus-medieval
This model is a fine-tuned version of [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) that transcribes single lines of medieval manuscript text from images.
It has been trained using [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.
## Model Description
This vision-language model specializes in transcribing text from line-level images of medieval manuscripts. Given an image of a single manuscript line, it generates the corresponding transcription.
## Performance
The model was evaluated on 100 examples from the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset (test split).
### Metrics
| Metric | Base Model | Fine-tuned Model | Relative Improvement |
|--------|-----------|------------------|-------------|
| **Character Error Rate (CER)** | 1.0815 (108.15%) | 0.2779 (27.79%) | **+74.30%** |
| **Word Error Rate (WER)** | 1.7386 (173.86%) | 0.7043 (70.43%) | **+59.49%** |
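CER and WER are edit (Levenshtein) distances normalized by the length of the reference, at the character and word level respectively. A minimal sketch of how such scores can be computed (the exact evaluation script used for this card is not shown here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Note that error rates above 1.0 (as for the base model) are possible when the hypothesis contains more edits than the reference has characters or words.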
### Sample Predictions
Here are some example transcriptions comparing the base model and fine-tuned model:
**Example 1:**
- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** Paulus ad the Malomancis · iii.
- **Fine-tuned Model:** Paulꝰ ad thessalonensis .iii.
**Example 2:**
- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:** acceptad mi humilde servicio, e dissipad, e plantad en el
- **Fine-tuned Model:** acceptad mi humilde seruicio, e dissipad, e plantad en el
**Example 3:**
- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:** p mattheus illam dictoneum proa
- **Fine-tuned Model:** ꝑ mattheus illam dictione in ponat
**Example 4:**
- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** f. ligeq d uonear. eade h q q fama ferebat.
- **Fine-tuned Model:** f liges ꝗd uonear. eadẽ li ꝗq tanta ferebat᷑.
**Example 5:**
- **Reference:** a prima coniugatione ue
- **Base Model:** Grigimacopissagazione-ve
- **Fine-tuned Model:** a ꝑrũt̾tacõnueꝰatione. ne
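The reference transcriptions use special medieval glyphs (ꝯ, ꝗ, ẽ, …) that can be encoded either as precomposed characters or as base letters plus combining marks. One suggestion, assuming no canonicalization is already applied (the card does not say whether the evaluation did this), is to run Unicode NFC normalization on both reference and prediction before scoring, so visually identical strings are not penalized:

```python
import unicodedata

def normalize_transcription(text: str) -> str:
    """Apply Unicode NFC so precomposed and combining-mark encodings
    of the same glyph compare as equal."""
    return unicodedata.normalize("NFC", text)

# "e" + COMBINING TILDE (U+0303) and precomposed "ẽ" (U+1EBD) are
# distinct byte sequences but become identical after NFC:
decomposed = "eade\u0303"   # eadẽ, combining mark
precomposed = "ead\u1ebd"   # eadẽ, precomposed
```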
## Quick start
```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model and attach the LoRA adapter
base_model = "Qwen/Qwen3-VL-2B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-2B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the chat message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate the transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcription)
```
## Use Cases
This model is designed for:
- Transcribing line-level medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical Character Recognition (OCR) for specialized historical texts
## Training procedure
This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-2B-Instruct base model.
### Training Data
The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval),
a dataset containing images of line-level medieval manuscripts with corresponding text transcriptions.
### Training Configuration
- **Base Model**: Qwen/Qwen3-VL-2B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
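For a linear layer of shape (d_out, d_in), LoRA with rank r trains only two small matrices, A (r × d_in) and B (d_out × r), whose product (scaled by alpha/r) is added to the frozen weight; the adapter therefore contributes r · (d_in + d_out) trainable parameters per target matrix. A rough illustration with hypothetical layer dimensions (the actual Qwen3-VL-2B projection shapes are not listed in this card):

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer:
    A is (r x d_in) and B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical example: a 2048 x 2048 projection with r=16 adds
# 16 * (2048 + 2048) = 65,536 trainable parameters, against
# 2048 * 2048 = 4,194,304 frozen ones (~1.6% of the layer).
```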
### Framework versions
- TRL: 0.23.0
- Transformers: 4.57.1
- Pytorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1
## Limitations
- The model is specialized for line-level medieval manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model has been trained on a specific dataset and may require fine-tuning for other manuscript collections
## Citations
If you use this model, please cite the base model and training framework:
### Qwen3-VL
```bibtex
@article{Qwen3-VL,
title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
author={Qwen Team},
journal={arXiv preprint},
year={2025}
}
```
### TRL (Transformer Reinforcement Learning)
```bibtex
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
---
*README generated automatically on 2025-10-24 10:49:05*