---
base_model: Qwen/Qwen3-VL-2B-Instruct
library_name: transformers
model_name: Qwen3-VL-2B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
licence: license
datasets:
- CATMuS/medieval
---

# Model Card for Qwen3-VL-2B-catmus-medieval

This model is a fine-tuned version of [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for transcribing line-level images of medieval manuscripts.
It was trained using [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

## Model Description

This vision-language model specializes in transcribing line-level images of medieval manuscripts. Given an image of a single line of manuscript text, it generates the corresponding transcription.

## Performance

The model was evaluated on 100 examples from the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset (test split).

### Metrics

| Metric (lower is better) | Base Model | Fine-tuned Model | Relative Reduction |
|--------|-----------|------------------|-------------|
| **Character Error Rate (CER)** | 1.0815 (108.15%) | 0.2779 (27.79%) | **74.30%** |
| **Word Error Rate (WER)** | 1.7386 (173.86%) | 0.7043 (70.43%) | **59.49%** |
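
The evaluation script is not part of this card, so the following is only a sketch of how figures like these are commonly computed. It assumes that CER and WER are the Levenshtein edit distance divided by the reference length (over characters and whitespace-separated words, respectively) and that the last column of the table is the relative reduction in error. All function names are illustrative and do not come from the training code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(base, tuned):
    """Fraction of the base error removed by fine-tuning."""
    return (base - tuned) / base


# Aggregate scores taken from the table above.
print(f"CER reduction: {relative_reduction(1.0815, 0.2779):.2%}")  # 74.30%
print(f"WER reduction: {relative_reduction(1.7386, 0.7043):.2%}")  # 59.49%
```

Under this definition an error rate can exceed 100%, as the base model's scores do, because insertions count as edits while the denominator is only the reference length.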

### Sample Predictions

Here are some example transcriptions comparing the base model and fine-tuned model:


**Example 1:**
- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** Paulus ad the Malomancis · iii.
- **Fine-tuned Model:** Paulꝰ ad thessalonensis .iii.

**Example 2:**
- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:**  acceptad mi humilde servicio, e dissipad, e plantad en el
- **Fine-tuned Model:** acceptad mi humilde seruicio, e dissipad, e plantad en el

**Example 3:**
- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:**  p mattheus illam dictoneum proa
- **Fine-tuned Model:** ꝑ mattheus illam dictione in ponat

**Example 4:**
- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** f. ligeq d uonear. eade h q q fama ferebat.
- **Fine-tuned Model:** f liges ꝗd uonear. eadẽ li ꝗq tanta ferebat᷑.

**Example 5:**
- **Reference:** a prima coniugatione ue
- **Base Model:** Grigimacopissagazione-ve
- **Fine-tuned Model:** a ꝑrũt̾tacõnueꝰatione. ne


## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load model and processor
base_model = "Qwen/Qwen3-VL-2B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-2B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
transcription = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(transcription)
```
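
The slicing step in the snippet above is needed because `generate` returns the prompt tokens followed by the newly generated ones. A toy illustration with plain lists follows; the token IDs are invented and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# One prompt of four tokens, followed by three newly generated tokens.
prompt_ids = [[101, 7, 8, 9]]
output_ids = [[101, 7, 8, 9, 42, 43, 2]]

print(trim_prompt(prompt_ids, output_ids))  # [[42, 43, 2]]
```

Without this step, the decoded string would begin with the chat template and prompt text rather than the transcription alone.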

## Use Cases

This model is designed for:
- Transcribing line-level images of medieval manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical Character Recognition (OCR) for specialized historical texts

## Training procedure

This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-2B-Instruct base model.

### Training Data

The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval), a dataset of line-level medieval manuscript images paired with text transcriptions.

### Training Configuration

- **Base Model**: Qwen/Qwen3-VL-2B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16
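
As a rough sketch, the settings above can be written as plain dictionaries whose keys mirror the corresponding `peft.LoraConfig` and `transformers.TrainingArguments` parameter names. The dictionary layout, the helper functions, and the `num_devices` default are assumptions made for illustration; the card does not state how many devices were used, and the actual training script may be organized differently.

```python
# Values copied from the configuration list; the layout is illustrative.
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

training_args = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-5,
    "fp16": True,
}


def effective_batch_size(args, num_devices=1):
    """Examples contributing to each optimizer step (num_devices is assumed)."""
    return (
        args["per_device_train_batch_size"]
        * args["gradient_accumulation_steps"]
        * num_devices
    )


def lora_scaling(cfg):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


print(effective_batch_size(training_args))  # 8
print(lora_scaling(lora_config))            # 2.0
```

On a single device, then, each optimizer step would see 8 line images, and the adapter update would be scaled by a factor of 2.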

### Framework versions

- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Limitations

- The model is specialized for line-level images of medieval manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model was trained on a single dataset and may require further fine-tuning for other manuscript collections

## Citations

If you use this model, please cite the base model and training framework:

### Qwen3-VL

```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}
```

### TRL (Transformer Reinforcement Learning)
    
```bibtex
@misc{vonwerra2022trl,
	title        = {{TRL: Transformer Reinforcement Learning}},
	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
	year         = 2020,
	journal      = {GitHub repository},
	publisher    = {GitHub},
	howpublished = {\url{https://github.com/huggingface/trl}}
}
```

---

*README generated automatically on 2025-10-24 10:49:05*