---
base_model: Qwen/Qwen3-VL-8B-Instruct
library_name: transformers
model_name: Qwen3-VL-8B-catmus-medieval
tags:
- generated_from_trainer
- sft
- trl
- vision-language
- ocr
- transcription
- medieval
- latin
- manuscript
licence: license
datasets:
- CATMuS/medieval
---

# Model Card for Qwen3-VL-8B-catmus-medieval

This model is a fine-tuned version of [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) for transcribing medieval Latin manuscripts from images. It was trained with [TRL](https://github.com/huggingface/trl) on the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

## Model Description

This vision-language model specializes in transcribing text from images of medieval Latin manuscripts. Given an image of manuscript text, the model generates the corresponding transcription.

## Performance

The model was evaluated on 100 examples from the test split of the [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval) dataset.

### Metrics

Lower error rates are better. The last column gives the relative reduction in error compared with the base model.

| Metric | Base Model | Fine-tuned Model | Relative improvement |
|--------|-----------|------------------|-------------|
| **Character Error Rate (CER)** | 0.3778 (37.78%) | 0.1997 (19.97%) | **+47.14%** |
| **Word Error Rate (WER)** | 0.8300 (83.00%) | 0.5457 (54.57%) | **+34.25%** |

### Sample Predictions

Here are some example transcriptions comparing the base model and the fine-tuned model:

**Example 1:**
- **Reference:** paulꝯ ad thessalonicenses .iii.
- **Base Model:** paul9adthellalomconceB·iii·
- **Fine-tuned Model:** paulꝰ ad thessalonicenses .iii.

**Example 2:**
- **Reference:** acceptad mi humilde seruicio. e dissipad. e plantad en el
- **Base Model:** acceptad mi humilde servicio, è dissipad, è plantat en el
- **Fine-tuned Model:** acceptad mi humilde seruicio. e dissipad. e plantad en el

**Example 3:**
- **Reference:** ꝙ mattheus illam dictionem ponat
- **Base Model:** q mattheus illam dictionem ponat
- **Fine-tuned Model:** ꝙ mattheus illam dictiõnem ponat

**Example 4:**
- **Reference:** Elige ꝗd uoueas. eadẽ ħ ꝗꝗ sama ferebat.
- **Base Model:** fuge quoniam cade hic quia tama ferebar.
- **Fine-tuned Model:** Fuge qd̾ uoneas. eadẽ ħ ꝗꝗ sana ferebat:

**Example 5:**
- **Reference:** a prima coniugatione ue
- **Base Model:** aprimaconiugazioneue
- **Fine-tuned Model:** a prima coniugatione ue

## Quick start

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load the base model, then attach the LoRA adapter on top of it
base_model = "Qwen/Qwen3-VL-8B-Instruct"
adapter_model = "small-models-for-glam/Qwen3-VL-8B-catmus"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)

# Load your image
image = Image.open("path/to/your/manuscript_image.jpg")

# Prepare the message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Transcribe the text shown in this image."},
        ],
    },
]

# Generate transcription
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

transcription = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(transcription)
```

## Use Cases

This model is designed for:
- Transcribing medieval Latin manuscripts
- Digitizing historical manuscripts
- Supporting historical research and archival work
- Optical Character Recognition (OCR) for specialized historical texts

## Training procedure

This model was fine-tuned using Supervised Fine-Tuning (SFT) with LoRA adapters on the Qwen3-VL-8B-Instruct base model.

### Training Data

The model was trained on [CATMuS/medieval](https://huggingface.co/datasets/CATMuS/medieval), a dataset containing images of medieval Latin manuscripts with corresponding text transcriptions.

### Training Configuration

- **Base Model**: Qwen/Qwen3-VL-8B-Instruct
- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA
- **LoRA Configuration**:
  - Rank (r): 16
  - Alpha: 32
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - Dropout: 0.1
- **Training Arguments**:
  - Epochs: 3
  - Batch size per device: 2
  - Gradient accumulation steps: 4
  - Learning rate: 5e-05
  - Optimizer: AdamW
  - Mixed precision: FP16

### Framework versions

- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Limitations

- The model is specialized for medieval Latin manuscripts and may not perform well on other types of text or images
- Performance may vary depending on image quality, resolution, and handwriting style
- The model was trained on a single dataset and may require further fine-tuning for other manuscript collections

## Citations

If you use this model, please cite the base model and training framework:

### Qwen3-VL

```bibtex
@article{Qwen3-VL,
  title={Qwen3-VL: Large Vision Language Models Pretrained on Massive Data},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```

---

*README generated automatically on 2025-10-24 10:40:41*
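
The card does not say which tool produced the CER and WER figures, or whether they were averaged per example or computed over the whole test subset. As a minimal sketch, assuming the standard Levenshtein-based definitions, the following pure-Python code shows how such error rates and the improvement percentages in the Metrics table can be computed. The helper names (`levenshtein`, `cer`, `wer`, `relative_improvement`) are illustrative and are not part of the model's evaluation code.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def relative_improvement(base, tuned):
    """Relative reduction in error rate, in percent."""
    return (base - tuned) / base * 100


# Example 5 from the Sample Predictions section
reference = "a prima coniugatione ue"
base_pred = "aprimaconiugazioneue"
tuned_pred = "a prima coniugatione ue"

print(f"base  CER={cer(reference, base_pred):.4f} WER={wer(reference, base_pred):.4f}")
print(f"tuned CER={cer(reference, tuned_pred):.4f} WER={wer(reference, tuned_pred):.4f}")

# Reproduce the improvement percentages from the reported aggregate rates
print(f"CER improvement: {relative_improvement(0.3778, 0.1997):.2f}%")
print(f"WER improvement: {relative_improvement(0.8300, 0.5457):.2f}%")
```

Because the reference transcriptions keep abbreviation marks such as ꝯ and ꝗ, a character-level comparison like this counts a Unicode code point mismatch as an error even when the glyphs look similar; whether the reported scores applied any normalization beforehand is not stated in the card.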