# T5-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)
This model is a fine-tuned version of google-t5/t5-base for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".
## 🎯 Model Description
- Base Model: google-t5/t5-base
- Task: OCR error correction
- Training Strategy:
  - Train/Val: synthetic OCR data (1,020 samples with GPT-4-generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- Best Checkpoint: Epoch 16
- Validation CER: 13.93%
- Validation WER: 22.52%
## 📊 Performance
Evaluated on real historical OCR text from "The Vampyre":
| Metric | Score |
|---|---|
| Character Error Rate (CER) | 13.93% |
| Word Error Rate (WER) | 22.52% |
| Exact Match | 1.97% |
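CER and WER are edit-distance metrics: the number of character-level (respectively word-level) edits needed to turn the model output into the reference, divided by the reference length. As a hedged sketch, they can be computed with the `jiwer` library; the card does not specify which evaluation script was actually used:

```python
# Minimal sketch of CER/WER computation with the `jiwer` library (an
# assumption; the card does not specify the actual evaluation script).
import jiwer

reference = "The ancient trees whispered softly."
hypothesis = "The anci3nt tre55 whispered so7tly."

print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")  # character error rate
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # word error rate
```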
## 🚀 Usage
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
### Using Pipeline
```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/t5-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]["generated_text"]
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```
## 📚 Training Details
### Training Data
- Synthetic Data (Train/Val): 1,020 samples, generated using GPT-4 with 20 corruption strategies and split as sketched below
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
- Real Data (Test): 300 samples of OCR text from "The Vampyre"
- No data leakage: the test set contains only real OCR data, never seen during training
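For reference, the 85/15 split could be reproduced with the `datasets` library; the file name, format, and seed below are hypothetical placeholders, not the actual preprocessing script:

```python
# Minimal sketch of the 85/15 train/validation split using `datasets`.
# "synthetic_ocr.json" and seed=42 are hypothetical placeholders.
from datasets import load_dataset

ds = load_dataset("json", data_files="synthetic_ocr.json", split="train")
split = ds.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% val
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # ~867 / ~153 for 1,020 samples
```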
### Training Configuration
- Epochs: 20 (best model at epoch 16)
- Batch Size: 16
- Learning Rate: 1e-4
- Optimizer: AdamW with weight decay 0.01
- Scheduler: Linear with warmup (10% warmup steps)
- Max Sequence Length: 512 tokens
- Early Stopping: Monitored validation CER
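As a rough guide, here is how these settings could be expressed with Hugging Face `Seq2SeqTrainingArguments`. This is a hedged sketch rather than the actual training script; `output_dir` and the `"cer"` metric key are assumptions:

```python
# Hedged sketch mapping the listed hyperparameters onto Hugging Face
# Seq2SeqTrainingArguments; not the actual training script.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="t5-ocr-correction",  # hypothetical placeholder
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,               # AdamW (the default optimizer) with weight decay
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # linear schedule with 10% warmup steps
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # keeps the best checkpoint (epoch 16 here)
    metric_for_best_model="cer",     # early stopping monitors validation CER
    greater_is_better=False,         # lower CER is better
    predict_with_generate=True,
    generation_max_length=512,
)
```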
### Corruption Strategies (Training Data)
The synthetic training data included these OCR error types:
- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
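The actual corruptions were produced by GPT-4. Purely as an illustration of a few of the error types above, a rule-based corruptor might look like this (all names and rates are hypothetical):

```python
# Illustrative rule-based corruptor for a few of the error types above.
# The actual training data was generated with GPT-4; this is NOT that
# pipeline, just a hypothetical sketch.
import random

# Visually similar substitutions, including the long s (ſ)
SUBSTITUTIONS = {"s": "ſ", "l": "1", "o": "0", "e": "3", "i": "1"}

def corrupt(text: str, rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in SUBSTITUTIONS and rng.random() < rate:
            out.append(SUBSTITUTIONS[ch.lower()])  # character substitution
        elif ch == " " and rng.random() < rate / 2:
            continue                               # word-boundary error (dropped space)
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The vampire rose silently from the ancient crypt.", rate=0.3))
```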
## 📈 Training Progress
The model showed consistent improvement:
- Epoch 5: CER 26.87%
- Epoch 10: CER 15.20%
- Epoch 16: CER 13.93% (best checkpoint)
- Performance plateaued after epoch 16
## 💡 Use Cases
This model is particularly effective for:
- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts
## ⚠️ Limitations
- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (see the chunking sketch below for longer texts)
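For longer inputs, one simple workaround is to correct the text chunk by chunk. A minimal sketch using naive sentence splitting, reusing `tokenizer` and `model` from the Quick Start; a real pipeline would want a proper sentence tokenizer:

```python
# Minimal sketch for texts longer than 512 tokens: correct sentence by
# sentence. Splitting on ". " is naive; assumes `tokenizer` and `model`
# are loaded as in the Quick Start above.
def correct_long_text(text: str, max_length: int = 512) -> str:
    corrected = []
    for sentence in text.split(". "):
        ids = tokenizer(sentence, return_tensors="pt",
                        max_length=max_length, truncation=True).input_ids
        out = model.generate(ids, max_length=max_length)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return ". ".join(corrected)
```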
## 🔬 Evaluation Examples
| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |
## 📖 Citation
If you use this model in your research, please cite:
```bibtex
@misc{t5-vampyre-ocr,
  author       = {Ejaz},
  title        = {T5 Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction}}
}
```
## 👤 Author
Ejaz - Master's Student in AI and Robotics
## 📄 License
Apache 2.0
## 🙏 Acknowledgments
- Base model: google-t5/t5-base
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4