T5-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of google-t5/t5-base for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".

🎯 Model Description

  • Base Model: google-t5/t5-base
  • Task: OCR error correction
  • Training Strategy:
    • Train/Val: Synthetic OCR data (1020 samples with GPT-4-generated errors)
    • Test: Real OCR data from "The Vampyre" (300 samples)
  • Best Checkpoint: Epoch 16
  • Validation CER: 13.93%
  • Validation WER: 22.52%

πŸ“Š Performance

Evaluated on real historical OCR text from "The Vampyre":

Metric                        Score
Character Error Rate (CER)    13.93%
Word Error Rate (WER)         22.52%
Exact Match                   1.97%
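
The card does not ship an evaluation script, but CER and WER of this kind can be reproduced with the jiwer library; the strings below are placeholders for illustration, not samples from the test set:

import jiwer

# Placeholder strings; the real test set is 300 OCR/ground-truth
# pairs from "The Vampyre".
reference  = "The vampyre rose from the ancient tomb."
hypothesis = "The vampyre r0se frorn the anci3nt tomb."

cer = jiwer.cer(reference, hypothesis)  # character-level edit distance / reference length
wer = jiwer.wer(reference, hypothesis)  # word-level edit distance / reference word count
print(f"CER: {cer:.2%}  WER: {wer:.2%}")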

πŸš€ Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original:  {ocr_text}")
print(f"Corrected: {corrected_text}")

Using Pipeline

from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/t5-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"

πŸŽ“ Training Details

Training Data

  • Synthetic Data (Train/Val): 1020 samples
    • 85% training (867 samples)
    • 15% validation (153 samples)
    • Generated using GPT-4 with 20 corruption strategies
  • Real Data (Test): 300 samples from "The Vampyre" OCR text
  • No data leakage: the test set contains only real OCR data, which the model never saw during training
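
A minimal sketch of the 85/15 split described above, assuming the pairs are held in a Python list; the two example pairs and the random seed are illustrative, not taken from the actual dataset:

from sklearn.model_selection import train_test_split

# Stand-in for the 1020 (ocr_text, clean_text) pairs; entries are
# examples from this card, not the real data.
synthetic_pairs = [
    ("Th1s 1s an 0CR err0r", "This is an OCR error"),
    ("The anci3nt tre55", "The ancient trees"),
]

# 85% train / 15% validation; the seed is an assumption.
train_pairs, val_pairs = train_test_split(synthetic_pairs, test_size=0.15, random_state=42)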

Training Configuration

  • Epochs: 20 (best model at epoch 16)
  • Batch Size: 16
  • Learning Rate: 1e-4
  • Optimizer: AdamW with weight decay 0.01
  • Scheduler: Linear with warmup (10% warmup steps)
  • Max Sequence Length: 512 tokens
  • Early Stopping: Monitored validation CER
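
A sketch of how this configuration maps onto Hugging Face Seq2SeqTrainingArguments; the original training script is not published, so any argument not listed above (output path, save/eval strategy, metric wiring) is an assumption:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-vampyre-ocr",      # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,                # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # 10% warmup steps
    eval_strategy="epoch",            # "evaluation_strategy" on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,      # retains the best checkpoint (epoch 16 here)
    metric_for_best_model="cer",      # assumes a compute_metrics that returns "cer"
    greater_is_better=False,
    predict_with_generate=True,
)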

Corruption Strategies (Training Data)

The synthetic training data included these OCR error types:

  • Character substitutions (visual similarity)
  • Missing/extra characters
  • Word boundary errors
  • Case errors
  • Punctuation errors
  • Long s (ΕΏ) substitutions
  • Historical typography errors
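
As a toy illustration of two of these error types (visually similar substitutions and the long s), here is a small corruption function; the actual training data was generated by GPT-4 with 20 strategies, so this only sketches the idea:

import random

# Visually similar character substitutions (a small subset for illustration).
VISUAL_SUBS = {"o": "0", "i": "1", "l": "1", "e": "3", "s": "5"}

def corrupt(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Randomly inject two of the OCR error types listed above."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < rate:
            out.append("ſ")                  # long s substitution
        elif ch in VISUAL_SUBS and rng.random() < rate:
            out.append(VISUAL_SUBS[ch])      # visually similar character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("the whispering trees of the ancient forest"))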

πŸ“ˆ Training Progress

The model showed consistent improvement:

  • Epoch 5: CER 26.87%
  • Epoch 10: CER 15.20%
  • Epoch 16: CER 13.93% ⭐ (Best)
  • Plateau after epoch 16

πŸ’‘ Use Cases

This model is particularly effective for:

  • Correcting OCR errors in historical documents
  • Post-processing digitized manuscripts
  • Cleaning text from scanned historical books
  • Literary text restoration
  • Academic research on historical texts

⚠️ Limitations

  • Optimized for English historical texts
  • Best performance on texts similar to 19th-century literature
  • May struggle with extremely degraded or non-standard OCR
  • Maximum input length: 512 tokens
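
One way around the 512-token limit (a workaround I am assuming, not a documented feature of this model) is to correct long documents sentence by sentence and rejoin the outputs:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")

def correct_long_text(text: str) -> str:
    # Naive split on periods; a real sentence splitter (e.g. nltk) would be
    # more robust for historical punctuation.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    corrected = []
    for sentence in sentences:
        ids = tokenizer(sentence, return_tensors="pt",
                        max_length=512, truncation=True).input_ids
        output = model.generate(ids, max_length=512)
        corrected.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return " ".join(corrected)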

πŸ”¬ Evaluation Examples

Original OCR               Corrected Output
"Th1s 1s an 0CR err0r"     "This is an OCR error"
"The anci3nt tre55"        "The ancient trees"
"bl0omiNg floweRs"         "blooming flowers"

πŸ“š Citation

If you use this model in your research, please cite:

@misc{t5-vampyre-ocr,
  author = {Ejaz},
  title = {T5 Base OCR Error Correction for Historical Texts},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction}}
}

πŸ‘€ Author

Ejaz - Master's Student in AI and Robotics

πŸ“„ License

Apache 2.0

πŸ™ Acknowledgments

  • Base model: google-t5/t5-base
  • Training data: "The Vampyre" by John William Polidori
  • Synthetic data generation: GPT-4
