T5-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of google-t5/t5-base for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".

🎯 Model Description

  • Base Model: google-t5/t5-base
  • Task: OCR error correction
  • Training Strategy:
    • Train/Val: Synthetic OCR data (1020 samples with GPT-4-generated errors)
    • Test: Real OCR data from "The Vampyre" (300 samples)
  • Best Checkpoint: Epoch 16
  • Validation CER: 13.93%
  • Validation WER: 22.52%

πŸ“Š Performance

Evaluated on real historical OCR text from "The Vampyre":

Metric                        Score
Character Error Rate (CER)    13.93%
Word Error Rate (WER)         22.52%
Exact Match                   1.97%
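
The card does not ship an evaluation script, but CER and WER of this kind can be reproduced with the jiwer library; the strings below are placeholders for illustration, not samples from the test set:

import jiwer

# Placeholder strings; the real test set is 300 OCR/ground-truth
# pairs from "The Vampyre".
reference  = "The vampyre rose from the ancient tomb."
hypothesis = "The vampyre r0se frorn the anci3nt tomb."

cer = jiwer.cer(reference, hypothesis)  # character-level edit distance / reference length
wer = jiwer.wer(reference, hypothesis)  # word-level edit distance / reference word count
print(f"CER: {cer:.2%}  WER: {wer:.2%}")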

πŸš€ Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original:  {ocr_text}")
print(f"Corrected: {corrected_text}")

Using Pipeline

from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/t5-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"

πŸŽ“ Training Details

Training Data

  • Synthetic Data (Train/Val): 1020 samples
    • 85% training (867 samples)
    • 15% validation (153 samples)
    • Generated using GPT-4 with 20 corruption strategies
  • Real Data (Test): 300 samples from "The Vampyre" OCR text
  • No data leakage: the test set contains only real OCR data, which the model never saw during training
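
A minimal sketch of the 85/15 split described above, assuming the pairs are held in a Python list; the two example pairs and the random seed are illustrative, not taken from the actual dataset:

from sklearn.model_selection import train_test_split

# Stand-in for the 1020 (ocr_text, clean_text) pairs; entries are
# examples from this card, not the real data.
synthetic_pairs = [
    ("Th1s 1s an 0CR err0r", "This is an OCR error"),
    ("The anci3nt tre55", "The ancient trees"),
]

# 85% train / 15% validation; the seed is an assumption.
train_pairs, val_pairs = train_test_split(synthetic_pairs, test_size=0.15, random_state=42)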

Training Configuration

  • Epochs: 20 (best model at epoch 16)
  • Batch Size: 16
  • Learning Rate: 1e-4
  • Optimizer: AdamW with weight decay 0.01
  • Scheduler: Linear with warmup (10% warmup steps)
  • Max Sequence Length: 512 tokens
  • Early Stopping: Monitored validation CER
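
A sketch of how this configuration maps onto Hugging Face Seq2SeqTrainingArguments; the original training script is not published, so any argument not listed above (output path, save/eval strategy, metric wiring) is an assumption:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-vampyre-ocr",      # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,                # AdamW is the Trainer's default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # 10% warmup steps
    eval_strategy="epoch",            # "evaluation_strategy" on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,      # retains the best checkpoint (epoch 16 here)
    metric_for_best_model="cer",      # assumes a compute_metrics that returns "cer"
    greater_is_better=False,
    predict_with_generate=True,
)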

Corruption Strategies (Training Data)

The synthetic training data included these OCR error types:

  • Character substitutions (visual similarity)
  • Missing/extra characters
  • Word boundary errors
  • Case errors
  • Punctuation errors
  • Long s (ΕΏ) substitutions
  • Historical typography errors
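
As a toy illustration of two of these error types (visually similar substitutions and the long s), here is a small corruption function; the actual training data was generated by GPT-4 with 20 strategies, so this only sketches the idea:

import random

# Visually similar character substitutions (a small subset for illustration).
VISUAL_SUBS = {"o": "0", "i": "1", "l": "1", "e": "3", "s": "5"}

def corrupt(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Randomly inject two of the OCR error types listed above."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < rate:
            out.append("ſ")                  # long s substitution
        elif ch in VISUAL_SUBS and rng.random() < rate:
            out.append(VISUAL_SUBS[ch])      # visually similar character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("the whispering trees of the ancient forest"))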

πŸ“ˆ Training Progress

The model showed consistent improvement:

  • Epoch 5: CER 26.87%
  • Epoch 10: CER 15.20%
  • Epoch 16: CER 13.93% ⭐ (Best)
  • Plateau after epoch 16

πŸ’‘ Use Cases

This model is particularly effective for:

  • Correcting OCR errors in historical documents
  • Post-processing digitized manuscripts
  • Cleaning text from scanned historical books
  • Literary text restoration
  • Academic research on historical texts

⚠️ Limitations

  • Optimized for English historical texts
  • Best performance on texts similar to 19th-century literature
  • May struggle with extremely degraded or non-standard OCR
  • Maximum input length: 512 tokens
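
One way around the 512-token limit (a workaround I am assuming, not a documented feature of this model) is to correct long documents sentence by sentence and rejoin the outputs:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/t5-synthetic-data-vampyre-ocr-correction")

def correct_long_text(text: str) -> str:
    # Naive split on periods; a real sentence splitter (e.g. nltk) would be
    # more robust for historical punctuation.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    corrected = []
    for sentence in sentences:
        ids = tokenizer(sentence, return_tensors="pt",
                        max_length=512, truncation=True).input_ids
        output = model.generate(ids, max_length=512)
        corrected.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return " ".join(corrected)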

πŸ”¬ Evaluation Examples

Original OCR               Corrected Output
"Th1s 1s an 0CR err0r"     "This is an OCR error"
"The anci3nt tre55"        "The ancient trees"
"bl0omiNg floweRs"         "blooming flowers"

πŸ“š Citation

If you use this model in your research, please cite:

@misc{t5-vampyre-ocr,
  author = {Ejaz},
  title = {T5 Base OCR Error Correction for Historical Texts},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction}}
}

πŸ‘€ Author

Ejaz - Master's Student in AI and Robotics

πŸ“„ License

Apache 2.0

πŸ™ Acknowledgments

  • Base model: google-t5/t5-base
  • Training data: "The Vampyre" by John William Polidori
  • Synthetic data generation: GPT-4
