BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of facebook/bart-base for correcting OCR errors in historical texts. It was trained on synthetic OCR errors and evaluated on real OCR text from "The Vampyre".

🎯 Model Description

  • Base Model: facebook/bart-base
  • Task: OCR error correction
  • Training Strategy:
    • Train/Val: Synthetic OCR data (1020 samples with GPT-4 generated errors)
    • Test: Real OCR data from "The Vampyre" (300 samples)
  • Best Checkpoint: Epoch 2
  • Validation CER: 14.49%
  • Validation WER: 37.99%

📊 Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|---|---|
| Character Error Rate (CER) | 14.49% |
| Word Error Rate (WER) | 37.99% |
| Exact Match | 0.0% |
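
For reference, CER and WER of this kind can be computed with the jiwer library. A minimal sketch (the example strings below are illustrative, not taken from the test set):

from jiwer import cer, wer

# Reference transcription and a model output for one sample
reference = "The breeze whispered softly through the ancient trees"
hypothesis = "The breeze whispered softly through the ancient tres"

print(f"CER: {cer(reference, hypothesis):.2%}")  # character-level edit distance / reference length
print(f"WER: {wer(reference, hypothesis):.2%}")  # word-level edit distance / reference word count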

🚀 Usage

Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original:  {ocr_text}")
print(f"Corrected: {corrected_text}")

Using Pipeline

from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"

🎓 Training Details

Training Data

  • Synthetic Data (Train/Val): 1020 samples
    • 85% training (~867 samples)
    • 15% validation (~153 samples)
    • Generated using GPT-4 with 20 corruption strategies
  • Real Data (Test): 300 samples from "The Vampyre" OCR text
  • No data leakage: Test set contains only real OCR data, never seen during training
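
A sketch of how such a split can be reproduced with the datasets library; the file names and field layout are assumptions, since the raw training files are not published with this card:

from datasets import load_dataset

# Hypothetical file names; adjust to your own corrupted/clean text pairs
data = load_dataset(
    "json",
    data_files={
        "synthetic": "synthetic_ocr_pairs.json",  # GPT-4 corrupted / clean pairs
        "test": "vampyre_real_ocr.json",          # real OCR from "The Vampyre"
    },
)

# 85% train / 15% validation on the synthetic pairs; real OCR stays test-only
split = data["synthetic"].train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = split["train"], split["test"]
test_ds = data["test"]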

Training Configuration

  • Epochs: 20 (best model at epoch 2)
  • Batch Size: 16
  • Learning Rate: 1e-4
  • Optimizer: AdamW with weight decay 0.01
  • Scheduler: Linear with warmup (10% warmup steps)
  • Max Sequence Length: 512 tokens
  • Architecture: BART encoder-decoder with 139M parameters
  • Training Time: ~30 minutes on GPU
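
A sketch of training arguments matching this configuration with the transformers Seq2SeqTrainer. Values not listed above (output directory, evaluation cadence, best-model metric) and the column names "ocr_text"/"clean_text" are assumptions; train_ds and val_ds come from the split sketch above.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def preprocess(batch):
    # Hypothetical column names for the corrupted / clean text pairs
    model_inputs = tokenizer(batch["ocr_text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["clean_text"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
val_ds = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-ocr-correction",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.1,                # linear schedule with 10% warmup steps
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # restore the best checkpoint (epoch 2 in this run)
    metric_for_best_model="loss",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()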

Corruption Strategies (Training Data)

The synthetic training data included these OCR error types:

  • Character substitutions (visual similarity)
  • Missing/extra characters
  • Word boundary errors
  • Case errors
  • Punctuation errors
  • Long s (ſ) substitutions
  • Historical typography errors
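
The GPT-4 prompts are not reproduced here, but a rule-based approximation of two of these strategies (visual-similarity substitution and long-s insertion) gives a sense of what the corrupted inputs look like:

import random

# Visually confusable characters typical of OCR output
CONFUSIONS = {"l": "1", "i": "1", "o": "0", "e": "3", "s": "5", "t": "7"}

def corrupt(text: str, rate: float = 0.15, seed: int = 0) -> str:
    """Illustrative corruption: visual substitutions plus long-s insertion."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < rate:
            out.append("ſ")                       # long s, common in historical typography
        elif ch.lower() in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch.lower()])    # visually similar digit/letter swap
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees"))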

📈 Training Progress

The model showed rapid improvement in early epochs:

  • Epoch 1: CER 16.62%
  • Epoch 2: CER 14.49% ⭐ (Best)
  • Epoch 3: CER 15.86%
  • Later epochs showed overfitting with CER rising to ~20%

The best checkpoint from epoch 2 was saved and is the one available in this repository.
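
The overfitting in later epochs could also be cut short with early stopping. A sketch building on the trainer from the configuration section above; the patience value is an assumption:

from transformers import EarlyStoppingCallback

# Stop training once validation loss fails to improve for 3 consecutive epochs
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
trainer.train()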

💡 Use Cases

This model is particularly effective for:

  • Correcting OCR errors in historical documents
  • Post-processing digitized manuscripts
  • Cleaning text from scanned historical books
  • Literary text restoration
  • Academic research on historical texts

⚠️ Limitations

  • Optimized for English historical texts
  • Best performance on texts similar to 19th-century literature
  • May struggle with extremely degraded or non-standard OCR
  • Maximum input length: 512 tokens (longer passages must be chunked; see the sketch after this list)
  • Higher WER compared to T5 baseline (37.99% vs 22.52%)
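
A minimal sentence-level chunking sketch for inputs beyond the 512-token limit, reusing model and tokenizer from the Quick Start; the character budget and rejoin strategy are assumptions:

import re

def correct_long_text(text: str, max_chars: int = 400) -> str:
    """Split on sentence boundaries, correct each chunk, and rejoin.

    max_chars is a rough proxy for staying under the 512-token limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)

    corrected = []
    for chunk in chunks:
        ids = tokenizer(chunk, return_tensors="pt", max_length=512, truncation=True).input_ids
        out = model.generate(ids, max_length=512)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return " ".join(corrected)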

🔬 Model Comparison

| Model | CER | WER | Parameters |
|---|---|---|---|
| BART-base | 14.49% | 37.99% | 139M |
| T5-base | 13.93% | 22.52% | 220M |

T5-base outperforms BART-base on both metrics: the character-level gap is small, but BART struggles considerably more with word-level corrections.

🔬 Evaluation Examples

| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

📚 Citation

If you use this model in your research, please cite:

@misc{bart-vampyre-ocr,
  author = {Ejaz},
  title = {BART Base OCR Error Correction for Historical Texts},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}

👤 Author

Ejaz - Master's Student in AI and Robotics

📄 License

Apache 2.0

πŸ™ Acknowledgments

