FlanT5-Small Grammar Correction

This model is google/flan-t5-small fine-tuned on the GEC subset of the grammarly/coedit dataset for English grammatical error correction (GEC).

Training Details

  • Base model: google/flan-t5-small (77M params)
  • Dataset: grammarly/coedit (GEC subset, 2000 training examples)
  • Training recipe: Based on the CoEdIT paper (Raheja et al., EMNLP 2023)
  • Epochs: 3
  • Learning rate: 3e-4
  • Final training loss: 0.27

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
model_id = "xhimanshuz/flan-t5-small-grammar-correction"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Prepend an instruction prefix to the sentence to correct.
text = "Fix the grammar: I goes to school yesterday and learn many thing."
inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: I went to school yesterday and learned many things.

Supported Instructions

Use instruction prefixes from the CoEdIT format:

  • "Fix the grammar: <text>"
  • "Fix grammatical errors in this sentence: <text>"
  • "Improve the grammaticality: <text>"
  • "Remove all grammatical errors from this text: <text>"
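
As a minimal sketch of how these prefixes combine with input text, the helper below builds a prompt string in the "<prefix>: <text>" form shown above. The name `build_prompt` and the `GEC_PREFIXES` tuple are hypothetical conveniences for illustration and are not part of transformers or the CoEdIT code.

```python
# Instruction prefixes listed in this model card (CoEdIT format).
GEC_PREFIXES = (
    "Fix the grammar",
    "Fix grammatical errors in this sentence",
    "Improve the grammaticality",
    "Remove all grammatical errors from this text",
)


def build_prompt(text: str, prefix: str = GEC_PREFIXES[0]) -> str:
    """Return '<prefix>: <text>' with leading/trailing whitespace stripped from text.

    Hypothetical helper for illustration; rejects prefixes not listed above.
    """
    if prefix not in GEC_PREFIXES:
        raise ValueError(f"Unsupported instruction prefix: {prefix!r}")
    return f"{prefix}: {text.strip()}"


# Build one prompt per supported prefix for the same input sentence.
prompts = [build_prompt("She don't know what are she doing.", p) for p in GEC_PREFIXES]
for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed to the tokenizer in place of `text` in the Usage snippet.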

Example Results

| Input | Output |
|---|---|
| I goes to school yesterday and learn many thing. | I went to school yesterday and learned many things. |
| She don't know what are she doing. | She doesn't know what she is doing. |
| The informations was very helpfull for our researchs. | The information was very helpful for our research. |
| He have went to the market and buyed some apple. | He has gone to the market and bought some apple. |
| The childs was playing in park when it start raining. | The children were playing in the park when it started raining. |
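
To see which words changed between an input and its correction, a word-level diff can help. The sketch below uses Python's standard `difflib` module. The function name `word_edits` is hypothetical, and whitespace tokenization is a simplifying assumption (punctuation stays attached to words).

```python
import difflib


def word_edits(source: str, corrected: str) -> list[tuple[str, str]]:
    """Return (original, replacement) word spans that differ between two sentences.

    An empty string on one side marks a pure insertion or deletion.
    """
    src_words = source.split()
    tgt_words = corrected.split()
    matcher = difflib.SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((" ".join(src_words[i1:i2]), " ".join(tgt_words[j1:j2])))
    return edits


# First row of the example results.
edits = word_edits(
    "I goes to school yesterday and learn many thing.",
    "I went to school yesterday and learned many things.",
)
for original, replacement in edits:
    print(f"{original!r} -> {replacement!r}")
```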

Training Loss Curve

| Step | Loss | Epoch |
|---|---|---|
| 1 | 0.669 | 0.00 |
| 100 | 0.484 | 0.40 |
| 250 | 0.448 | 1.00 |
| 500 | 0.325 | 2.00 |
| 750 | 0.292 | 3.00 |
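
The logged steps and epochs also imply a batch size, which this card does not state. The arithmetic below is a sketch under the assumption that every optimizer step consumes one batch and no examples are dropped, so the result is an inferred effective batch size and not a documented setting. The function names are hypothetical.

```python
def steps_per_epoch(step: int, epoch: float) -> int:
    """Optimizer steps in one epoch, inferred from a logged (step, epoch) pair."""
    return round(step / epoch)


def effective_batch_size(num_examples: int, step: int, epoch: float) -> float:
    """Examples consumed per optimizer step, assuming none are dropped."""
    return num_examples / steps_per_epoch(step, epoch)


# Values read from this model card.
train_examples = 2000
first_loss, last_logged_loss = 0.669, 0.292

per_epoch = steps_per_epoch(250, 1.00)
batch = effective_batch_size(train_examples, 250, 1.00)
relative_drop = (first_loss - last_logged_loss) / first_loss

print(f"steps per epoch: {per_epoch}")
print(f"inferred effective batch size: {batch:g}")
print(f"relative loss reduction from step 1 to step 750: {relative_drop:.1%}")
```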

Scaling Up

This model was trained on a 2000-example subset on CPU as a demonstration. For better performance:

  1. More data: Train on the full 19K GEC examples from grammarly/coedit, or all 69K examples (including simplification, paraphrasing, etc.)
  2. Larger model: Use google/flan-t5-base (250M) or google/flan-t5-large (770M)
  3. GPU training: Use A10G or A100 GPUs for faster training with larger batch sizes
  4. More epochs: Train for 5 epochs with early stopping (CoEdIT paper recipe)
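
For the first item, one way to select the GEC examples is to filter records by task. This is a minimal sketch, assuming each grammarly/coedit record exposes `task`, `src` and `tgt` fields, with the instruction prefix already included in `src`. Check the dataset card before relying on those names. `gec_pairs` and `load_gec_train` are hypothetical helper names.

```python
from typing import Iterable, Iterator


def gec_pairs(records: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Yield (source, target) pairs for records whose task is 'gec'.

    Assumes 'task', 'src' and 'tgt' keys; these field names are an assumption.
    """
    for record in records:
        if record.get("task") == "gec":
            yield record["src"], record["tgt"]


def load_gec_train() -> list[tuple[str, str]]:
    """Not called here: needs the `datasets` package and network access."""
    from datasets import load_dataset

    return list(gec_pairs(load_dataset("grammarly/coedit", split="train")))


# Tiny hand-written records for illustration only; not taken from the dataset.
sample = [
    {"task": "gec", "src": "Fix the grammar: She don't know.", "tgt": "She doesn't know."},
    {"task": "paraphrase", "src": "Paraphrase this: Hello there.", "tgt": "Hi there."},
]
print(list(gec_pairs(sample)))
```

Dropping the task check would keep every record, which corresponds to training on all tasks instead of only GEC.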

Citation

@inproceedings{raheja2023coedit,
  title={CoEdIT: Text Editing by Task-Specific Instruction Tuning},
  author={Raheja, Vipul and Kumar, Dhruv and Koo, Ryan and Kang, Dongyeop},
  booktitle={EMNLP 2023},
  year={2023}
}