mBART-50 Sinhala Transliteration Model

This model transliterates Romanized Sinhala text to Sinhala script.

Model description

This model is facebook/mbart-large-50-many-to-many-mmt fine-tuned for Sinhala transliteration. It converts Romanized Sinhala (Sinhala written in Latin characters) to Sinhala script.

Intended uses & limitations

This model is intended for transliterating Romanized Sinhala text to proper Sinhala script. It can be useful for:

Text input conversion in applications
Helping non-native speakers type in Sinhala
Converting legacy Romanized text to Sinhala script

How to use

from transformers import MBartForConditionalGeneration, MBartTokenizerFast

# Load model and tokenizer
model_name = "deshanksuman/mbart_50_SinhalaTransliteration"
tokenizer = MBartTokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Set language codes
tokenizer.src_lang = "en_XX"  # mBART has no code for Romanized Sinhala, so the English token is used as the source
tokenizer.tgt_lang = "si_LK"  # Sinhala as target

# Prepare input
text = "heta api mkda krnne"
inputs = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)

# Generate output
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=96,
    num_beams=5,
    early_stopping=True
)

# Decode output
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
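
To transliterate several sentences at once, the same steps can be wrapped in a small helper. This is a minimal sketch, not part of the released code: `transliterate_batch` is a hypothetical name, it assumes `model` and `tokenizer` were already loaded as shown above, and it pads each batch to its longest sentence rather than to a fixed length.

```python
from typing import List, Sequence


def transliterate_batch(
    texts: Sequence[str],
    model,
    tokenizer,
    max_input_length: int = 128,
    max_output_length: int = 96,
    num_beams: int = 5,
) -> List[str]:
    """Transliterate a batch of Romanized Sinhala sentences (hypothetical helper)."""
    # Same language codes as in the single-sentence example.
    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "si_LK"

    # padding=True pads only up to the longest sentence in this batch.
    inputs = tokenizer(
        list(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_input_length,
    )

    # Beam search with the same settings as the single-sentence example.
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_output_length,
        num_beams=num_beams,
        early_stopping=True,
    )

    # One decoded string per input sentence, in the same order.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `transliterate_batch(["heta api mkda krnne"], model, tokenizer)` should return a one-element list holding the same kind of output as the single-sentence example.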

Training data

The model was trained on the deshanksuman/SwaBhasha_Transliteration_Sinhala dataset, which contains pairs of Romanized Sinhala and corresponding Sinhala script text.

Training procedure

The model was trained with the following parameters:

Learning rate: 5e-05
Batch size: 16
Number of epochs: 2
Max sequence length: 128
Optimizer: AdamW

The model was trained on sentence-level pairs, so it is best suited to transliterating one sentence at a time.
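
Since training was at sentence level, longer passages are probably best split into sentences before transliteration. The sketch below is one simple way to do that, assuming sentences end with `.`, `!` or `?` followed by whitespace. Both function names are hypothetical, and `transliterate` stands for any callable that maps one Romanized sentence to Sinhala script, for example a thin wrapper around the model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces (e.g. when the input is an empty string).
    return [p for p in parts if p]


def transliterate_long_text(text: str, transliterate: Callable[[str], str]) -> str:
    """Transliterate a passage sentence by sentence and rejoin with spaces."""
    return " ".join(transliterate(sentence) for sentence in split_sentences(text))
```

Romanized chat text often lacks punctuation, so this rule-based splitter is only a starting point; unpunctuated input passes through as a single sentence.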

Examples

Example 1:

Romanized: Dakunu koreyawe eithihasika
Expected: දකුණු කොරියාවේ ඓතිහාසික
Predicted: දකුණු කොරියාවේ ඓතිහාසික
Correct: True

Example 2:

Romanized: Okoma hodai ganu gathiya
Expected: ඔක්කොම හොදයි ගෑනු ගතිය
Predicted: ඕකම හොදයි ගනු ගතිය
Correct: False

Example 3:

Romanized: Malki akkith ennwa nedenntm godak kemathiyakkila dennm supiriyatam dance
Expected: මල්කි අක්කිත් එනව නෙදෙන්නටම ගොඩක් කෑමතියිඅක්කිල දෙන්නම සුපිරියටම ඩාන්ස්
Predicted: මල්කි අක්කිත් එන්නව නෑද්දෑන්ත්ම ගොඩක් කෑමතියිඅකිල දෑන්ඩම් සුපිරියටම ඩාන්ස්
Correct: False
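
The `Correct` flag in these examples is consistent with an exact string match between the expected and predicted text. Exact match gives no partial credit, so a character error rate can be a useful companion metric. The following is a minimal sketch rather than the authors' evaluation code: `edit_distance` and `score` are hypothetical helpers, and distances are counted over Unicode code points, so a Sinhala vowel sign counts as its own character.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(expected: str, predicted: str) -> Tuple[bool, float]:
    """Return (exact match, character error rate relative to the expected text)."""
    exact = expected == predicted
    cer = edit_distance(expected, predicted) / max(len(expected), 1)
    return exact, cer


# Examples 1 and 2 from this card.
print(score("දකුණු කොරියාවේ ඓතිහාසික", "දකුණු කොරියාවේ ඓතිහාසික"))
print(score("ඔක්කොම හොදයි ගෑනු ගතිය", "ඕකම හොදයි ගනු ගතිය"))
```

Example 1 scores as an exact match with an error rate of zero, while Example 2 fails exact match but has a nonzero error rate that reflects how much of the sentence was still correct.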

Citation

@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}

Model size: 0.6B params (Safetensors, tensor type F32)

Paper: Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources (arXiv:2507.09245, published Jul 12, 2025)