Paper: Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources (arXiv:2507.09245)
This model transliterates Romanized Sinhala text to Sinhala script.
This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt, specialized for Sinhala transliteration. It converts Romanized Sinhala (Sinhala written in Latin characters) to Sinhala script.
This model is intended for transliterating Romanized Sinhala text to proper Sinhala script. It can be useful for:
from transformers import MBartForConditionalGeneration, MBartTokenizerFast
# Load model and tokenizer
model_name = "deshanksuman/mbart_50_SinhalaTransliteration"
tokenizer = MBartTokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
# Set language codes
tokenizer.src_lang = "en_XX" # Using English as source language token
tokenizer.tgt_lang = "si_LK" # Sinhala as target
# Prepare input
text = "heta api mkda krnne"
inputs = tokenizer(text, return_tensors="pt", max_length=128, padding="max_length", truncation=True)
# Generate output
outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=96,
    num_beams=5,
    early_stopping=True,
)
# Decode output
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
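For more than one input, the same calls can be wrapped in a small batching helper. The following is a minimal sketch rather than part of the released model: the function name `transliterate_batch` and the `batch_size` default are assumptions made for illustration, and it expects the `tokenizer` and `model` objects loaded as in the snippet above. It uses dynamic padding (`padding=True`) per batch instead of padding every input to 128 tokens.

```python
def transliterate_batch(texts, tokenizer, model, batch_size=8,
                        max_input_length=128, max_output_length=96, num_beams=5):
    """Transliterate a list of Romanized Sinhala sentences in mini-batches.

    Hypothetical helper: `tokenizer` and `model` are the objects loaded in the
    usage snippet; the generation settings mirror that snippet.
    """
    results = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Pad only to the longest sentence in this batch.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=max_input_length,
            padding=True,
            truncation=True,
        )
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_output_length,
            num_beams=num_beams,
            early_stopping=True,
        )
        # Decode every sequence in the batch, dropping special tokens.
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

A call such as `transliterate_batch(["heta api mkda krnne"], tokenizer, model)` would return a list with one Sinhala string per input sentence.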
The model was trained on the deshanksuman/SwaBhasha_Transliteration_Sinhala dataset, which contains pairs of Romanized Sinhala and corresponding Sinhala script text.
The model was trained with the following parameters:
The model is trained for sentence-level transliteration.
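Because the model works on single sentences, longer passages may need to be split before transliteration. The sketch below shows one possible approach and is not taken from the model's own pipeline: `split_sentences` and `transliterate_document` are hypothetical helpers, the splitting rule (break after `.`, `!`, or `?` followed by whitespace) is a naive assumption, and `transliterate_sentence` stands for any callable that maps one Romanized sentence to Sinhala script.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split a passage into sentences using the naive rule above."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


def transliterate_document(text, transliterate_sentence):
    """Apply a sentence-level transliteration callable to each sentence."""
    return " ".join(transliterate_sentence(s) for s in split_sentences(text))
```

Text without sentence-final punctuation is passed through as a single sentence, so inputs like the short example in the usage snippet are unaffected.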
Example 1:
Example 2:
Example 3:
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}