SETU - Script-agnostic English Translation Unifier

SETU is a neural translation model that unifies multiscript, multilingual, and informal text into clean, formal English.

Model Description

The SETU model can handle:

Romanized Nepali to English translation
Devanagari Nepali to English translation
Code-mixed text to English translation
Informal/slang to formal English translation

Try It Out

🚀 Interactive Demo: Try SETU in Google Colab: https://colab.research.google.com/drive/1KdLiLtAKGK8_XLyFlEwSqGFPZZqGwl4n?usp=sharing

Installation

Ensure that you have transformers and onnxruntime installed:

pip install transformers onnxruntime
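A quick way to confirm the installation is to check that both packages are importable before loading the model. The snippet below is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of SETU or transformers.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The two packages named in the install command above.
    missing = missing_packages(["transformers", "onnxruntime"])
    print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the pip command above for the packages it names.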

Usage

from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("santoshdahal/setu", trust_remote_code=True)

# Translate text
result = model("mero name ramesh  ho")
print("Translation:", result)
# Output: "My name is Ramesh."

# Works with Devanagari script too
result = model("सामाजिक मिडिया र ग्राउण्ड वास्तविकता फरक छ।")
print("Translation:", result)
# Output: "Social media and reality are different."

# Handles informal text
result = model("what is your nam")
print("Translation:", result)
# Output: "what's your name"
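To translate several inputs in one go, the single-string call shown above can be wrapped in a small helper. This is a sketch under two assumptions: that the model object is callable with one string and returns the translated string, as the examples above suggest, and that collapsing repeated whitespace before translation is acceptable. The name `translate_all` is hypothetical.

```python
from typing import Callable, Dict, Iterable


def translate_all(translate: Callable[[str], str], texts: Iterable[str]) -> Dict[str, str]:
    """Map each non-empty, whitespace-normalized input to its translation.

    `translate` is any callable that takes one string and returns one string;
    the loaded SETU model is assumed to behave this way.
    """
    results: Dict[str, str] = {}
    for text in texts:
        cleaned = " ".join(text.split())  # collapse repeated spaces and trim
        if not cleaned:
            continue  # skip empty or whitespace-only inputs
        results[cleaned] = translate(cleaned)
    return results
```

With the model loaded as above, a call such as `translate_all(model, ["mero name ramesh ho", "what is your nam"])` would return a dictionary keyed by the cleaned input text.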

Model Details

Model Type: Neural Machine Translation
Architecture: Transformer
Vocabulary Size: 40,253 tokens
Languages Supported: Nepali (Romanized & Devanagari), English, Code-mixed text
Model Format: ONNX for efficient inference

Technical Implementation

The model uses:

ONNX Runtime for efficient inference
SentencePiece for tokenization
Beam search decoding with configurable beam size
Separate encoder and decoder ONNX models
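Beam search with a configurable beam size, as listed above, keeps the best few partial translations at each step instead of committing to the single most likely token. The sketch below illustrates the general technique on a toy table of probabilities; it is not SETU's actual decoder. The `step` callable is a hypothetical stand-in for one call to the decoder ONNX model, and the token ids and probabilities are invented for illustration. No length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    step: Callable[[List[int]], Dict[int, float]],
    bos_id: int,
    eos_id: int,
    beam_size: int = 4,
    max_len: int = 20,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search.

    `step(prefix)` returns a mapping from each candidate next-token id to its
    log-probability given `prefix` (hypothetical stand-in for a decoder call).
    """
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    finished: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every candidate next token.
        candidates = []
        for score, prefix in beams:
            for token, logp in step(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        if not candidates:
            break
        # Keep only the `beam_size` best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == eos_id:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break

    # Prefer hypotheses that reached EOS; otherwise fall back to live beams.
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy "decoder": token 0 is BOS, token 1 is EOS, tokens 2 and 3 are words.
TOY_MODEL = {
    (0,): {2: math.log(0.6), 3: math.log(0.4)},
    (0, 2): {1: math.log(0.45), 3: math.log(0.55)},
    (0, 2, 3): {1: math.log(1.0)},
    (0, 3): {1: math.log(1.0)},
}


def toy_step(prefix: List[int]) -> Dict[int, float]:
    return TOY_MODEL.get(tuple(prefix), {})


if __name__ == "__main__":
    print("beam_size=1:", beam_search(toy_step, bos_id=0, eos_id=1, beam_size=1))
    print("beam_size=2:", beam_search(toy_step, bos_id=0, eos_id=1, beam_size=2))
```

In this toy table, a beam size of 1 behaves like greedy decoding and follows the locally best first token, while a beam size of 2 keeps the second-best first token alive long enough to find a sequence with a higher total probability. Larger beams trade more decoder calls for a wider search.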

Files Included

encoder.onnx: ONNX encoder model
decoder.onnx: ONNX decoder model
spm.model: SentencePiece tokenizer model
spm.vocab: SentencePiece vocabulary
config.json: Model configuration
modeling_setu_translation.py: Model implementation
configuration_setu_translation.py: Configuration class
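When working from a local copy of the repository, it can help to verify that all of the files listed above are present before loading the model. The following is a minimal sketch; the helper name `missing_files` and the idea of a local model directory are assumptions for illustration.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "encoder.onnx",
    "decoder.onnx",
    "spm.model",
    "spm.vocab",
    "config.json",
    "modeling_setu_translation.py",
    "configuration_setu_translation.py",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are not present in `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty return value means every listed file was found in the directory.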

Citation

If you use this model, please cite:

@misc{setu2025,
  title={SETU: Script-agnostic English Translation Unifier},
  author={Santosh Dahal},
  year={2025}
}

Evaluation results

BLEU on Nepali-English Mixed Dataset (self-reported): 49.5
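For readers unfamiliar with the metric, BLEU combines clipped n-gram precisions (usually for n = 1 to 4) with a brevity penalty for translations shorter than the reference. The sketch below is a simplified corpus-level version that assumes one reference per sentence, whitespace tokenization, and no smoothing. How the score reported above was computed (tool, tokenization, smoothing) is not stated here, so this code should not be expected to reproduce it.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in `tokens`."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, and one that shares no 4-gram with its reference scores 0 in this unsmoothed form. Published scores are normally computed with a standard tool so that tokenization is consistent across systems.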