# NLLB-350M-EN-KM-v10

## Model Description

This model is a compact English-to-Khmer neural machine translation model created through knowledge distillation from NLLB-200. This is the research evaluation version, trained for the full 10 epochs; it achieves competitive translation quality with 42% fewer parameters than the 600M baseline.

- Developed by: Chealyfey Vutha
- Model type: Sequence-to-sequence transformer for machine translation
- Language(s): English to Khmer (en → km)
- License: CC-BY-NC 4.0
- Base model: facebook/nllb-200-distilled-600M
- Teacher model: facebook/nllb-200-1.3B
- Parameters: ~350M (42% reduction from the 600M baseline)
 
## Model Details

### Architecture

- Encoder layers: 3 (reduced from 12)
- Decoder layers: 3 (reduced from 12)
- Hidden size: 1024
- Attention heads: 16
- Total parameters: ~350M (see the configuration sketch below)
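
The reduced architecture can be reproduced by shrinking the baseline configuration before distillation. The snippet below is a minimal sketch using standard `transformers` configuration fields; the exact layer-selection and weight-initialization procedure used for this model is not documented here.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Start from the 600M distilled baseline (12 encoder + 12 decoder layers, d_model=1024, 16 heads)
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")

# Shrink to the 3 + 3 layer student described above
config.encoder_layers = 3
config.decoder_layers = 3

# Instantiate a student with the reduced architecture; in practice its layers would be
# initialized from teacher/baseline weights before distillation, not left random as here.
student = AutoModelForSeq2SeqLM.from_config(config)
print(f"Student parameters: {student.num_parameters() / 1e6:.0f}M")  # ~350M (the large NLLB vocabulary dominates)
```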
 
### Training Procedure

- Distillation method: Temperature-scaled knowledge distillation (see the loss sketch below)
- Teacher model: NLLB-200-1.3B
- Temperature: 5.0
- Lambda (loss weighting): 0.5
- Training epochs: 10 (full training)
- Training data: 316,110 English-Khmer pairs (generated via the DeepSeek API)
- Hardware: NVIDIA A100-SXM4-80GB
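
Temperature-scaled distillation blends hard-label cross-entropy with a KL term that matches the student's softened output distribution to the teacher's. The snippet below is a minimal sketch of this common formulation using the T = 5.0 and λ = 0.5 values listed above; it is illustrative, not the exact training code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=5.0, lam=0.5):
    """Hard-label cross-entropy blended with temperature-scaled KL to the teacher."""
    # Cross-entropy on the reference Khmer tokens (padding labeled -100 is ignored)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # KL divergence between softened student and teacher distributions,
    # scaled by T^2 to keep gradients on a comparable scale.
    # (Padding positions are not masked in this simplified sketch.)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Lambda = 0.5 weights the two terms equally
    return lam * ce + (1.0 - lam) * kl
```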
 
## Intended Uses

### Direct Use

This model is intended for:

- Production English-to-Khmer translation applications
- Research on efficient neural machine translation
- Cambodian language technology development
- Cultural preservation through digital translation tools

### Downstream Use

- Integration into mobile translation apps
- Website localization services
- Educational language learning platforms
- Government and NGO translation services in Cambodia
 
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

# Configuration
CONFIG = {
    "model_name": "lyfeyvutha/nllb_350M_en_km_v10",
    "tokenizer_name": "facebook/nllb-200-distilled-600M",
    "source_lang": "eng_Latn",
    "target_lang": "khm_Khmr",
    "max_length": 128,
}

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_name"])
tokenizer = AutoTokenizer.from_pretrained(
    CONFIG["tokenizer_name"],
    src_lang=CONFIG["source_lang"],
    tgt_lang=CONFIG["target_lang"],
)

# Set up generation: force decoding to start with the Khmer language token
khm_token_id = tokenizer.convert_tokens_to_ids(CONFIG["target_lang"])
generation_config = GenerationConfig(
    max_length=CONFIG["max_length"],
    forced_bos_token_id=khm_token_id,
)

# Translate a single sentence
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```
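
For multiple sentences, the same model, tokenizer, and generation config can be reused with padded batches; this short example builds on the objects loaded above.

```python
# Batch translation: tokenize with padding, then decode every generated sequence
sentences = ["Good morning.", "Where is the nearest hospital?"]
batch = tokenizer(
    sentences,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=CONFIG["max_length"],
)
outputs = model.generate(**batch, generation_config=generation_config)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for src, tgt in zip(sentences, translations):
    print(f"{src} -> {tgt}")
```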
## Training Details

### Training Data

- Dataset size: 316,110 English-Khmer sentence pairs
- Data source: Synthetic data generated using the DeepSeek translation API
- Preprocessing: Tokenized with the NLLB-200 tokenizer at a maximum length of 128 (see the sketch below)
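
The preprocessing step can be reproduced with the NLLB tokenizer's target-text interface. The `en` and `km` column names below are hypothetical, since the training dataset's schema is not published here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="khm_Khmr",
)

def preprocess(example):
    """Tokenize an English source / Khmer target pair to a maximum length of 128."""
    return tokenizer(
        example["en"],              # hypothetical source column
        text_target=example["km"],  # hypothetical target column
        max_length=128,
        truncation=True,
    )
```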
 
### Training Hyperparameters

- Batch size: 48
- Learning rate: 3e-5
- Optimizer: AdamW
- LR scheduler: Cosine
- Training epochs: 10
- Hardware: NVIDIA A100-SXM4-80GB with CUDA 12.8
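
For reference, these settings map directly onto `Seq2SeqTrainingArguments`; the sketch below is one plausible way to express them, with the output directory and mixed-precision flag being assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="nllb_350M_en_km_v10",  # placeholder path (assumption)
    per_device_train_batch_size=48,
    learning_rate=3e-5,
    optim="adamw_torch",               # AdamW optimizer
    lr_scheduler_type="cosine",
    num_train_epochs=10,
    fp16=True,                         # assumption: mixed precision on the A100
)
```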
 
### Training Progress
| Epoch | Training Loss | Validation Loss | 
|---|---|---|
| 1 | 0.658600 | 0.674992 | 
| 2 | 0.534500 | 0.596366 | 
| 3 | 0.484700 | 0.566999 | 
| 4 | 0.453800 | 0.549162 | 
| 5 | 0.436300 | 0.542330 | 
| 6 | 0.432900 | 0.536817 | 
| 7 | 0.421000 | 0.534668 | 
| 8 | 0.412800 | 0.532001 | 
| 9 | 0.417400 | 0.533419 | 
| 10 | 0.413200 | 0.531947 | 
## Evaluation

### Testing Data

The model was evaluated on the Asian Language Treebank (ALT) corpus, which contains English-Khmer pairs manually translated from English Wikinews articles.
### Metrics

| Metric | Our Model (350M) | Baseline (600M) | Difference |
|---|---|---|---|
| chrF Score | 38.83 | 43.88 | -5.05 points |
| BERTScore F1 | 0.8608 | 0.8573 | +0.0035 |
| Parameters | 350M | 600M | -42% |
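
Both metrics can be recomputed with standard libraries; the sketch below uses `sacrebleu` for chrF and `bert-score` for BERTScore F1 over placeholder hypothesis/reference lists. The exact BERTScore backbone used for the reported numbers is not specified, so the library's default multilingual model is assumed.

```python
from sacrebleu.metrics import CHRF
from bert_score import score

hypotheses = ["..."]  # model outputs on the ALT test set (placeholder)
references = ["..."]  # ALT reference Khmer translations (placeholder)

# chrF: character n-gram F-score (sacreBLEU implementation)
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# BERTScore F1: embedding-based semantic similarity; lang="km" falls back to a multilingual backbone
P, R, F1 = score(hypotheses, references, lang="km")
print("BERTScore F1:", F1.mean().item())
```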
### Results

- Retains 88.5% of the baseline chrF score with 42% fewer parameters
- Slightly improves BERTScore F1 (+0.0035), indicating comparable semantic adequacy
- Offers substantial computational efficiency gains for deployment scenarios
 
### Performance Comparison
| Model | Parameters | chrF Score | BERTScore F1 | Efficiency Gain | 
|---|---|---|---|---|
| NLLB-350M-EN-KM (Ours) | 350M | 38.83 | 0.8608 | 42% smaller | 
| NLLB-200-Distilled-600M | 600M | 43.88 | 0.8573 | Baseline | 
## Limitations and Bias

### Limitations

- Performance trade-off: Roughly 5-point chrF decrease compared to the larger baseline
- Synthetic training data: May not capture all real-world linguistic variation
- Domain dependency: Performance may vary across different text types
- Low-resource constraints: Limited by the available English-Khmer parallel data

### Bias Considerations

- Training data generated via a translation API may inherit the source model's biases
- Limited representation of Khmer dialects and regional variation
- Potential gender, cultural, and socioeconomic biases in translation outputs
- Urban and rural language usage patterns may not be equally represented

## Ethical Considerations

- The model is designed to support Cambodian language preservation and digital inclusion
- Users should validate translations for sensitive or critical applications
- Consider cultural context when deploying in official or educational settings
 
Environmental Impact
- Hardware: Training performed on single NVIDIA A100-SXM4-80GB
 - Training time: Approximately 10 hours for full training
 - Energy efficiency: Significantly more efficient than training from scratch
 - Deployment efficiency: 42% reduction in computational requirements
 
## Citation

```bibtex
@misc{nllb350m_en_km_v10_2025,
  title  = {NLLB-350M-EN-KM-v10: Efficient English-Khmer Neural Machine Translation via Knowledge Distillation},
  author = {Chealyfey Vutha},
  year   = {2025},
  url    = {https://huggingface.co/lyfeyvutha/nllb_350M_en_km_v10}
}
```
## Acknowledgments

This work builds upon Meta's NLLB-200 models and uses the Asian Language Treebank (ALT) corpus for evaluation.

## Model Card Contact

For questions or feedback about this model card: [email protected]