🇰🇭 Khmer Tokenizer V2 – 18K Vocabulary

A compact and efficient Khmer tokenizer designed for use in NLP pipelines such as
classification, translation, summarization, and text generation.

Trained on diverse Khmer text sources, this tokenizer focuses on efficiency,
morphological accuracy, and perfect reconstruction during decoding.


Model Details

Model Description

  • Developed by: Sok Meas (@Msok99)
  • Model type: SentencePiece Unigram Tokenizer
  • Language(s): Khmer
  • License: MIT
  • Finetuned from model: None (trained from scratch)


Uses

Direct Use

  • Tokenization for Khmer NLP models
  • Embedding generation
  • Text preprocessing for machine learning or fine-tuning tasks

Downstream Use

  • Suitable for use with any Khmer-based LLM, classifier, or translation model
  • Can be paired with encoder-decoder architectures (e.g., T5, mBART)

Out-of-Scope Use

  • Not designed for semantic similarity or embedding search directly
  • Not a model for language generation by itself

Bias, Risks, and Limitations

  • May not perfectly segment highly colloquial or dialectal Khmer
  • Some rare archaic terms could be split into smaller subwords
  • The tokenizer is purely statistical (no semantic understanding)

Recommendations

Users fine-tuning Khmer models should apply the same corpus cleaning steps consistently
across their data, and should consider retraining the tokenizer on domain-specific text
when working with technical or code-mixed datasets.
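
As a rough illustration of consistent cleaning, the sketch below runs every string through one normalization function before tokenization. The specific choices here (NFC normalization, dropping U+200B and U+FEFF, collapsing whitespace) are assumptions for the example, not the cleaning actually used to train this tokenizer, and clean_khmer is a hypothetical helper name.

```python
import re
import unicodedata

# Assumption: these invisible characters are treated as noise in this sketch.
# U+200B (zero-width space) is often used as a Khmer word-break hint, so remove
# it from this table if your corpus relies on it.
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\ufeff"))
_WHITESPACE = re.compile(r"\s+")


def clean_khmer(text: str) -> str:
    """Normalize to NFC, drop the invisible characters above, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_INVISIBLE)
    return _WHITESPACE.sub(" ", text).strip()
```

Whatever rules you settle on, the point is to apply the same function to the fine-tuning corpus and to inputs at inference time, so the tokenizer sees text in one consistent form.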


How to Get Started with the Model

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# "The Ministry of Education has issued an announcement."
text = "ក្រសួងអប់រំបានចេញសេចក្តីជូនដំណឹង។"

# Inspect the subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Round trip: encode to IDs, then decode back to text
print(tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True))
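
The encode/decode round trip printed above can also be turned into a check over many samples. This is a minimal sketch, assuming a tokenizer object whose encode accepts add_special_tokens and whose decode accepts skip_special_tokens (as Hugging Face tokenizers do); failed_round_trips is a hypothetical helper name, not part of this repository.

```python
from typing import Iterable, List


def failed_round_trips(tokenizer, texts: Iterable[str]) -> List[str]:
    """Return the texts that do not survive encode -> decode unchanged."""
    failures: List[str] = []
    for text in texts:
        # Leave special tokens out so only the text itself is compared.
        ids = tokenizer.encode(text, add_special_tokens=False)
        restored = tokenizer.decode(ids, skip_special_tokens=True)
        if restored != text:
            failures.append(text)
    return failures


# Usage (assumes transformers is installed and the Hub repo is reachable):
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")
#   print(failed_round_trips(tok, ["ក្រសួងអប់រំបានចេញសេចក្តីជូនដំណឹង។"]))
```

An empty list means every sample was reconstructed exactly. Any returned strings are worth inspecting for whitespace or normalization differences before concluding that the tokenizer itself is lossy.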