🇰🇭 Khmer Tokenizer V2 - 18K Vocabulary
A compact Khmer tokenizer for NLP pipelines such as classification,
translation, summarization, and text generation.
Trained on diverse Khmer text sources, it prioritizes efficiency,
morphologically sensible segmentation, and lossless reconstruction: decoding
an encoded sequence is intended to return the original text.
Model Details
Model Description
- Developed by: Sok Meas (@Msok99)
- Model type: SentencePiece Unigram Tokenizer
- Language(s): Khmer
- License: MIT
- Finetuned from model: None (trained from scratch)
Model Sources
- Repository: https://huggingface.co/Msok99/18k_tokenizer_v2
Uses
Direct Use
- Tokenization for Khmer NLP models
- Producing token IDs that feed a model's embedding layer
- Text preprocessing for machine learning or fine-tuning tasks
Downstream Use
- Suitable for use with any Khmer-based LLM, classifier, or translation model
- Can be paired with encoder-decoder architectures (e.g., T5, mBART)
Out-of-Scope Use
- Not designed for semantic similarity or embedding search on its own
- Not a language generation model by itself
Bias, Risks, and Limitations
- May not perfectly segment highly colloquial or dialectal Khmer
- Some rare archaic terms could be split into smaller subwords
- The tokenizer is purely statistical (no semantic understanding)
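To illustrate why a purely statistical Unigram tokenizer splits rare terms into smaller subwords, here is a minimal sketch of Viterbi segmentation over a toy vocabulary. The vocabulary, its probabilities, the `unk_logp` fallback score, and the function name are all hypothetical and chosen for illustration; they are not taken from this tokenizer, and real SentencePiece additionally handles normalization and whitespace markers.

```python
import math

# Two Khmer words used to build a toy vocabulary (illustrative only).
WORD_LANGUAGE = "ភាសា"   # "language"
WORD_KHMER = "ខ្មែរ"      # "Khmer"

# Hypothetical piece -> log-probability table; NOT the real model's values.
TOY_VOCAB = {
    WORD_LANGUAGE: math.log(0.02),
    WORD_KHMER: math.log(0.03),
    WORD_LANGUAGE[:2]: math.log(0.005),
    WORD_LANGUAGE[2:]: math.log(0.004),
}


def viterbi_segment(text, vocab, unk_logp=-20.0):
    """Return the highest-scoring split of `text` into vocabulary pieces.

    Single characters missing from the vocabulary fall back to `unk_logp`,
    so every input can be segmented.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(piece) for piece in vocab)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab:
                score = vocab[piece]
            elif end - start == 1:
                score = unk_logp  # unknown single character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1]


# A frequent phrase is covered by two whole-word pieces.
print(viterbi_segment(WORD_LANGUAGE + WORD_KHMER, TOY_VOCAB))

# A string the vocabulary has never seen as a unit falls back to smaller pieces.
rare = WORD_LANGUAGE[:2] + WORD_KHMER[-1]
print(viterbi_segment(rare, TOY_VOCAB))
```

The segmentation depends only on piece scores, which is why unusual or archaic spellings tend to break into shorter, more frequent fragments.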
Recommendations
Users fine-tuning Khmer models should apply the same text cleaning and
normalization steps consistently across their corpus, and should consider
retraining the tokenizer on domain-specific data when working with technical
or code-mixed datasets.
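One way to keep corpus cleaning consistent is to route all text, at both training and inference time, through a single normalization function. The sketch below is an assumption about what such a step might look like, not the preprocessing used to train this tokenizer. The function name `clean_khmer_text` and the choice to strip zero-width spaces (often used as invisible word separators in Khmer text) are hypothetical.

```python
import re
import unicodedata


def clean_khmer_text(text: str, strip_zero_width_space: bool = True) -> str:
    """Apply one fixed set of cleaning steps so all corpora are treated alike."""
    # Canonical Unicode normalization.
    text = unicodedata.normalize("NFC", text)
    # Optionally remove U+200B (zero-width space); this is a policy choice.
    if strip_zero_width_space:
        text = text.replace("\u200b", "")
    # Remove stray byte-order marks left over from file concatenation.
    text = text.replace("\ufeff", "")
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "  ភាសា\u200bខ្មែរ \n\t 2024  "
print(repr(clean_khmer_text(raw)))
print(repr(clean_khmer_text(raw, strip_zero_width_space=False)))
```

Whether to keep or strip zero-width spaces should match whatever was done to the tokenizer's training corpus; if that is unknown, it may be worth comparing token counts under both settings.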
How to Get Started with the Model
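from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Msok99/18k_tokenizer_v2")

# Sample Khmer input ("I love the Khmer language"); substitute your own text
text = "ខ្ញុំស្រឡាញ់ភាសាខ្មែរ"

# Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)

# Round trip: encode to IDs, then decode back to text
print(tokenizer.decode(tokenizer.encode(text), skip_special_tokens=True))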
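The encode-then-decode round trip above can be turned into a small check that runs over many samples. The following is a minimal sketch: the helper name `find_roundtrip_failures` is hypothetical, and a toy code-point codec stands in for the real tokenizer so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List, Tuple


def find_roundtrip_failures(
    samples: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[Tuple[str, str]]:
    """Return (original, restored) pairs where decode(encode(x)) != x."""
    failures = []
    for sample in samples:
        restored = decode(encode(sample))
        if restored != sample:
            failures.append((sample, restored))
    return failures


# Stand-in codec (one ID per code point), used only to make the sketch runnable.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["ភាសាខ្មែរ", "Khmer NLP 2024"]
print(find_roundtrip_failures(samples, toy_encode, toy_decode))
```

With the real tokenizer, one would pass `tokenizer.encode` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. SentencePiece-based tokenizers may normalize whitespace, so comparing against a normalized copy of each sample may be more appropriate than comparing against the raw string.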