---
language: ["khm"]
tokenizer_type: "HybridBPE-MD"
license: "mit"
tags:
- khmer
- tokenizer
- bpe-md
- sentencepiece
- hybrid
- language-model
- text-processing
---

# 🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)

**BPE-MD-v3-SPM** is a hybrid Khmer tokenizer that combines:

- **BPE-MD (Morphology-Driven)** rules for Khmer word segmentation, and
- **SentencePiece BPE** modeling for subword learning, coverage, and byte safety.

The tokenizer is built for Khmer and for bilingual or mixed text (Khmer + English + Math). It handles Unicode normalization, symbols, and numerals, which makes it suitable for LLMs, translation models, and RAG systems.

---

### 🧠 Features

- **Hybrid design:** BPE-MD (morphology) × SentencePiece (subword)
- **Script coverage:** Khmer + Latin + Math + Digits
- **Vocab size:** 16,100
- **Character coverage:** 1.0
- **Includes:** user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀, etc.)

---

### 🧩 Example usage

```python
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")
text = "ខ្ញុំបានគណនាថា √25 + 3² = 14"

# Inspect the subword pieces for the mixed Khmer + math input
print(tok.tokenize(text))
# Round trip; skip_special_tokens drops special tokens such as the EOS marker
print(tok.decode(tok.encode(text), skip_special_tokens=True))
```

---

### 📊 Training details

- **Base:** Khmer Morphology-Driven corpus (education, news, QA)
- **Algorithm:** SentencePiece (BPE mode)
- **User symbols:** Mathematical, scientific, and Khmer-digit patterns
- **Goal:** Robust tokenization for LLM fine-tuning on Khmer + mixed-script data

---

### 📜 License

MIT
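
### 🔍 Illustrative sketch: keeping user-defined symbols atomic

The Features section lists user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀). The sketch below illustrates the general idea of protecting such symbols so they are never split into smaller pieces. It is not the tokenizer's actual implementation: the `split_user_symbols` helper, the `USER_SYMBOLS` list, the greedy longest-match strategy, and the choice of NFC normalization are all assumptions made for illustration, using only the Python standard library.

```python
import unicodedata

# Hypothetical symbol list, copied from the examples in the Features section.
USER_SYMBOLS = ["√", "²", "₁₀", "H₂O", "log₁₀"]


def split_user_symbols(text, symbols=USER_SYMBOLS):
    """Split text into (segment, is_symbol) pairs using greedy longest match."""
    # Assumption: NFC normalization. NFC leaves superscripts and subscripts
    # untouched, whereas NFKC would fold "²" into a plain "2".
    text = unicodedata.normalize("NFC", text)
    # Try longer symbols first so "log₁₀" wins over the shorter "₁₀".
    ordered = sorted(symbols, key=len, reverse=True)

    segments, buffer, i = [], [], 0
    while i < len(text):
        match = next((s for s in ordered if text.startswith(s, i)), None)
        if match is not None:
            if buffer:
                # Flush the ordinary text collected so far.
                segments.append(("".join(buffer), False))
                buffer = []
            segments.append((match, True))
            i += len(match)
        else:
            buffer.append(text[i])
            i += 1
    if buffer:
        segments.append(("".join(buffer), False))
    return segments


if __name__ == "__main__":
    for segment, is_symbol in split_user_symbols("log₁₀ H₂O √25 + 3²"):
        print(repr(segment), "symbol" if is_symbol else "text")
```

In a real pipeline, only the segments flagged as ordinary text would be passed on to morphological segmentation and subword splitting, while flagged symbols map directly to single vocabulary entries.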