alakxender/dhivehi-news-corpus
This tokenizer extends google/flan-t5-base to support Dhivehi (Thaana script) characters while preserving English subword tokenization.
Base tokenizer: google/flan-t5-base. Its existing English subword tokens (_This, _is, etc.) are preserved.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alakxender/dhivehi-T5-tokenizer-extended")

# Dhivehi example
text = "ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!"
tokens = tokenizer.tokenize(text)
print(tokens)
```
| Feature | Stock Flan-T5 Tokenizer | Extended Dhivehi Tokenizer |
|---|---|---|
| Dhivehi support | ❌ Uses `<unk>` | Proper tokenization |
| English tokenization | ✅ Yes | Preserved |
| Added tokens | ❌ No | Thaana characters |
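
The "Added tokens" row can be pictured with a short sketch. This is not necessarily how this tokenizer was built, since the card does not describe the procedure. It only illustrates one way Thaana characters could be appended to the stock vocabulary with the Transformers `add_tokens` method. The helper names `thaana_characters` and `extend_tokenizer`, and the choice of the code point range U+0780 to U+07B1 (the assigned part of the Unicode Thaana block), are assumptions made for this example.

```python
def thaana_characters():
    """Return the assigned code points of the Unicode Thaana block as 1-char strings."""
    # U+0780..U+07B1 cover the Thaana letters and vowel signs.
    return [chr(cp) for cp in range(0x0780, 0x07B2)]


def extend_tokenizer(base_name="google/flan-t5-base"):
    """Hypothetical helper: load the base tokenizer and add Thaana characters to it."""
    # Imported here so thaana_characters() works without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_name)
    # add_tokens only adds entries missing from the vocabulary
    # and returns how many were actually added.
    num_added = tokenizer.add_tokens(thaana_characters())
    return tokenizer, num_added


if __name__ == "__main__":
    chars = thaana_characters()
    print(len(chars), chars[0], chars[-1])
```

A model paired with a tokenizer extended this way would also need its embedding matrix resized, for example with `model.resize_token_embeddings(len(tokenizer))`, before it can be trained on Dhivehi text.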
The stock flan-t5-base tokenizer does not support Dhivehi text properly:
```python
from transformers import AutoTokenizer

stock_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
tokens = stock_tokenizer.tokenize("ހެނބަދޫ މުދިންބެ: ކުދިން ހަތިމްކުރާ ކޮންމެ ފަހަރަކު ލޮލުން ކަރުނަ!")
print(tokens)
# Output: ['<unk>', '<unk>']
```
In contrast, the extended tokenizer splits Thaana text into individual characters or learned units, so the content of the text is preserved and no `<unk>` tokens are produced.
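
One way to check this difference numerically is to count how many token ids map to the unknown token. The sketch below is illustrative only: `unk_rate` and `compare_tokenizers` are hypothetical helper names, and the comparison works on token ids rather than token strings so that it does not depend on how a given tokenizer renders unknown pieces.

```python
def unk_rate(token_ids, unk_id):
    """Fraction of ids equal to unk_id; 0.0 for an empty sequence."""
    if not token_ids:
        return 0.0
    return sum(1 for i in token_ids if i == unk_id) / len(token_ids)


def compare_tokenizers(text, names):
    """Return {tokenizer name: unk rate on text}. Downloads tokenizers from the Hub."""
    # Imported here so unk_rate() can be used without transformers installed.
    from transformers import AutoTokenizer

    rates = {}
    for name in names:
        tok = AutoTokenizer.from_pretrained(name)
        # Skip special tokens such as </s> so only the text itself is measured.
        ids = tok(text, add_special_tokens=False)["input_ids"]
        rates[name] = unk_rate(ids, tok.unk_token_id)
    return rates
```

Calling `compare_tokenizers(text, ["google/flan-t5-base", "alakxender/dhivehi-T5-tokenizer-extended"])` with a Dhivehi sentence would then give one rate per tokenizer, which makes the comparison in the table easy to reproduce on other inputs.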