Quran 5-gram KenLM Language Model
A 5-gram statistical language model trained on the Quran corpus (6,348 verses), built with KenLM.
Training data
- Source: `quran-simple-clean.txt`, 6,348 Quranic verses, one per line.
- Augmented to 13,901 sentences by:
- Adding consecutive verse pairs with 80% probability (models natural recitation flow where verse N+1 follows verse N).
- Adding verse triplets (~24% probability).
- Adding ~15% self-repetitions (verse recited twice).
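The augmentation steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the model: the function name `augment`, the seed, joining verses with a single space, and writing a self-repetition as one line are all assumptions. With these rates the expected total is about 6,348 + 0.80 × 6,347 + 0.24 × 6,346 + 0.15 × 6,348 ≈ 13,901 lines, which is consistent with the figure quoted above.

```python
import random


def augment(verses, p_pair=0.80, p_triplet=0.24, p_repeat=0.15, seed=0):
    """Return the original verses plus sampled pairs, triplets and repetitions.

    Hypothetical helper: the probabilities come from the model card, but the
    sampling scheme and the space-joined line format are assumptions.
    """
    rng = random.Random(seed)
    out = list(verses)  # every original verse is kept once
    n = len(verses)
    for i in range(n):
        # Pair: verse i followed by verse i+1 on one line.
        if i + 1 < n and rng.random() < p_pair:
            out.append(verses[i] + " " + verses[i + 1])
        # Triplet: verses i, i+1, i+2 on one line.
        if i + 2 < n and rng.random() < p_triplet:
            out.append(" ".join(verses[i:i + 3]))
        # Self-repetition: the same verse recited twice.
        if rng.random() < p_repeat:
            out.append(verses[i] + " " + verses[i])
    return out
```

Each output line is then treated as one training sentence, so n-grams that span a verse boundary are seen during training.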
Files
| File | Format | Description |
|---|---|---|
| `quran_5gram.arpa` | ARPA text | Portable n-gram model |
| `quran_5gram.binary` | KenLM binary | Fast probing hash |
| `config.json` | JSON | Model metadata |
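For readers who want to rebuild comparable files, the sketch below shows one plausible way to produce the ARPA and binary models with KenLM's `lmplz` and `build_binary` command-line tools. The exact commands used for this model are not documented here, so the file name `quran_augmented.txt`, the helper names, and the optional `--discount_fallback` flag are assumptions.

```python
import shutil
import subprocess


def lmplz_command(corpus, arpa, order=5, discount_fallback=False):
    """Build the argv for KenLM's lmplz (ARPA estimation)."""
    cmd = ["lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa]
    if discount_fallback:
        # Assumption: may be needed if discount estimation fails on a small corpus.
        cmd.append("--discount_fallback")
    return cmd


def build_binary_command(arpa, binary):
    """Build the argv for KenLM's build_binary using the probing hash format."""
    return ["build_binary", "probing", arpa, binary]


def build(corpus="quran_augmented.txt",  # hypothetical file name
          arpa="quran_5gram.arpa",
          binary="quran_5gram.binary"):
    """Run both steps; requires the KenLM tools to be on PATH."""
    for cmd in (lmplz_command(corpus, arpa), build_binary_command(arpa, binary)):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

The ARPA file stays human-readable and portable, while the binary file loads faster at inference time.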
Usage
import kenlm
model = kenlm.Model("quran_5gram.binary") # or .arpa
# Score a verse (log10 probability)
score = model.score("بسم الله الرحمن الرحيم")
# Perplexity
ppl = model.perplexity("الحمد لله رب العالمين")
# Score a verse sequence (consecutive recitation)
sequence = "بسم الله الرحمن الرحيم الحمد لله رب العالمين"
seq_score = model.score(sequence)
Install KenLM Python bindings: pip install kenlm
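The score and perplexity values in the usage snippet are related. Assuming the convention used by the KenLM Python bindings (a log10 score that includes the end-of-sentence token, normalized by the number of whitespace-separated words plus one), the conversion can be sketched without loading the model. The helper name `perplexity_from_score` is hypothetical.

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a log10 sentence score into per-token perplexity.

    Assumes the score covers every whitespace-separated word plus one
    end-of-sentence token, which is how the KenLM bindings are understood
    to normalize perplexity.
    """
    n_tokens = len(sentence.split()) + 1  # +1 for </s>
    return 10.0 ** (-log10_score / n_tokens)
```

Under that assumption, `perplexity_from_score(model.score(text), text)` should agree with `model.perplexity(text)`. Because raw scores become more negative as sequences get longer, perplexity is the more suitable quantity for comparing a single verse against a multi-verse sequence.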