Quran 5-gram KenLM Language Model

A 5-gram statistical language model trained on the Quran corpus (6,348 verses), built with KenLM.

Training data

  • Source: quran-simple-clean.txt, containing 6,348 Quranic verses, one per line.
  • Augmented to 13,901 sentences by:
    • Adding consecutive verse pairs with 80% probability (models natural recitation flow where verse N+1 follows verse N).
    • Adding verse triplets (~24% probability).
    • Adding ~15% self-repetitions (verse recited twice).
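The augmentation steps above can be sketched as follows. This is a minimal illustration, not the script used to build the model: the function name `augment_verses`, its parameters, the seed, and the choice to sample each augmentation independently per verse (including across surah boundaries) are all assumptions. In expectation it yields about 6,348 × (1 + 0.80 + 0.24 + 0.15) ≈ 13,900 sentences, in line with the reported 13,901.

```python
import random


def augment_verses(verses, p_pair=0.80, p_triplet=0.24, p_repeat=0.15, seed=0):
    """Return the original verses plus sampled pairs, triplets, and repetitions.

    Hypothetical helper: the sampling details are assumed, not taken from
    the original training script.
    """
    rng = random.Random(seed)
    sentences = list(verses)  # every original verse is kept exactly once
    for i, verse in enumerate(verses):
        # Consecutive pair: verse N followed by verse N+1.
        if i + 1 < len(verses) and rng.random() < p_pair:
            sentences.append(f"{verse} {verses[i + 1]}")
        # Triplet: verses N, N+1, N+2 joined into one sentence.
        if i + 2 < len(verses) and rng.random() < p_triplet:
            sentences.append(" ".join(verses[i:i + 3]))
        # Self-repetition: the same verse recited twice.
        if rng.random() < p_repeat:
            sentences.append(f"{verse} {verse}")
    return sentences
```

Each augmented item is written as a single line, so KenLM treats a verse pair or triplet as one sentence and learns n-grams that span the verse boundary.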

Files

| File | Format | Description |
|---|---|---|
| quran_5gram.arpa | ARPA text | Portable n-gram model |
| quran_5gram.binary | KenLM binary | Fast probing hash |
| config.json | JSON | Model metadata |
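For readers who want to regenerate the two model files, the sketch below assembles the standard KenLM command lines (`lmplz` to estimate the ARPA model, `build_binary` to convert it to the probing format). The corpus file name `augmented.txt` and the helper `kenlm_build_commands` are hypothetical, and the exact flags used for this model are not documented here. `--discount_fallback` is an optional `lmplz` flag that is sometimes needed on small corpora; it is off by default in this sketch.

```python
import shlex


def kenlm_build_commands(corpus="augmented.txt",        # hypothetical file name
                         arpa="quran_5gram.arpa",
                         binary="quran_5gram.binary",
                         order=5,
                         discount_fallback=False):
    """Build argv lists for estimating an n-gram model and binarizing it."""
    # Step 1: estimate a modified Kneser-Ney model of the given order.
    estimate = ["lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa]
    if discount_fallback:
        estimate.append("--discount_fallback")
    # Step 2: convert the ARPA file to the probing hash-table binary format.
    convert = ["build_binary", "probing", arpa, binary]
    return [estimate, convert]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where the KenLM binaries are not installed.
    for cmd in kenlm_build_commands():
        print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)`, provided the KenLM binaries are on the `PATH`.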

Usage

import kenlm

model = kenlm.Model("quran_5gram.binary")   # or .arpa

# Score a verse (log10 probability)
score = model.score("بسم الله الرحمن الرحيم")

# Perplexity
ppl = model.perplexity("الحمد لله رب العالمين")

# Score a verse sequence (consecutive recitation)
sequence = "بسم الله الرحمن الرحيم الحمد لله رب العالمين"
seq_score = model.score(sequence)

Install KenLM Python bindings: pip install kenlm
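The log10 score and the perplexity are two views of the same quantity. The sketch below shows the conversion, assuming the convention of the kenlm Python module, where the token count is the number of whitespace-separated words plus one for the end-of-sentence marker. The helper name `perplexity_from_log10` is hypothetical.

```python
def perplexity_from_log10(log10_prob, sentence):
    """Convert a total log10 probability into per-token perplexity.

    Assumes the sentence was scored with begin- and end-of-sentence
    markers, so the end-of-sentence token counts as one extra token.
    """
    n_tokens = len(sentence.split()) + 1  # words plus </s>
    return 10.0 ** (-log10_prob / n_tokens)
```

With a loaded model, `perplexity_from_log10(model.score(s), s)` should agree with `model.perplexity(s)` under that assumption. Lower perplexity means the model finds the word sequence more predictable.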
