Quran 5-gram KenLM Language Model

A 5-gram statistical language model trained on the Quran corpus (6,348 verses), built with KenLM.

Training data

Source: quran-simple-clean.txt, containing 6,348 Quranic verses, one per line.
The corpus was augmented to 13,901 training sentences by adding, for each verse:
- A consecutive verse pair (verse N followed by verse N+1) with 80% probability, which models natural recitation flow.
- A consecutive verse triplet with ~24% probability.
- A self-repetition (the verse recited twice) with ~15% probability.
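
The augmentation described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the model: the function name `augment_verses`, its parameters, the independent random draws, the single-space joining, and the fixed seed are all assumptions. Whether the original pipeline allowed pairs to cross surah boundaries is not stated here, so the sketch does not handle that case specially.

```python
import random


def augment_verses(verses, p_pair=0.80, p_triplet=0.24, p_repeat=0.15, seed=0):
    """Return the original verses plus randomly sampled pairs, triplets, and repetitions.

    Hypothetical helper: the probabilities mirror the figures quoted above,
    everything else is an assumption made for illustration.
    """
    rng = random.Random(seed)
    augmented = list(verses)  # every original verse is kept once
    for i, verse in enumerate(verses):
        # Verse N followed by verse N+1
        if i + 1 < len(verses) and rng.random() < p_pair:
            augmented.append(verse + " " + verses[i + 1])
        # Verses N, N+1, N+2
        if i + 2 < len(verses) and rng.random() < p_triplet:
            augmented.append(" ".join(verses[i:i + 3]))
        # The same verse recited twice
        if rng.random() < p_repeat:
            augmented.append(verse + " " + verse)
    return augmented
```

Under these assumptions the expected output size is about 6,348 × (1 + 0.80 + 0.24 + 0.15) ≈ 13,902 sentences, which is in line with the 13,901 reported above.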

Files

| File | Format | Description |
|---|---|---|
| `quran_5gram.arpa` | ARPA text | Portable n-gram model |
| `quran_5gram.binary` | KenLM binary | Fast probing hash |
| `config.json` | JSON | Model metadata |

Usage

import kenlm

model = kenlm.Model("quran_5gram.binary")   # or .arpa

# Score a verse (log10 probability)
score = model.score("بسم الله الرحمن الرحيم")

# Perplexity
ppl = model.perplexity("الحمد لله رب العالمين")

# Score a verse sequence (consecutive recitation)
sequence = "بسم الله الرحمن الرحيم الحمد لله رب العالمين"
seq_score = model.score(sequence)

Install KenLM Python bindings: pip install kenlm
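
Because `score` returns a total log10 probability, longer verses always score lower than shorter ones, so perplexity is usually the better quantity for comparing verses of different lengths. The sketch below shows how the two relate, as far as the kenlm Python bindings are understood here: the total log10 score is normalised by the number of words plus one (for the end-of-sentence token). The helper name `perplexity_from_log10` is hypothetical, and the numbers in the example are made up for illustration, not real model outputs.

```python
def perplexity_from_log10(total_log10_prob, num_words):
    """Convert a total log10 sentence probability into per-token perplexity.

    Assumes the score covers the words plus one end-of-sentence token,
    which is why the denominator is num_words + 1.
    """
    return 10.0 ** (-total_log10_prob / (num_words + 1))


# Illustrative values only: a 4-word verse with a total log10 score of -5.0
example_ppl = perplexity_from_log10(-5.0, 4)  # 10 ** (5.0 / 5) = 10.0
```

With a loaded model, `perplexity_from_log10(model.score(text), len(text.split()))` should closely match `model.perplexity(text)` if the assumption about normalisation holds.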
