# BAREC Strict Track Document-Level Readability Model

## Overview

This model performs fine-grained Arabic readability assessment at the document level and was developed for the BAREC Shared Task 2025 (Strict Track). It is based on AraBERTv2 and fine-tuned on the BAREC corpus for 19-level readability classification. Input uses the D3Tok variant, and training combines a Cross-Entropy (CE) loss with a Weighted Kappa Loss (WKL) derived from Quadratic Weighted Kappa.
## Intended Uses & Limitations

- Intended use: predicting the readability level of Arabic sentences or documents on a 1–19 scale
- Domain: Modern Standard Arabic, educational content
## Model Details

- Base model: CAMeL-Lab/readability-arabertv02-word-CE (itself fine-tuned from aubmindlab/bert-base-arabertv02)
- Input variant: D3Tok (token-level)
- Labels: 19 readability levels (1 = easiest, 19 = hardest)
- Losses: CE followed by WKL (best results)
- Strict track: document level
- Best validation QWK: 81.9% (document-level)
## Training Data

- Corpus: BAREC Corpus v1.0
- Split: Train (80%) / Dev (10%) / Test (10%)
- Preprocessing: D3Tok input variant generated with the official BAREC scripts (see the sketch below)
- Cleaning: no additional cleaning beyond the official preprocessing
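The official scripts are the authoritative way to produce D3Tok input. As a rough, unofficial approximation, the sketch below uses CAMeL Tools' morphological tokenizer with its `d3tok` scheme; treat it as an illustration of the input format rather than the exact BAREC pipeline.

```python
# Unofficial sketch: D3Tok-style segmentation with CAMeL Tools
# (requires `pip install camel-tools` plus its pretrained data packages).
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.tokenizers.word import simple_word_tokenize

# Pretrained maximum-likelihood disambiguator shipped with CAMeL Tools
mle = MLEDisambiguator.pretrained()

# The 'd3tok' scheme splits off clitics and affixes at the D3 level
d3_tokenizer = MorphologicalTokenizer(mle, scheme="d3tok", split=True)

sentence = "هذه الجملة أكثر تعقيدًا وتتطلب مستوى قراءة أعلى."
words = simple_word_tokenize(sentence)
print(" ".join(d3_tokenizer.tokenize(words)))
```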
## Training Procedure

- Loss functions: Cross-Entropy first, then Weighted Kappa Loss (WKL); a sketch of the WKL objective follows this list
- Hyperparameters:
  - Learning rate: 1e-5
  - Batch size: 32
  - Epochs: 8
  - Scheduler: cosine_with_restarts
  - Weight decay: 0.05
  - fp16: enabled
- Metrics: QWK (Quadratic Weighted Kappa), macro F1, accuracy
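The card names the loss but not its implementation. Below is a minimal, hypothetical sketch of one common differentiable weighted-kappa loss (observed over expected quadratic disagreement); the actual WKL used to train this model may differ in its details.

```python
import torch

def weighted_kappa_loss(logits: torch.Tensor, labels: torch.Tensor,
                        num_classes: int = 19, power: int = 2) -> torch.Tensor:
    """Differentiable weighted-kappa-style loss (one common formulation;
    the exact WKL used for this model may differ)."""
    probs = torch.softmax(logits, dim=1)  # (batch, num_classes)
    classes = torch.arange(num_classes, dtype=probs.dtype, device=probs.device)
    # Quadratic disagreement weights: w[i, j] = (i - j)^2
    weights = (classes.unsqueeze(0) - classes.unsqueeze(1)) ** power
    # Observed disagreement: expected penalty under the predicted distributions
    observed = (weights[labels] * probs).sum()
    # Expected disagreement if predictions were independent of the gold labels
    label_hist = torch.bincount(labels, minlength=num_classes).to(probs.dtype)
    pred_hist = probs.sum(dim=0)
    expected = (weights * torch.outer(label_hist, pred_hist)).sum() / labels.numel()
    return observed / (expected + 1e-8)  # minimizing this maximizes QWK

# Example: a random batch of 4 items over 19 levels (0-indexed labels)
logits = torch.randn(4, 19)
labels = torch.tensor([0, 4, 18, 7])
print(weighted_kappa_loss(logits, labels))
```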
## Evaluation Results
| Split | QWK |
|---|---|
| Validation | 81.9% |
| Test (Public) | 82.8% |
| Blind Test* | 79.0% |
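QWK penalizes a prediction by the squared distance between the predicted and gold levels, so near-misses on the 1–19 scale cost little. It can be reproduced with scikit-learn (the label values below are made-up examples):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical gold and predicted readability levels (1-19)
y_true = [3, 7, 12, 5, 18]
y_pred = [4, 7, 10, 5, 17]

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```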
## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "shymaa25/barec-readability-doc-arabertv02-word-ce-wkl-strict"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Split your document into sentences (as a list).
# Note: the model was trained on D3Tok input; for best results, preprocess
# sentences into the D3Tok variant first (raw text is shown here for brevity).
sentences = [
    "هذه جملة سهلة.",  # "This is an easy sentence."
    "هذه الجملة أكثر تعقيدًا وتتطلب مستوى قراءة أعلى.",  # "This sentence is more complex and requires a higher reading level."
    "جملة متوسطة الصعوبة."  # "A sentence of medium difficulty."
]

# Predict readability for each sentence and select the hardest
levels = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=1).item() + 1  # labels 1-19
    levels.append(pred)

doc_level = max(levels)  # The hardest sentence determines the document level
print(f"Document readability level: {doc_level}")
```