BAREC Strict Track Document-Level Readability Model

Overview

This model performs fine-grained Arabic readability assessment at the document level and was developed for the BAREC Shared Task 2025 (Strict Track). It is based on AraBERTv2 and fine-tuned on the BAREC corpus for 19-level readability classification. The model uses the D3Tok input variant and a combination of Cross-Entropy (CE) and Quadratic Weighted Kappa loss (WKL).
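The WKL component is a differentiable surrogate for the Quadratic Weighted Kappa metric. The exact implementation is not reproduced here; the PyTorch sketch below shows one common soft-QWK formulation, assuming a 19-way softmax head (the function name and the small epsilon are illustrative, not the official code):

import torch
import torch.nn.functional as F

def weighted_kappa_loss(logits, labels, num_classes=19):
    """Soft QWK loss: low when probability mass sits near the true level.
    labels: LongTensor of class indices in [0, num_classes - 1]."""
    probs = F.softmax(logits, dim=1)                                  # (B, C)
    idx = torch.arange(num_classes, device=logits.device).float()
    # Quadratic penalty matrix: w[i, j] = (i - j)^2 / (C - 1)^2
    w = (idx.view(-1, 1) - idx.view(1, -1)) ** 2 / (num_classes - 1) ** 2
    # Observed disagreement: expected penalty of predictions vs. true labels
    observed = (w[labels] * probs).sum()
    # Chance disagreement: penalty expected under independent marginals
    label_dist = F.one_hot(labels, num_classes).float().mean(dim=0)   # (C,)
    pred_dist = probs.mean(dim=0)                                     # (C,)
    expected = (w * torch.outer(label_dist, pred_dist)).sum() * logits.size(0)
    return observed / (expected + 1e-8)                               # = 1 - QWK

Minimizing this observed-to-chance disagreement ratio maximizes kappa directly, which is why switching from CE to WKL late in training can lift the QWK metric.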

Intended Uses & Limitations

  • Intended use: Predicting the readability of Arabic sentences or documents (scale 1–19)
  • Domain: Modern Standard Arabic, educational content

Model Details

  • Base model: CAMeL-Lab/readability-arabertv02-word-CE
  • Input variant: D3Tok (token-level)
  • Labels: 19 readability levels (1 = easiest, 19 = hardest)
  • Losses: CE followed by WKL fine-tuning (best-performing configuration)
  • Strict track: Document
  • Best QWK: 81.9% (document level, validation)

Training Data

  • Corpus: BAREC Corpus v1.0
  • Train/Dev/Test split: 80% / 10% / 10%
  • Preprocessing: D3Tok input variant generated with the official scripts (see the sketch after this list)
  • Cleaning: No additional cleaning, only official preprocessing
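For reference, a D3Tok variant can be produced with CAMeL Tools, which the official scripts build on. This is a minimal sketch assuming the camel_tools package and its pretrained MLE disambiguator; the official BAREC pipeline may differ in detail:

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.tokenizers.word import simple_word_tokenize

# The pretrained MLE disambiguator drives the morphological tokenizer
mle = MLEDisambiguator.pretrained()
d3tok = MorphologicalTokenizer(disambiguator=mle, scheme="d3tok", split=True)

words = simple_word_tokenize("هذه الجملة أكثر تعقيدًا.")  # "This sentence is more complex."
print(" ".join(d3tok.tokenize(words)))  # D3Tok segments, with clitics split off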

Training Procedure

  • Loss functions: Cross-Entropy first, then Quadratic Weighted Kappa loss (WKL)
  • Hyperparameters (reconstructed in the sketch after this list):
    • Learning rate: 1e-5
    • Batch size: 32
    • Epochs: 8
    • Scheduler: cosine_with_restarts
    • Weight Decay: 0.05
    • fp16: enabled
  • Metrics: QWK (Quadratic Weighted Kappa), macro F1, accuracy
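These settings map directly onto Hugging Face TrainingArguments, as in the minimal sketch below. The output_dir value is an assumed placeholder, and the CE-then-WKL schedule itself requires a custom Trainer that overrides compute_loss (not shown):

from transformers import TrainingArguments

# Reconstruction of the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="barec-doc-readability",  # assumed placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=8,
    lr_scheduler_type="cosine_with_restarts",
    weight_decay=0.05,
    fp16=True,
)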

Evaluation Results

Split          QWK
Validation     81.9%
Test (Public)  82.8%
Blind Test*    79.0%
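QWK is Cohen's kappa with quadratic weights, so agreement is discounted by the squared distance between the predicted and gold levels. For reference, it can be computed with scikit-learn; the label lists below are illustrative placeholders, not model outputs:

from sklearn.metrics import cohen_kappa_score

# Toy labels on the 1-19 scale; near-misses are penalized less than far misses
y_true = [3, 7, 12, 5, 18]
y_pred = [4, 7, 11, 5, 16]
print(cohen_kappa_score(y_true, y_pred, weights="quadratic"))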

Usage Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "shymaa25/barec-readability-doc-arabertv02-word-ce-wkl-strict"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Split your document into sentences (as a list). The model was trained on
# the D3Tok input variant, so for best results preprocess raw text with the
# official D3Tok scripts first (see Training Data above).
sentences = [
    "هذه جملة سهلة.",  # "This is an easy sentence."
    "هذه الجملة أكثر تعقيدًا وتتطلب مستوى قراءة أعلى.",  # "This sentence is more complex and requires a higher reading level."
    "جملة متوسطة الصعوبة."  # "A sentence of medium difficulty."
]

# Predict a readability level for each sentence, then take the hardest one
levels = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=1).item() + 1  # class 0-18 -> level 1-19
    levels.append(pred)

doc_level = max(levels)  # the hardest sentence determines the document level
print(f"Document readability level: {doc_level}")