nuha-mlm

Model Summary

nuha-mlm is an Arabic BERT model pre-trained from scratch on Jordanian social media text using masked language modelling (MLM). It serves as the domain-adapted base for the NUHA classifier family — nuha-binary and nuha-multiclass — and was developed as part of a pilot proof-of-concept for the NUHA project by the Jordan Open Source Association (JOSA).

Rather than fine-tuning an existing Arabic BERT, nuha-mlm starts from a fresh vocabulary trained on the NUHA corpus, so the model is adapted to the vocabulary and linguistic patterns of colloquial Jordanian Arabic social media text.
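For readers new to the MLM objective, the following is a minimal, library-free sketch of how training inputs are typically corrupted. It uses the 15% selection rate listed under Training Procedure. The 80/10/10 replacement split is the usual BERT-style default and is an assumption here, since this card does not state which split was used. The function name `mask_tokens`, the toy vocabulary, and the seed are hypothetical and only serve the illustration.

```python
import random

MASK_TOKEN = "[MASK]"
MLM_PROBABILITY = 0.15  # selection rate reported in this card


def mask_tokens(tokens, vocab, rng, mlm_probability=MLM_PROBABILITY):
    """Return (corrupted_tokens, labels) for one tokenized sequence.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would only be computed on selected positions.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # assumed 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # assumed 10%: replace with a random token
        # assumed remaining 10%: keep the original token unchanged
    return corrupted, labels


# Toy demonstration on placeholder tokens (not real NUHA data).
rng = random.Random(0)
toy_vocab = [f"tok{i}" for i in range(50)]
sentence = [f"tok{i}" for i in range(20)]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng)
for original, new, label in zip(sentence, corrupted, labels):
    marker = "<- predict" if label is not None else ""
    print(original, new, marker)
```

In practice this step is normally handled by a data collator in the training framework rather than written by hand; the sketch only shows what the objective asks the model to recover.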

Uses

Direct Use

nuha-mlm can be used for masked token prediction on Arabic social media text, or as a base model for fine-tuning downstream Arabic NLP tasks — particularly those involving informal Jordanian or Levantine Arabic.

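from transformers import pipeline

# Load the fill-mask pipeline with the nuha-mlm checkpoint from the Hugging Face Hub
fill_mask = pipeline("fill-mask", model="thejosango/nuha-mlm")

# Predict the token hidden behind [MASK]; each result holds a candidate token and its score
results = fill_mask("هذه المرأة [MASK] جداً")
for r in results:
    print(r["token_str"], r["score"])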

Downstream Use

The primary intended downstream use is fine-tuning for hate speech and gender-based violence detection using the NUHA dataset, as demonstrated by nuha-binary and nuha-multiclass.

Out-of-Scope Use

This model was pre-trained on a relatively small, domain-specific corpus of Jordanian social media comments. It is not suitable as a general-purpose Arabic language model and should not be used as a replacement for models trained on broader Arabic corpora (e.g. for MSA tasks, translation, or summarisation).

Bias, Risks, and Limitations

  • Corpus bias: The pre-training corpus consists entirely of social media comments, many of which contain offensive or harmful language. The model may reflect biases present in that content.
  • Dialect coverage: The corpus is primarily Jordanian Arabic. Performance on other Arabic dialects or Modern Standard Arabic is not guaranteed.
  • Pilot scale: As part of an initial proof-of-concept effort, the pre-training corpus size and training duration are modest. A larger corpus and longer training would likely yield a stronger base model.
  • Vocabulary size: The tokenizer vocabulary (17,513 tokens) was trained specifically on this corpus and is much smaller than general Arabic BERT vocabularies.

Training Details

Training Data

Pre-trained on the text portion of thejosango/nuha-dataset, a corpus of Arabic social media comments collected from Jordanian platforms. The binary split of the dataset was used for pre-training (text only, labels ignored).

Training Procedure

  • Architecture: BERT (BertForMaskedLM), 12 layers, 12 attention heads, 768 hidden size
  • Vocabulary: Custom BPE tokenizer trained from scratch on the corpus (17,513 tokens), with [URL] as an added special token
  • MLM probability: 15%
  • Optimizer: Adam (β₁=0.9, β₂=0.999, ε=1e-8)
  • Learning rate: 1e-4 with linear schedule, 5,000 warmup steps
  • Batch size: 128
  • Epochs: 25
  • Framework: Transformers 4.32.1, PyTorch 2.0.1
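The learning-rate settings above can be sketched as a small function. This is an illustration under assumptions, not the training script: it reads "linear schedule" as linear warmup to the peak rate followed by linear decay to zero, which is a common convention, and `TOTAL_STEPS` is a hypothetical value because the card does not report the total number of optimizer steps. The name `linear_schedule_lr` is likewise made up for this example.

```python
PEAK_LR = 1e-4        # learning rate reported in this card
WARMUP_STEPS = 5_000  # warmup steps reported in this card


def linear_schedule_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# TOTAL_STEPS is hypothetical; the real value depends on corpus size, batch size, and epochs.
TOTAL_STEPS = 20_000
for step in (0, 2_500, 5_000, 12_500, 20_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

With 5,000 warmup steps, a short run spends a large share of training below the peak rate, which is worth keeping in mind when reusing these settings on a corpus of a different size.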

Training Results

Epoch   Validation Loss
1       8.435
5       7.296
10      6.499
15      6.118
20      5.863
25      5.788
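To make these numbers easier to interpret, the validation losses reported above can be converted to perplexity and compared with the loss of a uniform guess over the 17,513-token vocabulary. This sketch assumes the reported loss is the mean cross-entropy in nats over masked positions, which is the usual convention but is not stated in this card; `perplexity` is a helper name introduced here.

```python
import math

# (epoch, validation loss) pairs copied from the results above
VALIDATION_LOSS = [
    (1, 8.435), (5, 7.296), (10, 6.499),
    (15, 6.118), (20, 5.863), (25, 5.788),
]
VOCAB_SIZE = 17_513  # tokenizer vocabulary size reported in this card


def perplexity(loss_nats):
    """Perplexity for a mean cross-entropy loss expressed in nats (assumed unit)."""
    return math.exp(loss_nats)


# Loss of a model that assigns equal probability to every vocabulary entry.
uniform_loss = math.log(VOCAB_SIZE)
print(f"uniform-guess loss: {uniform_loss:.3f}")

for epoch, loss in VALIDATION_LOSS:
    print(f"epoch {epoch:>2}: loss {loss:.3f} -> perplexity {perplexity(loss):.0f}")
```

Under that assumption, the final loss corresponds to a masked-token perplexity in the low hundreds, consistent with the modest corpus size and training duration noted under limitations.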

This model was developed as part of an initial pilot study. It is intended as a stepping stone for the downstream NUHA classifiers rather than as a general-purpose Arabic language model.
