nuha-mlm

Model Summary

nuha-mlm is an Arabic BERT model pre-trained from scratch on Jordanian social media text using masked language modelling (MLM). It serves as the domain-adapted base for the NUHA classifier family — nuha-binary and nuha-multiclass — and was developed as part of a pilot proof-of-concept for the NUHA project by the Jordan Open Source Association (JOSA).

Because its vocabulary was trained from scratch on the NUHA corpus, rather than inherited from an existing Arabic BERT checkpoint, the model is tailored to the vocabulary and linguistic patterns of colloquial Jordanian Arabic social media text.
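For readers unfamiliar with the objective, the following is a minimal sketch of how masked language modelling corrupts an input sequence. It uses the 15% masking probability listed under Training Procedure. The 80/10/10 split between [MASK], a random token, and the unchanged token is the original BERT recipe and the default of the Transformers data collator; it is assumed here, not confirmed for this model. The function name mask_tokens and the toy vocabulary are illustrative only, and special-token handling is omitted.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on the selected positions.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: leave unchanged
    return corrupted, labels


# Toy usage with whitespace "tokens" (a real run would use the model's tokenizer).
toy_vocab = ["a", "b", "c", "d", "e"]
corrupted, labels = mask_tokens(["a", "b", "c", "d", "e"] * 4, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The model is then trained to recover the original token at every position where the label is not None.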

Uses

Direct Use

nuha-mlm can be used for masked token prediction on Arabic social media text, or as a base model for fine-tuning downstream Arabic NLP tasks — particularly those involving informal Jordanian or Levantine Arabic.

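from transformers import pipeline

# Load the pre-trained checkpoint with its masked-LM head.
fill_mask = pipeline("fill-mask", model="thejosango/nuha-mlm")

# The input must contain the tokenizer's mask token, [MASK].
results = fill_mask("هذه المرأة [MASK] جداً")

# Each result holds a candidate token and its predicted probability.
for r in results:
    print(r["token_str"], r["score"])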

Downstream Use

The primary intended downstream use is fine-tuning for hate speech and gender-based violence detection using the NUHA dataset, as demonstrated by nuha-binary and nuha-multiclass.
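As a rough starting point, the sketch below shows one way to load this checkpoint with a sequence-classification head. It is not the procedure used to train nuha-binary or nuha-multiclass, which this card does not describe. The helper names and the label names in the usage note are hypothetical. The transformers import is deferred so that the label helper can be used on its own.

```python
def build_label_maps(label_names):
    """Build the id2label / label2id dictionaries expected by the model config."""
    id2label = dict(enumerate(label_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(label_names, model_id="thejosango/nuha-mlm"):
    """Load the tokenizer and the encoder with a new classification head."""
    # Imported lazily: requires the transformers package and network access.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(label_names)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling load_classifier(["not_hateful", "hateful"]) (placeholder label names) returns a model whose encoder weights come from nuha-mlm and whose classification head is randomly initialised, so it must be fine-tuned on labelled data before its predictions mean anything.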

Out-of-Scope Use

This model was pre-trained on a relatively small, domain-specific corpus of Jordanian social media comments. It is not suitable as a general-purpose Arabic language model and should not be used as a replacement for models trained on broader Arabic corpora (e.g. for MSA tasks, translation, or summarisation).

Bias, Risks, and Limitations

Corpus bias: The pre-training corpus consists entirely of social media comments, many of which contain offensive or harmful language. The model may reflect biases present in that content.
Dialect coverage: The corpus is primarily Jordanian Arabic. Performance on other Arabic dialects or Modern Standard Arabic is not guaranteed.
Pilot scale: As part of an initial proof-of-concept effort, the pre-training corpus size and training duration are modest. A larger corpus and longer training would likely yield a stronger base model.
Vocabulary size: The tokenizer vocabulary (17,513 tokens) was trained specifically on this corpus and is much smaller than general Arabic BERT vocabularies.

Training Details

Training Data

The model was pre-trained on the text portion of thejosango/nuha-dataset, a corpus of Arabic social media comments collected from Jordanian platforms. The binary split of the dataset was used for pre-training (text only; labels ignored).

Training Procedure

Architecture: BERT (BertForMaskedLM), 12 layers, 12 attention heads, hidden size 768
Vocabulary: Custom BPE tokenizer trained from scratch on the corpus (17,513 tokens), with [URL] as an added special token
MLM probability: 15%
Optimizer: Adam (β₁=0.9, β₂=0.999, ε=1e-8)
Learning rate: 1e-4 with linear schedule, 5,000 warmup steps
Batch size: 128
Epochs: 25
Framework: Transformers 4.32.1, PyTorch 2.0.1
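The learning-rate entry above can be read as a linear warmup to 1e-4 over the first 5,000 steps followed by a linear decay to zero, which is how the linear schedule in Transformers behaves. The sketch below reproduces that shape in plain Python. The total number of training steps is not stated in this card, so the value used in the example is a placeholder, and the function name is illustrative.

```python
def linear_schedule_lr(step, total_steps, peak_lr=1e-4, warmup_steps=5000):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# ASSUMPTION: 50,000 total steps is a placeholder, not a figure from this card.
TOTAL_STEPS = 50_000
for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

The peak rate is reached exactly at the end of warmup; how quickly it then decays depends on the true step count.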

Training Results

Epoch	Validation Loss
1	8.435
5	7.296
10	6.499
15	6.118
20	5.863
25	5.788
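To put these losses in context, the sketch below converts each one to a perplexity and compares it with the loss of a model that guesses uniformly over the 17,513-token vocabulary. It assumes the reported validation loss is the mean cross-entropy over masked tokens measured in nats, which is the PyTorch default but is not stated explicitly in this card.

```python
import math

VOCAB_SIZE = 17_513  # tokenizer vocabulary size reported in this card

# Validation loss per epoch, copied from the table above.
validation_loss = {1: 8.435, 5: 7.296, 10: 6.499, 15: 6.118, 20: 5.863, 25: 5.788}


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss)


# Loss of a model that assigns equal probability to every vocabulary entry.
uniform_loss = math.log(VOCAB_SIZE)

print(f"uniform baseline: loss {uniform_loss:.3f}, perplexity {VOCAB_SIZE:,}")
for epoch, loss in validation_loss.items():
    print(f"epoch {epoch:>2}: loss {loss:.3f}, perplexity {perplexity(loss):,.0f}")
```

Under that assumption, every reported loss is below the uniform baseline and the final perplexity is in the low hundreds, which fits the card's description of a modest pilot-scale pre-training run.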

This model was developed as part of an initial pilot study. It is intended as a stepping stone for the downstream NUHA classifiers rather than as a general-purpose Arabic language model.
