nuha-mlm
Model Summary
nuha-mlm is an Arabic BERT model pre-trained from scratch on Jordanian social media text using masked language modelling (MLM). It serves as the domain-adapted base for the NUHA classifier family — nuha-binary and nuha-multiclass — and was developed as part of a pilot proof-of-concept for the NUHA project by the Jordan Open Source Association (JOSA).
Because it starts from a fresh vocabulary trained on the NUHA corpus, rather than from an existing Arabic BERT checkpoint, the model is adapted to the vocabulary and linguistic patterns of colloquial Jordanian Arabic social media text.
Uses
Direct Use
nuha-mlm can be used for masked token prediction on Arabic social media text, or as a base model for fine-tuning downstream Arabic NLP tasks — particularly those involving informal Jordanian or Levantine Arabic.
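    from transformers import pipeline

    # Load the fill-mask pipeline with the pre-trained checkpoint
    fill_mask = pipeline("fill-mask", model="thejosango/nuha-mlm")
    # Print each candidate token for the [MASK] position with its score
    results = fill_mask("هذه المرأة [MASK] جداً")
    for r in results:
        print(r["token_str"], r["score"])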
Downstream Use
The primary intended downstream use is fine-tuning for hate speech and gender-based violence detection using the NUHA dataset, as demonstrated by nuha-binary and nuha-multiclass.
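As a rough illustration of that downstream path, the sketch below loads the checkpoint with a fresh sequence-classification head. It is a minimal starting point under stated assumptions, not the procedure used to train nuha-binary or nuha-multiclass: the helper name `load_for_classification` and the choice of `num_labels=2` for a binary task are illustrative and do not come from this card.

```python
MODEL_ID = "thejosango/nuha-mlm"


def load_for_classification(num_labels: int, model_id: str = MODEL_ID):
    """Return (tokenizer, model) with a new, untrained classification head.

    Hypothetical helper for illustration; the name is not part of any library.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be installed or the model to be downloaded.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The MLM head is discarded and a randomly initialised classifier is
    # attached; Transformers warns about the new weights, which is expected.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_classification(num_labels=2)` downloads the checkpoint and returns objects that can be passed to a standard fine-tuning loop. The classifier head produces meaningless scores until it has been trained on labelled data.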
Out-of-Scope Use
This model was pre-trained on a relatively small, domain-specific corpus of Jordanian social media comments. It is not suitable as a general-purpose Arabic language model and should not be used as a replacement for models trained on broader Arabic corpora (e.g. for MSA tasks, translation, or summarisation).
Bias, Risks, and Limitations
- Corpus bias: The pre-training corpus consists entirely of social media comments, many of which contain offensive or harmful language. The model may reflect biases present in that content.
- Dialect coverage: The corpus is primarily Jordanian Arabic. Performance on other Arabic dialects or Modern Standard Arabic is not guaranteed.
- Pilot scale: As part of an initial proof-of-concept effort, the pre-training corpus size and training duration are modest. A larger corpus and longer training would likely yield a stronger base model.
- Vocabulary size: The tokenizer vocabulary (17,513 tokens) was trained specifically on this corpus and is much smaller than general Arabic BERT vocabularies.
Training Details
Training Data
Pre-trained on the text portion of thejosango/nuha-dataset, a corpus of Arabic social media comments collected from Jordanian platforms. The binary dataset split was used for pre-training (text only, labels ignored).
Training Procedure
- Architecture: BERT (BertForMaskedLM), 12 layers, 12 attention heads, hidden size 768
- Vocabulary: Custom BPE tokenizer trained from scratch on the corpus (17,513 tokens), with [URL] as an added special token
- MLM probability: 15%
- Optimizer: Adam (β₁=0.9, β₂=0.999, ε=1e-8)
- Learning rate: 1e-4 with linear schedule, 5,000 warmup steps
- Batch size: 128
- Epochs: 25
- Framework: Transformers 4.32.1, PyTorch 2.0.1
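Two of the settings above, the 15% MLM probability and the linear schedule with 5,000 warmup steps, can be sketched in plain Python. This is an illustration under assumptions rather than the training code itself. The 80/10/10 split between `[MASK]`, random-token and unchanged replacements is the usual BERT recipe and the default of the Transformers MLM data collator, but this card does not confirm it was used. The total step count is not stated in the card, so `total_steps` is a hypothetical argument, and the function names are illustrative.

```python
import random

MLM_PROBABILITY = 0.15  # from the list above
PEAK_LR = 1e-4          # from the list above
WARMUP_STEPS = 5_000    # from the list above


def linear_schedule_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup to PEAK_LR, then linear decay to 0.

    `total_steps` is hypothetical; the card does not report it.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - WARMUP_STEPS)


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), seed=None):
    """Return (inputs, labels) after BERT-style random masking.

    Labels are -100 (ignored by the loss) except at the selected positions,
    where they hold the original token id.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; others are picked with p = 0.15.
        if tok in special_ids or rng.random() >= MLM_PROBABILITY:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:            # assumed: 80% become the mask token
            inputs[i] = mask_id
        elif roll < 0.9:          # assumed: 10% become a random token
            inputs[i] = rng.randrange(vocab_size)
        # assumed: the remaining 10% keep the original token
    return inputs, labels
```

With a hypothetical `total_steps=20_000`, `linear_schedule_lr(2_500, 20_000)` is half the peak rate and `linear_schedule_lr(20_000, 20_000)` is zero. `mask_tokens` returns labels that a cross-entropy loss can use directly, because positions marked -100 are skipped.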
Training Results
| Epoch | Validation Loss |
|---|---|
| 1 | 8.435 |
| 5 | 7.296 |
| 10 | 6.499 |
| 15 | 6.118 |
| 20 | 5.863 |
| 25 | 5.788 |
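One way to read these numbers is to convert each loss to a perplexity. The sketch below assumes the reported validation loss is the mean cross-entropy over masked tokens in nats, as the Transformers MLM loss is by default; the card does not state this explicitly. The uniform-guess baseline uses the 17,513-token vocabulary size given above.

```python
import math

# (epoch, validation loss) pairs copied from the table above.
VAL_LOSS = [(1, 8.435), (5, 7.296), (10, 6.499), (15, 6.118), (20, 5.863), (25, 5.788)]
VOCAB_SIZE = 17_513


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


# Loss of a model that guesses uniformly over the vocabulary, for reference.
uniform_loss = math.log(VOCAB_SIZE)
print(f"uniform baseline: loss {uniform_loss:.3f} -> perplexity {VOCAB_SIZE:,}")

for epoch, loss in VAL_LOSS:
    print(f"epoch {epoch:>2}: loss {loss:.3f} -> perplexity {perplexity(loss):,.0f}")
```

The loss falls by 2.647 between epoch 1 and epoch 25, but only 0.075 of that comes from the last five epochs, which suggests the curve was flattening by the end of training.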
This model was developed as part of an initial pilot study. It is intended as a stepping stone for the downstream NUHA classifiers rather than as a general-purpose Arabic language model.