---
license: apache-2.0
language:
  - sv
  - da
  - no
  - is
library_name: transformers
tags:
  - fill-mask
  - modernbert
  - scandinavian
  - swedish
  - danish
  - norwegian
  - icelandic
base_model: jhu-clsp/mmBERT-base
datasets:
  - HPLT/HPLT2.0_cleaned
---

# scandmmBERT: A ModernBERT Specialized for Scandinavian Languages

## Model Description

**scandmmBERT** is a masked language model based on `jhu-clsp/mmBERT-base` that has undergone continued pre-training on a large corpus of Scandinavian languages (Swedish, Danish, Norwegian, and Icelandic).

The original `mmBERT` is a powerful multilingual model trained on over 1,800 languages. This version specializes that broad knowledge by continued pre-training on a large volume of high-quality, in-domain text, making it a strong specialized base for Scandinavian NLU tasks.

This project was developed as a hands-on exploration of large-scale model training on high-performance computing resources. The full development and troubleshooting process is detailed in the corresponding GitHub repository: [https://github.com/joenaess/scandmmBERT](https://github.com/joenaess/scandmmBERT).

## Intended Uses & Limitations

This model is intended to be used as a base for fine-tuning on specific downstream tasks (see the minimal sketch in the Fine-tuning Example section below). It is particularly well-suited for:

* Text Classification (e.g., sentiment analysis, topic classification)
* Named Entity Recognition (NER)
* Question Answering

#### Limitations

* This is a masked language model and is not suitable for text generation.
* The model has not been fine-tuned for any specific task and should be adapted to your use case.
* The model inherits potential biases and stereotypes present in the web-crawled training data (`HPLT 2.0`).

## How to Use

You can use this model directly with the `fill-mask` pipeline for masked word prediction.

```python
from transformers import pipeline

# Replace YOUR_USERNAME with your actual Hugging Face username
model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"
unmasker = pipeline('fill-mask', model=model_id)

# Swedish
result_sv = unmasker("Sveriges huvudstad heter [MASK].")
print([r['token_str'] for r in result_sv])

# Danish
result_da = unmasker("Dronningen af Danmark hedder [MASK].")
print([r['token_str'] for r in result_da])
```

## Training Procedure

### Pre-training Data

The model was trained on a combined and interleaved stream of the following language subsets from the [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) dataset:

* Icelandic (`isl_Latn`)
* Norwegian Nynorsk (`nno_Latn`)
* Swedish (`swe_Latn`)
* Danish (`dan_Latn`)
* Norwegian Bokmål (`nob_Latn`)

Due to storage constraints, the smaller Icelandic and Nynorsk datasets were used in their entirety, while the larger Swedish, Danish, and Bokmål datasets were sampled. A sketch of this streaming setup is shown at the end of this section.

### Hyperparameters

The continued pre-training was performed for 50,000 steps using the following configuration (a sketch of the corresponding `Trainer` setup follows below):

| Hyperparameter                | Value      |
| ----------------------------- | ---------- |
| `learning_rate`               | `2e-5`     |
| `per_device_train_batch_size` | `2`        |
| `gradient_accumulation_steps` | `16`       |
| **Effective Batch Size**      | **64**     |
| `max_steps`                   | `50,000`   |
| `optimizer`                   | AdamW      |
| `precision`                   | `bf16`     |
| `max_seq_length`              | `512`      |

The training was performed on a server with 2x NVIDIA L4 GPUs (24 GB VRAM each) using PyTorch, Hugging Face `transformers`, and `accelerate`. The environment was managed with `pixi`.
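The data mixing described above can be reproduced roughly with the `datasets` streaming API. This is a minimal sketch assuming the subset names listed above; the sampling probabilities are illustrative placeholders, not the ratios actually used.

```python
from datasets import interleave_datasets, load_dataset

SUBSETS = ["isl_Latn", "nno_Latn", "swe_Latn", "dan_Latn", "nob_Latn"]

# Stream each language subset so the full corpora never need to fit on disk.
streams = [
    load_dataset("HPLT/HPLT2.0_cleaned", name, split="train", streaming=True)
    for name in SUBSETS
]

# Interleave the streams into one mixed corpus. The probabilities below are
# placeholders for illustration; the exact sampling ratios are not stated here.
mixed = interleave_datasets(
    streams,
    probabilities=[0.05, 0.05, 0.35, 0.25, 0.30],
    seed=42,
)

print(next(iter(mixed))["text"][:200])  # peek at one document from the mix
```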
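The hyperparameter table translates fairly directly into a `transformers` `Trainer` configuration. The sketch below is a hedged reconstruction rather than the exact training script: the output directory name is arbitrary, the masking rate is left at the library default because it is not stated above, and the dataset wiring (using `mixed` from the previous sketch) is only outlined in comments.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Truncate to the max_seq_length from the table above.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="scandmmBERT-base-scandinavian",  # arbitrary name
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # x 2 GPUs x 16 accumulation steps = 64 effective
    gradient_accumulation_steps=16,
    max_steps=50_000,
    optim="adamw_torch",
    bf16=True,
)

# Masking probability left at the library default; the value used is not stated.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=mixed.map(tokenize, batched=True))
# trainer.train()
```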
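## Fine-tuning Example

As noted under Intended Uses & Limitations, this model is meant to be adapted to a downstream task. The following is a minimal sketch for sequence classification; the dataset name and label count are hypothetical placeholders to be replaced with your own task.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels is task-specific; 2 is a placeholder for a binary sentiment task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# dataset = load_dataset("your-org/your-scandinavian-task")  # hypothetical dataset
# tokenized = dataset.map(tokenize, batched=True)
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="scandmmBERT-finetuned", num_train_epochs=3),
#     train_dataset=tokenized["train"],
#     eval_dataset=tokenized["validation"],
# )
# trainer.train()
```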
## Evaluation

A simple qualitative evaluation using the `fill-mask` pipeline shows strong performance in predicting contextually relevant words in Scandinavian languages.

**Swedish 🇸🇪**

* **Input:** `Sveriges huvudstad heter [MASK].`
* **Top Prediction:** `Stockholm`

**Danish 🇩🇰**

* **Input:** `Dronningen af Danmark hedder [MASK].`
* **Top Prediction:** `Margrethe`

**Norwegian 🇳🇴**

* **Input:** `Norges mest berømte maler er Edvard [MASK].`
* **Top Prediction:** `Munch`

## Citation

If you use this model in your work, please also cite the original `mmBERT` and `HPLT` sources. This model can be cited as:

```bibtex
@misc{scandmmbert2025,
  author       = {Jonas Lind},
  title        = {scandmmBERT: A ModernBERT Specialized for Scandinavian Languages},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/scandmmBERT-base-scandinavian}}
}
```