---
license: apache-2.0
language:
  - sv
  - da
  - no
  - is
library_name: transformers
tags:
  - fill-mask
  - modernbert
  - scandinavian
  - swedish
  - danish
  - norwegian
  - icelandic
base_model: jhu-clsp/mmBERT-base
datasets:
  - HPLT/HPLT2.0_cleaned
---

# scandmmBERT: A ModernBERT Specialized for Scandinavian Languages

## Model Description

**scandmmBERT** is a masked language model based on `jhu-clsp/mmBERT-base` that has undergone continued pre-training on a large corpus of Scandinavian languages (Swedish, Danish, Norwegian, and Icelandic).

The original `mmBERT` is a powerful multilingual model trained on over 1,800 languages. This version specializes that broad knowledge by continued pre-training on a large volume of high-quality, in-domain text, making it a strong specialized base for Scandinavian NLU tasks.

This project was developed as a hands-on exploration of large-scale model training on high-performance computing resources. The full development and troubleshooting process is detailed in the corresponding GitHub repository: [https://github.com/joenaess/scandmmBERT](https://github.com/joenaess/scandmmBERT).

## Intended Uses & Limitations

This model is intended to be used as a base for fine-tuning on specific downstream tasks (see the minimal sketch in the Fine-tuning Example section below). It is particularly well-suited for:

* Text Classification (e.g., sentiment analysis, topic classification)
* Named Entity Recognition (NER)
* Question Answering

#### Limitations

* This is a masked language model and is not suitable for text generation.
* The model has not been fine-tuned for any specific task and should be adapted to your use case.
* The model inherits potential biases and stereotypes present in the web-crawled training data (`HPLT 2.0`).

## How to Use

You can use this model directly with the `fill-mask` pipeline for masked word prediction.

```python
from transformers import pipeline

# Replace YOUR_USERNAME with your actual Hugging Face username
model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"
unmasker = pipeline('fill-mask', model=model_id)

# Swedish
result_sv = unmasker("Sveriges huvudstad heter [MASK].")
print([r['token_str'] for r in result_sv])

# Danish
result_da = unmasker("Dronningen af Danmark hedder [MASK].")
print([r['token_str'] for r in result_da])
```

## Training Procedure

### Pre-training Data

The model was trained on a combined and interleaved stream of the following language subsets from the [HPLT/HPLT2.0_cleaned](https://huggingface.co/datasets/HPLT/HPLT2.0_cleaned) dataset:

* Icelandic (`isl_Latn`)
* Norwegian Nynorsk (`nno_Latn`)
* Swedish (`swe_Latn`)
* Danish (`dan_Latn`)
* Norwegian Bokmål (`nob_Latn`)

Due to storage constraints, the smaller Icelandic and Nynorsk datasets were used in their entirety, while the larger Swedish, Danish, and Bokmål datasets were sampled. A sketch of this streaming setup is shown at the end of this section.

### Hyperparameters

The continued pre-training was performed for 50,000 steps using the following configuration (a sketch of the corresponding `Trainer` setup follows below):

| Hyperparameter                | Value      |
| ----------------------------- | ---------- |
| `learning_rate`               | `2e-5`     |
| `per_device_train_batch_size` | `2`        |
| `gradient_accumulation_steps` | `16`       |
| **Effective Batch Size**      | **64**     |
| `max_steps`                   | `50,000`   |
| `optimizer`                   | AdamW      |
| `precision`                   | `bf16`     |
| `max_seq_length`              | `512`      |

The training was performed on a server with 2x NVIDIA L4 GPUs (24 GB VRAM each) using PyTorch, Hugging Face `transformers`, and `accelerate`. The environment was managed with `pixi`.
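The data mixing described above can be reproduced roughly with the `datasets` streaming API. This is a minimal sketch assuming the subset names listed above; the sampling probabilities are illustrative placeholders, not the ratios actually used.

```python
from datasets import interleave_datasets, load_dataset

SUBSETS = ["isl_Latn", "nno_Latn", "swe_Latn", "dan_Latn", "nob_Latn"]

# Stream each language subset so the full corpora never need to fit on disk.
streams = [
    load_dataset("HPLT/HPLT2.0_cleaned", name, split="train", streaming=True)
    for name in SUBSETS
]

# Interleave the streams into one mixed corpus. The probabilities below are
# placeholders for illustration; the exact sampling ratios are not stated here.
mixed = interleave_datasets(
    streams,
    probabilities=[0.05, 0.05, 0.35, 0.25, 0.30],
    seed=42,
)

print(next(iter(mixed))["text"][:200])  # peek at one document from the mix
```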
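The hyperparameter table translates fairly directly into a `transformers` `Trainer` configuration. The sketch below is a hedged reconstruction rather than the exact training script: the output directory name is arbitrary, the masking rate is left at the library default because it is not stated above, and the dataset wiring (using `mixed` from the previous sketch) is only outlined in comments.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "jhu-clsp/mmBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Truncate to the max_seq_length from the table above.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

args = TrainingArguments(
    output_dir="scandmmBERT-base-scandinavian",  # arbitrary name
    learning_rate=2e-5,
    per_device_train_batch_size=2,   # x 2 GPUs x 16 accumulation steps = 64 effective
    gradient_accumulation_steps=16,
    max_steps=50_000,
    optim="adamw_torch",
    bf16=True,
)

# Masking probability left at the library default; the value used is not stated.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=mixed.map(tokenize, batched=True))
# trainer.train()
```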
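## Fine-tuning Example

As noted under Intended Uses & Limitations, this model is meant to be adapted to a downstream task. The following is a minimal sketch for sequence classification; the dataset name and label count are hypothetical placeholders to be replaced with your own task.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels is task-specific; 2 is a placeholder for a binary sentiment task.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# dataset = load_dataset("your-org/your-scandinavian-task")  # hypothetical dataset
# tokenized = dataset.map(tokenize, batched=True)
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="scandmmBERT-finetuned", num_train_epochs=3),
#     train_dataset=tokenized["train"],
#     eval_dataset=tokenized["validation"],
# )
# trainer.train()
```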
## Evaluation

A simple qualitative evaluation using the `fill-mask` pipeline shows strong performance in predicting contextually relevant words in Scandinavian languages.

**Swedish 🇸🇪**

* **Input:** `Sveriges huvudstad heter [MASK].`
* **Top Prediction:** `Stockholm`

**Danish 🇩🇰**

* **Input:** `Dronningen af Danmark hedder [MASK].`
* **Top Prediction:** `Margrethe`

**Norwegian 🇳🇴**

* **Input:** `Norges mest berømte maler er Edvard [MASK].`
* **Top Prediction:** `Munch`

## Citation

If you use this model in your work, please also cite the original `mmBERT` and `HPLT` sources. This model can be cited as:

```bibtex
@misc{scandmmbert2025,
  author       = {Jonas Lind},
  title        = {scandmmBERT: A ModernBERT Specialized for Scandinavian Languages},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/scandmmBERT-base-scandinavian}}
}
```