# Masked Diffusion Language Model - Bimodal Gaussian Schedule

Oral Presentation at BabyLM Workshop @ EMNLP 2025
This model is a Masked Diffusion Language Model (MDLM) trained with a Bimodal Gaussian noise schedule and frequency-informed masking for the BabyLM Challenge 2025.
## Model Details
- Model Type: Masked Diffusion Language Model
- Training Data: BabyLM corpus (100M words, strict track)
- Sequence Length: 512 tokens
- Noise Schedule: Bimodal Gaussian
- Masking Strategy: Frequency-informed with curriculum learning
- Tokenizer: BPE with 16,384 vocabulary size
## Training Approach

This model uses a diffusion-based training objective that combines the following (a code sketch of the schedule and masking follows the list):
- Bimodal Gaussian noise schedule
- Bidirectional context modeling
- Frequency-informed masking (prioritizing rare tokens)
- NELBO weighting with derivative softening (γ = 0.1)
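The sketch below is a minimal, hypothetical illustration of how a bimodal Gaussian masking schedule and frequency-informed masking could fit together; the mixture means, standard deviations, weights, and helper names are assumptions for illustration, not the exact values or code used to train this model.

```python
import torch

def sample_mask_rate(batch_size, means=(0.25, 0.75), stds=(0.1, 0.1), weight=0.5):
    """Sample per-sequence mask rates from a two-component (bimodal) Gaussian mixture.

    The means, stds, and mixture weight here are illustrative placeholders.
    """
    component = torch.bernoulli(torch.full((batch_size,), weight)).long()
    mean = torch.tensor(means)[component]
    std = torch.tensor(stds)[component]
    return torch.normal(mean, std).clamp(0.01, 0.99)

def frequency_informed_mask(token_ids, token_freqs, mask_rate, mask_id):
    """Mask tokens with probability biased toward rare tokens.

    token_ids:   (batch, seq_len) input ids
    token_freqs: (vocab_size,) corpus token frequencies, assumed precomputed
    mask_rate:   (batch,) target masking fraction per sequence
    """
    inv_freq = 1.0 / (token_freqs[token_ids] + 1e-6)               # rarer tokens get larger weight
    probs = inv_freq / inv_freq.sum(dim=-1, keepdim=True)          # normalize weights per sequence
    probs = probs * mask_rate.unsqueeze(-1) * token_ids.size(-1)   # rescale so roughly mask_rate of tokens are masked
    mask = torch.bernoulli(probs.clamp(0.0, 1.0)).bool()
    masked_ids = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
    return masked_ids, mask
```

For example, `frequency_informed_mask(batch, freqs, sample_mask_rate(batch.size(0)), mask_id)` would produce a partially masked batch in which rare tokens are masked more often than frequent ones.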
## Performance
Performance on BabyLM Challenge zero-shot tasks:
| Task | Score |
|---|---|
| BLiMP | 78.2 |
| BLiMP Supplement | 73.6 |
| EWoK | 52.5 |
| COMPS | 56.6 |
| Entity Tracking | 39.7 |
## Usage
```python
from transformers import AutoTokenizer
import torch

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("despoinakk/diffusion_gaussian_babylm")

# Load the model (custom modeling code required)
# See: https://github.com/DespoinaKK/babylm-diffusion
```
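As a minimal, self-contained example of using the tokenizer on its own (the sentence is illustrative; running the diffusion model itself requires the custom code from the repository above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("despoinakk/diffusion_gaussian_babylm")

# Encode a sentence with the 16,384-entry BPE vocabulary; sequences are capped at 512 tokens
encoded = tokenizer(
    "The child is reading a picture book.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
print(encoded["input_ids"].shape)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```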
## Citation
If you use this model, please cite:
TBA
## Links
- Paper: [arXiv:2509.05056](https://arxiv.org/abs/2509.05056)
- Code: [GitHub Repository](https://github.com/DespoinaKK/babylm-diffusion)
- Cosine Schedule Model: [despoinakk/diffusion_cosine_babylm](https://huggingface.co/despoinakk/diffusion_cosine_babylm)
## Contact
- Despoina Kosmopoulou: [email protected]
- Efthymios Georgiou: [email protected]
## Acknowledgments
Based on work from: