codelion posted an update 18 days ago
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

Through 50+ systematic experiments on dataset mixing strategies, we trained a GPT-2-style model to over 90% of GPT-2's benchmark performance using just 1/10th of the training data.

Key Finding:

A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
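For concreteness, here is a minimal sketch of that static mix using the `datasets` library's `interleave_datasets`. The dataset repo ids and the `"text"` field are placeholders, not the exact sample datasets we used; point them at the collection linked below.

```python
# Minimal sketch of the 50/30/20 static mix with Hugging Face `datasets`.
# The repo ids below are placeholders -- swap in the sample datasets from
# the collection linked at the end of this post.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("HuggingFaceFW/finepdfs", split="train", streaming=True)
dclm = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Static mixing: every example is drawn with the same fixed source probabilities
# for the whole run, so there is no curriculum schedule and no hard
# distribution shift mid-training.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:80])
```

The same fixed probabilities apply from the first token to the last, which is what "static" means here in contrast to curriculum schedules that change the mix over time.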

Results:

Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).

The takeaway: careful dataset curation matters more than total data volume.

Model: codelion/gpt-2-70m

Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples

Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing
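To try the checkpoint, a minimal loading sketch with transformers, assuming the repo is a standard GPT-2-style causal LM (the prompt is just an example):

```python
# Quick sketch: load the released checkpoint and generate a short continuation.
# Assumes codelion/gpt-2-70m is a standard transformers-compatible causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codelion/gpt-2-70m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Careful dataset curation matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```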