AraMix
AraMix is a SOTA Arabic pretraining dataset.
AdaMLLab/AraMix • Updated Jan 30 • 394M • 2.74k • 7
Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets • Paper • 2512.18834 • Published Dec 21, 2025 • 4
SmolTulu
A collection of models that use SmolLM2 as the pretrained base together with AllenAI's Tulu 3 post-training pipeline.
SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs • Paper • 2412.08347 • Published Dec 11, 2024 • 4
SultanR/SmolTulu-1.7b-Reinforced • Text Generation • 2B • Updated Dec 17, 2024 • 8 • 5
SultanR/SmolTulu-1.7b-Instruct • Text Generation • 2B • Updated Dec 17, 2024 • 91 • 13
SultanR/SmolTulu-1.7b-RM • Text Classification • 2B • Updated Dec 17, 2024 • 6 • 2
Fineweb-Edu-Ar
The largest (as of 2024) machine-translated Arabic educational corpus.
kaust-generative-ai/fineweb-edu-ar • Updated Nov 12, 2024 • 363M • 260 • 13
Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models • Paper • 2411.06402 • Published Nov 10, 2024 • 2