Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets
Abstract
Multilingual data from the web is essential for LLM pretraining. Yet, scraping it is expensive, and research groups repeatedly crawl the same content. For example, we found that over 40% of tokens across major Arabic web corpora are duplicated between sources. In this work, we propose to use this wasteful redundancy as a quality signal to create high-quality pretraining datasets. Our key insight is that cross-source agreement functions as a free, model-free quality filter: content retained by multiple independent pipelines is more likely to represent high-quality text. Crucially, this signal requires no additional computation beyond standard deduplication, which is already performed at scale when pretraining language models. We therefore propose MixMinMatch, a method that combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents independently recovered by multiple sources. We apply MixMinMatch to Arabic, Turkish, and Hindi, producing corpora that match or exceed the quality of the best single-source baselines, while providing up to 4× more unique tokens. On Arabic, our matched subset achieves a 4.5% relative improvement over ArabicWeb24, while on Turkish, we improve over FineWeb-2 by 5.5%. We release the datasets at: https://huggingface.co/collections/AdaMLLab/mixminmatch
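The core mechanism, combining corpora, MinHash-deduplicating across them, and keeping documents that recur in more than one source, can be illustrated in a few lines. The sketch below is not the paper's implementation: it assumes the `datasketch` library, and the shingle size, permutation count, similarity threshold, and the toy `corpora` dictionary are illustrative placeholders.

```python
# Minimal sketch of cross-source agreement via MinHash LSH.
# Assumes the `datasketch` library; all parameters and data below are illustrative.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128      # number of MinHash permutations (assumed value)
THRESHOLD = 0.8     # Jaccard threshold for "near-duplicate" (assumed value)
SHINGLE_SIZE = 5    # word 5-grams (assumed value)

def minhash_signature(text: str) -> MinHash:
    """Compute a MinHash signature over word shingles of one document."""
    words = text.split()
    shingles = {
        " ".join(words[i:i + SHINGLE_SIZE])
        for i in range(max(1, len(words) - SHINGLE_SIZE + 1))
    }
    sig = MinHash(num_perm=NUM_PERM)
    for shingle in shingles:
        sig.update(shingle.encode("utf8"))
    return sig

# Toy stand-in for several independently crawled corpora:
# source name -> list of (doc_id, text).
corpora = {
    "source_a": [("a1", "a short example document about web crawling"),
                 ("a2", "another page only present in source a")],
    "source_b": [("b1", "a short example document about web crawling"),
                 ("b2", "a page unique to source b")],
}

# Index every document from every source in one shared LSH structure.
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
signatures = {}
for source, docs in corpora.items():
    for doc_id, text in docs:
        key = f"{source}/{doc_id}"
        signatures[key] = minhash_signature(text)
        lsh.insert(key, signatures[key])

# A document "agrees" across sources if a near-duplicate exists in another source.
matched = set()
for key, sig in signatures.items():
    source = key.split("/", 1)[0]
    hits = lsh.query(sig)
    if any(hit != key and not hit.startswith(source + "/") for hit in hits):
        matched.add(key)

print(f"{len(matched)} documents retained by more than one source")
```

In this toy run, `a1` and `b1` near-duplicate each other across sources and would be kept in the matched subset, while the source-unique pages would not; the duplicate detection itself is the same MinHash pass that standard deduplication already performs.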