MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Paper • 2509.25531
'an LLM is only as good as the dataset it was trained on' - Sun Tzu