Bolmo: Byteifying the Next Generation of Language Models Paper • 2512.15586 • Published Dec 17, 2025 • 17
Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem Paper • 2512.03073 • Published Nov 27, 2025 • 6
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025 • 20
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models Paper • 2405.05417 • Published May 8, 2024 • 1
Retrofitting (Large) Language Models with Dynamic Tokenization Paper • 2411.18553 • Published Nov 27, 2024 • 2
Cross-Tokenizer Distillation via Approximate Likelihood Matching Paper • 2503.20083 • Published Mar 25, 2025 • 1
Post • The folks at Foursquare released a dataset of 104.5 million places of interest (foursquare/fsq-os-places), and here are all of them on a plot.
Post • The Lichess database of games, puzzles, and engine evaluations is now on the Hub: Lichess. Billions of chess data points to download, query, and stream, and we're excited to see what you'll build with it! ♟️ 🤗
- https://huggingface.co/collections/Lichess/positions-datasets-66f50837db5cd3287d60d489
- https://huggingface.co/collections/Lichess/games-datasets-66f508df78f4b43e1bb2d353
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation Paper • 2406.16678 • Published Jun 24, 2024 • 16
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation Paper • 2305.18893 • Published May 30, 2023 • 2
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models Paper • 2305.14214 • Published May 23, 2023