✨ Any-to-Any & World Model: one step closer to the real world
- BAAI Emu 3.5
- Ant Group Ming-flash-omni
- HunyuanWorld-Mirror: 3D
Aligning with the "world model" globally
✨ Audio & Speech + Video & Visual: released from entertainment labs to delivery platforms
- SoulX-Podcast TTS
- LongCat-Audio-Codec & LongCat-Video by the Meituan delivery platform
- xiabs DreamOmni 2
Fine-tuning a 14B model with TRL + SFT on a free Colab (T4 GPU)? Thanks to the latest TRL optimizations, you actually can! Sharing a new notebook showing how to do it, plus a minimal sketch below.
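For reference, here's roughly what such a run looks like as a QLoRA-style SFT sketch with TRL. The model id, dataset, and hyperparameters are illustrative assumptions, not the notebook's exact settings:

# Minimal QLoRA-style SFT sketch with TRL; model id, dataset, and
# hyperparameters are illustrative assumptions, not the notebook's settings.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-14B-Instruct"  # hypothetical 14B base model

# 4-bit quantization so the 14B weights fit in the T4's 16 GB of VRAM
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bf16 support
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    # Train small LoRA adapters instead of the full 14B parameters
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="sft-14b-t4",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,  # trade compute for memory
        fp16=True,
    ),
)
trainer.train()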
Gave a smol 🤗 intro to Agents using smolagents last Monday! Sharing the slides in case you're curious. They serve as a gentle first step into the Agents Course we developed at @huggingface 🫶🫶
✨ 48B total / 3B active - MIT license
✨ Up to 1M context
✨ 84.3 on RULER (128k) with 3.98× speedup
✨ Hybrid KDA + MLA architecture for peak throughput & quality
Sharing the slides from yesterday's talk about "Fine Tuning with TRL" from the @TogetherAgent x @huggingface workshop we hosted in our Paris office!
✨ Built on Ling-Flash-2.0: 100B total / 6.1B active
✨ Generative segmentation-as-editing
✨ SOTA contextual & dialect ASR
✨ High-fidelity image generation
✨ Compresses long sequences visually to bypass token limits (toy sketch below)
✨ Reduces computational and memory costs
✨ Preserves meaning through multimodal encoding
✨ Built on GLM-4.1V-9B-Base
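To make the first bullet concrete, here's a toy illustration of the general idea: render long text into an image so a VLM reads it as a compact set of visual tokens instead of thousands of text tokens. This sketches the concept only, not the model's actual rendering pipeline:

# Toy sketch: render long text as an image a VLM can ingest.
# Illustrative of the visual-compression idea, not the model's real pipeline.
from PIL import Image, ImageDraw

def render_text(text: str, width: int = 1024, line_height: int = 14) -> Image.Image:
    # Naive fixed-width wrapping at 120 characters per line
    lines = [text[i:i + 120] for i in range(0, len(text), 120)]
    img = Image.new("RGB", (width, line_height * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + i * line_height), line, fill="black")
    return img

page = render_text("some very long document " * 200)
page.save("page.png")  # feed this image to the VLM instead of raw text tokens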
✨ Any prior in → 3D world out
✨ Mix camera, intrinsics, depth as priors
✨ Predict point clouds, normals, Gaussians & more in one pass
✨ Unified architecture for all 3D tasks
Finally, our new paper is out! "FineVision: Open Data Is All You Need" 🥳 (2510.17269)
If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box that makes replicating SOTA work impossible. We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources (see the sketch after this list)
🤖 decontaminating against 66 public benchmarks
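For a flavor of what dedup means in practice, here's a toy exact-duplicate filter based on image fingerprints. It only illustrates the general idea; the pipeline described in the paper is more sophisticated than this:

# Toy image dedup: fingerprint each image and keep only first occurrences.
# Illustrative only; not the paper's actual dedup method.
import hashlib
from PIL import Image

def fingerprint(img: Image.Image) -> str:
    # Normalize mode and size so trivially rescaled copies collide too
    small = img.convert("RGB").resize((64, 64))
    return hashlib.sha256(small.tobytes()).hexdigest()

seen = set()

def keep(img: Image.Image) -> bool:
    # True only the first time a given fingerprint is seen
    h = fingerprint(img)
    if h in seen:
        return False
    seen.add(h)
    return True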
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets. NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
To celebrate the paper, I'm also releasing a concatenated and shuffled version of the full dataset: HuggingFaceM4/FineVision_full_shuffled
It's ready to stream, so you can start training your own models right away:
from datasets import load_dataset

# Stream the dataset so nothing needs to be downloaded up front
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)

# Grab the first sample to inspect the schema
print(next(iter(d)))
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!