Article We’re open-sourcing our text-to-image model and the process behind it 3 days ago • 32
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published 30 days ago • 65
Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction Paper • 2509.20410 • Published Sep 24 • 2
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation Paper • 2510.02283 • Published Oct 2 • 92
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing Paper • 2509.26346 • Published Sep 30 • 18
Kimi-K2 Collection Moonshot's MoE LLMs with 1 trillion parameters, exceptional on agentic intelligence • 5 items • Updated about 22 hours ago • 144
MoDA: Multi-modal Diffusion Architecture for Talking Head Generation Paper • 2507.03256 • Published Jul 4 • 2
FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation Paper • 2508.11255 • Published Aug 15 • 11