RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System Paper • 2602.02488 • Published Feb 2 • 36
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? Paper • 2603.03241 • Published 23 days ago • 86
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation Paper • 2601.22153 • Published Jan 29 • 74
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision Paper • 2601.19798 • Published Jan 27 • 42
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding Paper • 2512.19693 • Published Dec 22, 2025 • 67
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Paper • 2511.20785 • Published Nov 25, 2025 • 188
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published Nov 20, 2025 • 94
Scaling Spatial Intelligence with Multimodal Foundation Models Paper • 2511.13719 • Published Nov 17, 2025 • 48
LLaVA-Video Collection • Models focused on video understanding (previously known as LLaVA-NeXT-Video) • 8 items • Updated Feb 21, 2025 • 64
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model Paper • 2509.00676 • Published Aug 31, 2025 • 85
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion Paper • 2509.01215 • Published Sep 1, 2025 • 51
SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search Paper • 2507.15245 • Published Jul 21, 2025 • 11
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding Paper • 2507.15028 • Published Jul 20, 2025 • 21