OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts Paper • 2503.22952 • Published Mar 29 • 17
DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness Paper • 2503.22677 • Published Mar 28 • 5
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Paper • 2504.00999 • Published Apr 1 • 93
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization Paper • 2503.19901 • Published Mar 25 • 41
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy Paper • 2503.24388 • Published Mar 31 • 30
Unicorn: Text-Only Data Synthesis for Vision Language Model Training Paper • 2503.22655 • Published Mar 28 • 39
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 Paper • 2503.24376 • Published Mar 31 • 38