Diffusion Transformers with Representation Autoencoders Paper • 2510.11690 • Published 19 days ago • 160
PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models Paper • 2507.17220 • Published Jul 23 • 1
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models Paper • 2507.23682 • Published Jul 31 • 23
Learning Getting-Up Policies for Real-World Humanoid Robots Paper • 2502.12152 • Published Feb 17 • 42
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Paper • 2502.14786 • Published Feb 20 • 153
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Paper • 2502.13063 • Published Feb 18 • 72
IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI Paper • 2411.00785 • Published Oct 17, 2024 • 8
Distributional Reinforcement Learning for Multi-Dimensional Reward Functions Paper • 2110.13578 • Published Oct 26, 2021 • 1
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation Paper • 2410.05363 • Published Oct 7, 2024 • 45
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting Paper • 2410.17856 • Published Oct 23, 2024 • 51
SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Paper • 2403.13064 • Published Mar 19, 2024 • 31
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data Paper • 2401.10891 • Published Jan 19, 2024 • 62
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 29
LLM in a flash: Efficient Large Language Model Inference with Limited Memory Paper • 2312.11514 • Published Dec 12, 2023 • 260
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts Paper • 2312.10763 • Published Dec 17, 2023 • 19
Holodeck: Language Guided Generation of 3D Embodied AI Environments Paper • 2312.09067 • Published Dec 14, 2023 • 16