Abstract
World-in-World evaluates generative world models in closed-loop environments, emphasizing task success over visual quality and revealing insights into controllability, data scaling, and compute allocation.
Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to be used for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, and controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
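The closed-loop protocol described in the abstract can be pictured as a plan-act-observe cycle in which the world model imagines the outcomes of candidate actions before the agent commits to one in the real environment. The sketch below is illustrative only: `WorldModel`-style rollouts, `propose_actions`, `score_rollout`, and the `environment` interface are hypothetical placeholders, not the actual World-in-World action API or planner, which the abstract does not specify.

```python
# Minimal sketch of a closed-loop world-model evaluation cycle.
# All names below are hypothetical placeholders, not the World-in-World API.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    actions: Sequence[str]          # candidate action sequence (standardized action API)
    predicted_frames: List[object]  # frames imagined by the world model
    score: float                    # task-relevant score of the imagined outcome


def closed_loop_episode(
    world_model: Callable[[object, Sequence[str]], List[object]],  # (obs, actions) -> imagined frames
    environment,                    # real simulator with reset()/step()/is_success() (assumed interface)
    propose_actions: Callable[[object], List[Sequence[str]]],      # candidate action sequences
    score_rollout: Callable[[List[object]], float],                # scores an imagined outcome
    max_steps: int = 50,
) -> bool:
    """Run one episode: plan with the world model, act in the environment,
    and report task success, the primary closed-loop metric."""
    obs = environment.reset()
    for _ in range(max_steps):
        # 1) Imagine: roll each candidate action sequence through the world model.
        rollouts = [Rollout(a, world_model(obs, a), 0.0) for a in propose_actions(obs)]
        for r in rollouts:
            r.score = score_rollout(r.predicted_frames)

        # 2) Act: execute only the first action of the best imagined plan (MPC-style).
        best = max(rollouts, key=lambda r: r.score)
        obs, done = environment.step(best.actions[0])

        # 3) Observe: the next planning step is conditioned on the real outcome,
        #    so prediction errors are exposed rather than hidden, unlike open-loop video scoring.
        if done:
            break
    return environment.is_success()
```

Success rate over many such episodes, rather than the visual fidelity of the imagined frames, is what a closed-loop benchmark reports; sampling more candidate plans or longer imagined rollouts per step is one way to spend the extra inference-time compute that the third finding refers to.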
Community
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/world-in-world-world-models-in-a-closed-loop-world
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Ctrl-World: A Controllable Generative World Model for Robot Manipulation (2025)
- A Comprehensive Survey on World Models for Embodied AI (2025)
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- Learning Primitive Embodied World Models: Towards Scalable Robotic Learning (2025)
- Embodied Navigation Foundation Model (2025)
- Planning with Reasoning using Vision Language World Model (2025)
- OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation (2025)
