arxiv:2510.21817

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Published on Oct 21 · Submitted by Yi-Fan Zhang on Oct 28
Abstract

AI-generated summary: VITA-E, a dual-model embodied interaction framework, enables concurrent and interruptible vision-language-action capabilities, enhancing real-time user interaction and multitasking.

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a "model-as-controller" paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
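
To make the dual-model idea concrete, here is a minimal sketch of an Active/Standby handover loop under assumed interfaces. The names (VLAInstance, step, the <interrupt> control token, a queue-fed speech channel) are hypothetical placeholders, not the authors' actual implementation; in the real system both instances run in parallel and share perception, and this loop only illustrates the role-swap logic.

```python
# Rough illustration of the Active/Standby dual-model idea from the abstract.
# Everything here (VLAInstance, step(), the "<interrupt>" token, queue-based
# speech input) is an assumed, simplified interface, not the paper's code.
import queue


class VLAInstance:
    """Placeholder for one VLA model instance."""

    def __init__(self, name):
        self.name = name

    def step(self, observation, instruction):
        # A real instance would emit action chunks, speech, and special
        # control tokens; this stub just returns an empty decision.
        return {"action": None, "speech": None, "control": None}


class DualModelController:
    """One instance acts on the current task (Active) while the other
    listens for new user speech (Standby); an interrupt swaps their roles."""

    def __init__(self):
        self.active = VLAInstance("A")
        self.standby = VLAInstance("B")
        self.user_speech = queue.Queue()  # filled by an ASR front end

    def run(self, observations, instruction):
        for obs in observations:
            self.active.step(obs, instruction)  # keep acting / speaking

            try:
                utterance = self.user_speech.get_nowait()
            except queue.Empty:
                continue  # no new speech; keep executing the current task

            decision = self.standby.step(obs, utterance)
            if decision["control"] == "<interrupt>":
                # Handover: the Standby model takes over with the new task.
                self.active, self.standby = self.standby, self.active
                instruction = utterance
```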

Community

We introduce VITA-E, a natural human-robot interaction framework with the ability to observe, listen, speak, and act simultaneously.

🧠 Dual-Model Architecture – Inspired by the brain's hemispheres: one Active model executes the current task while a Standby model monitors for new user instructions.
🎯 Model-as-Controller Paradigm – A fine-tuned VLM generates special tokens that act as direct system-level commands for precise, instant control (a rough sketch follows this list).
🗣️ Seamless Interaction – Answer questions mid-task, interrupt actions with voice commands, and transition naturally, all in near real-time bidirectional dialogue (bilingual: EN/CN).
✨ Real-World Validated – Tested on physical humanoid robots with competitive performance across interaction and manipulation benchmarks; compatible with mainstream dual-system VLA models.
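
To illustrate the model-as-controller bullet above: special tokens emitted by the fine-tuned VLM are intercepted and routed to the robot system as commands instead of being treated as ordinary output. The token names and the system interface below are assumptions for illustration only, not the tokens used in VITA-E.

```python
# Sketch of mapping special VLM output tokens to system-level commands.
# The token names and the `system` interface are invented for illustration.
SPECIAL_TOKENS = {
    "<emergency_stop>": "stop",        # halt the robot's current motion
    "<interrupt_speech>": "barge_in",  # cut off the ongoing verbal response
    "<begin_action>": "act",           # start executing a manipulation task
}


def dispatch(token_stream, system):
    """Route special tokens to the robot system; yield everything else
    as ordinary speech/action output."""
    for tok in token_stream:
        command = SPECIAL_TOKENS.get(tok)
        if command == "stop":
            system.halt_motion()
        elif command == "barge_in":
            system.stop_speaking()
        elif command == "act":
            system.begin_task()
        else:
            yield tok
```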

🔗 Project Page: https://lxysl.github.io/VITA-E/


See More of Our Research

VITA-VLA: Distills Action Expert capabilities into a VLM to improve training efficiency and model capability.

VITA-1.5: An open-source omni model with near real-time video-to-speech capability.
