UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Abstract
UniVA is an open-source multi-agent framework that integrates video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act architecture and hierarchical memory.
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
📄 Paper | 🌐 Website & Demo | 💻 Code
🧩 Key Highlights
- End-to-End Unified Video Generalist – a one-stop, omni-capable video creation framework, bridging understanding, reasoning, editing, tracking, and generation in one foundation.
- Agentic Video Creation – Plan–Act dual agents that understand, reason, and create videos interactively.
- Proactive Workflow – UniVA iterates with you like a director: it plans shots, refines scenes, and suggests better stories.
- Deep Memory & Intent Understanding – keeps global and user-level memory for consistent style, lore, and preferences.
- Industrial-Grade Production – an any-conditioned pipeline with cinematic quality, long-form consistency, and cross-modal editing.
- MCP-Based Modular, Extensible Ecosystem – MCP-native, plug-and-play tool servers enable open-ended extension with new tools and models.
- UniVA-Bench – a benchmark for agentic video intelligence across multi-step compositional tasks.
- ✨ Open – UniVA is fully open-source, omni-capable, and ever-extensible.
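The Plan–Act loop described above can be sketched in miniature: a planner decomposes a request into ordered steps, and an executor dispatches each step to a registered tool server while recording results in task memory. This is an illustrative stand-in only; the names (`Planner`, `Executor`, `Step`), the keyword-based planning, and the lambda "tool servers" are hypothetical, whereas UniVA's actual planner is an LLM agent and its tools are MCP servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Step:
    tool: str   # name of the tool server to invoke
    args: dict  # arguments passed to that server

class Planner:
    """Toy planner: maps keywords in the request to ordered steps.
    A real planner agent would use an LLM to interpret intent."""
    def plan(self, request: str) -> List[Step]:
        steps: List[Step] = []
        if "generate" in request:
            steps.append(Step("generation", {"prompt": request}))
        if "edit" in request:
            steps.append(Step("editing", {"instruction": request}))
        if "segment" in request:
            steps.append(Step("segmentation", {"target": "object"}))
        return steps

class Executor:
    """Dispatches each step to a registered tool server and logs
    results in a list standing in for task-level memory."""
    def __init__(self, tools: Dict[str, Callable[[dict], str]]):
        self.tools = tools
        self.memory: List[str] = []

    def run(self, steps: List[Step]) -> List[str]:
        for step in steps:
            self.memory.append(self.tools[step.tool](step.args))
        return self.memory

# Hypothetical tool servers (real ones would be MCP endpoints).
tools = {
    "generation": lambda a: f"generated video for: {a['prompt']}",
    "editing": lambda a: f"edited video per: {a['instruction']}",
    "segmentation": lambda a: f"segmented: {a['target']}",
}
results = Executor(tools).run(Planner().plan("generate a clip, then edit it"))
```

Chaining steps through a shared memory list is what lets later steps (e.g., editing) build on earlier outputs, mirroring the any-conditioned, multi-round workflows described in the abstract.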