arxiv:2511.08521

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Published on Nov 11
· Submitted by Hao Fei on Nov 14
Abstract

UniVA is an open-source multi-agent framework that integrates video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act architecture and hierarchical memory.

AI-generated summary

While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
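The Plan-and-Act split and the three memory levels described above are concrete enough to sketch in code. The following is a minimal, hypothetical Python sketch of such a loop; the names (Planner, Executor, HierarchicalMemory, run_workflow) are illustrative assumptions, not UniVA's actual API:

```python
# Hypothetical sketch of a Plan-and-Act dual-agent loop with hierarchical
# memory, in the spirit of the UniVA abstract. All names here are
# illustrative assumptions, not UniVA's real API.
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """The three memory levels named in the abstract."""
    global_knowledge: dict = field(default_factory=dict)   # shared world/tool facts
    task_context: list = field(default_factory=list)       # current workflow state
    user_preferences: dict = field(default_factory=dict)   # style, lore, preferences


class Planner:
    def plan(self, user_request: str, memory: HierarchicalMemory) -> list[dict]:
        # Decompose user intent into structured video-processing steps.
        # A real planner would consult an LLM and the memory; this stub
        # returns a fixed three-step plan for illustration.
        return [
            {"tool": "generate", "args": {"prompt": user_request}},
            {"tool": "segment", "args": {"target": "main_object"}},
            {"tool": "edit", "args": {"instruction": "stylize background"}},
        ]


class Executor:
    def __init__(self, tool_servers: dict):
        # name -> callable standing in for an MCP-style tool server
        self.tool_servers = tool_servers

    def act(self, step: dict, memory: HierarchicalMemory):
        result = self.tool_servers[step["tool"]](**step["args"])
        # Write the result back into task context so later steps
        # (and any re-planning) can see it.
        memory.task_context.append({"step": step, "result": result})
        return result


def run_workflow(request: str, planner: Planner, executor: Executor,
                 memory: HierarchicalMemory) -> list:
    # Plan once, then act step by step over the structured plan.
    for step in planner.plan(request, memory):
        executor.act(step, memory)
    return memory.task_context
```

Because every step's result is appended to the task context, the whole workflow stays traceable, mirroring the full traceability and self-reflective iteration the abstract describes.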

Community

Paper author · Paper submitter

🚀 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
🔗 Paper | 🌐 Website & Demo | 💻 Code

🧩 Key Highlights

  • 🌍 End-to-End Unified Video Generalist β€” An one-stop omni video creation framework, bridging understanding, reasoning, editing, tracking and generation into one foundation.
  • πŸ€– Agentic Video Creation β€” Plan–Act dual agents that understand, reason, and create videos interactively.
  • 🎬 Proactive Workflow β€” UniVA iterates with you like a director: plans shots, refines scenes, and suggests better stories.
  • 🧠 Deep Memory & Intent Understanding β€” Keeps global + user memory for style, lore, and preferences consistency.
  • 🏭 Industrial-Grade Production β€” Any-conditioned pipeline with cinematic quality, long-form consistency, and cross-modal editing.
  • βš™ MCP-Based Modular Extensible Ecosystem β€” MCP-native, modular plug-and-play tool servers, enabling infinite expansion with new tools and models.
  • 🧾 UniVA-Bench β€” A benchmark for agentic video intelligence across multi-step compositional tasks.
  • ✨ Open β€” UniVA is fully open-source, omni-capable, and ever-extensible.
