UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Abstract
UniVA is an open-source multi-agent framework that integrates video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act architecture and hierarchical memory.
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
📄 Paper | 🌐 Website & Demo | 💻 Code
🧩 Key Highlights
- End-to-End Unified Video Generalist – a one-stop, omni-capable video creation framework, bridging understanding, reasoning, editing, tracking, and generation in one foundation.
- Agentic Video Creation – Plan–Act dual agents that understand, reason, and create videos interactively.
- Proactive Workflow – UniVA iterates with you like a director: it plans shots, refines scenes, and suggests better stories.
- Deep Memory & Intent Understanding – keeps global and user-level memory for consistent style, lore, and preferences.
- Industrial-Grade Production – an any-conditioned pipeline with cinematic quality, long-form consistency, and cross-modal editing.
- MCP-Based Modular, Extensible Ecosystem – MCP-native, plug-and-play tool servers enable open-ended extension with new tools and models.
- UniVA-Bench – a benchmark for agentic video intelligence across multi-step compositional tasks.
- ✨ Open – UniVA is fully open-source, omni-capable, and ever-extensible.
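The Plan–Act loop described above can be sketched in miniature: a planner decomposes a request into ordered steps, and an executor dispatches each step to a registered tool server while recording results in task memory. This is an illustrative stand-in only; the names (`Planner`, `Executor`, `Step`), the keyword-based planning, and the lambda "tool servers" are hypothetical, whereas UniVA's actual planner is an LLM agent and its tools are MCP servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Step:
    tool: str   # name of the tool server to invoke
    args: dict  # arguments passed to that server

class Planner:
    """Toy planner: maps keywords in the request to ordered steps.
    A real planner agent would use an LLM to interpret intent."""
    def plan(self, request: str) -> List[Step]:
        steps: List[Step] = []
        if "generate" in request:
            steps.append(Step("generation", {"prompt": request}))
        if "edit" in request:
            steps.append(Step("editing", {"instruction": request}))
        if "segment" in request:
            steps.append(Step("segmentation", {"target": "object"}))
        return steps

class Executor:
    """Dispatches each step to a registered tool server and logs
    results in a list standing in for task-level memory."""
    def __init__(self, tools: Dict[str, Callable[[dict], str]]):
        self.tools = tools
        self.memory: List[str] = []

    def run(self, steps: List[Step]) -> List[str]:
        for step in steps:
            self.memory.append(self.tools[step.tool](step.args))
        return self.memory

# Hypothetical tool servers (real ones would be MCP endpoints).
tools = {
    "generation": lambda a: f"generated video for: {a['prompt']}",
    "editing": lambda a: f"edited video per: {a['instruction']}",
    "segmentation": lambda a: f"segmented: {a['target']}",
}
results = Executor(tools).run(Planner().plan("generate a clip, then edit it"))
```

Chaining steps through a shared memory list is what lets later steps (e.g., editing) build on earlier outputs, mirroring the any-conditioned, multi-round workflows described in the abstract.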