WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Abstract
WorldMM, a novel multimodal memory agent with episodic, semantic, and visual memory, outperforms existing methods in long video question-answering by adaptively retrieving from multiple temporal scales and memory sources.
Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they rely heavily on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address these limitations, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods and demonstrating its effectiveness in long video reasoning.
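To make the retrieval process concrete, the sketch below illustrates one way the loop described above could be organized: three memory stores (episodic, semantic, and visual) plus a controller that repeatedly selects the most relevant entry until it judges the gathered evidence sufficient. All class names, fields, and the keyword-overlap scoring heuristic here are illustrative assumptions for this sketch only; the paper's actual memory construction and retrieval policy are not detailed on this page.

```python
"""Minimal, illustrative sketch of a WorldMM-style multimodal memory agent.
Everything below is a hypothetical toy implementation, not the authors' code."""
from dataclasses import dataclass, field


@dataclass
class EpisodicEntry:
    start_s: float   # segment start time (seconds)
    end_s: float     # segment end time (seconds)
    scale: str       # temporal granularity, e.g. "clip", "scene", "hour"
    summary: str     # textual description of the factual event


@dataclass
class VisualEntry:
    timestamp_s: float  # when the keyframe was sampled
    caption: str        # stand-in for stored keyframes / visual features


@dataclass
class MemoryBank:
    episodic: list[EpisodicEntry] = field(default_factory=list)  # multi-scale event index
    semantic: list[str] = field(default_factory=list)            # evolving high-level knowledge
    visual: list[VisualEntry] = field(default_factory=list)      # detailed scene evidence


def relevance(query: str, text: str) -> float:
    """Toy keyword-overlap score; a real system would use learned retrievers."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)


def adaptive_retrieve(query: str, bank: MemoryBank,
                      max_steps: int = 3, threshold: float = 0.5) -> list[str]:
    """Iteratively pull evidence from whichever memory source matches best,
    stopping once the agent deems the evidence sufficient (here: a score threshold)."""
    candidates = (
        [f"[{e.scale} {e.start_s:.0f}-{e.end_s:.0f}s] {e.summary}" for e in bank.episodic]
        + [f"[semantic] {s}" for s in bank.semantic]
        + [f"[frame @{v.timestamp_s:.0f}s] {v.caption}" for v in bank.visual]
    )
    gathered: list[str] = []
    for _ in range(max_steps):
        if not candidates:
            break
        best = max(candidates, key=lambda c: relevance(query, c))  # pick best source/entry
        candidates.remove(best)
        gathered.append(best)
        if relevance(query, best) >= threshold:  # stand-in for "sufficient information"
            break
    return gathered


if __name__ == "__main__":
    bank = MemoryBank(
        episodic=[EpisodicEntry(0, 300, "scene", "a chef prepares dough in the kitchen")],
        semantic=["the video documents a full day in a bakery"],
        visual=[VisualEntry(120, "close-up of hands kneading dough on a floured table")],
    )
    print(adaptive_retrieve("what is the chef doing in the kitchen", bank))
```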
Community
We present WorldMM, a novel dynamic multimodal memory agent designed for long video reasoning. It constructs multimodal, multi-scale memories that capture both textual and visual information, and employs reasoning-guided adaptive retrieval across these memories.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory (2025)
- LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning (2025)
- Agentic Learner with Grow-and-Refine Multimodal Semantic Memory (2025)
- CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation (2025)
- VideoLucy: Deep Memory Backtracking for Long Video Understanding (2025)
- Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs (2025)
- MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning (2025)