arxiv:2607.00446

VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

Published on Jul 1

· Submitted by

Authors:

Abstract

VideoSearch-R1 is an agentic framework that iteratively retrieves videos and refines search queries using continuous latent space refinement and policy optimization for improved video moment retrieval and temporal grounding.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning) within the retrieved content, such as temporal grounding. However, existing approaches typically treat retrieval as a preprocessing step, and consequently, when the initial retrieval fails, there is no mechanism to refine the search, leading to the failure of subsequent fine-grained intra-video reasoning. Moreover, while recent agentic frameworks have advanced video understanding, they typically assume that the query-relevant video is already given, focusing exclusively on intra-video reasoning tasks. To address these limitations, we propose VideoSearch-R1, an agentic framework for iterative video retrieval and reasoning through multi-turn interaction with a video search engine. Specifically, we introduce Soft Query Refinement (SQR) to refine search query tokens in a continuous latent space rather than rewriting queries in the discrete text space, enabling more efficient and fine-grained adjustments. SQR and its reasoning process are trained using Group Relative Policy Optimization (GRPO), guided by task-level reward signals derived from retrieval and downstream tasks. Building upon this, VideoSearch-R1 achieves state-of-the-art performance across three datasets on Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora, refining search queries, and performing precise query-conditioned temporal grounding within the retrieved content. Our analyses show that SQR effectively refines the original query, requiring significantly fewer generated tokens than explicit text-level query refinement. Code and model checkpoints are publicly available at mlvlab.github.io/VideoSearch-R1.

View arXiv page View PDF Project page GitHub 21 Add to collection

Community

happy8825

Paper submitter 1 day ago

VideoSearch-R1 is an agentic framework that unifies inter-video retrieval and intra-video reasoning through multi-turn interaction with a video search engine. We introduce Soft Query Refinement (SQR), which refines query tokens in a continuous latent space instead of rewriting text, and train it with GRPO. VideoSearch-R1 reaches state-of-the-art Video Corpus Moment Retrieval (VCMR) on three benchmarks while using far fewer generated tokens than text-level refinement.