arXiv:2510.19592

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Published on Oct 22
· Submitted by Hyun, Jeongseok on Oct 23

Abstract

AI-generated summary: Decomposed Attention Fusion (DecAF) enhances video object segmentation by refining attention maps from multimodal large language models without retraining.

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to the visual tokens relevant to a textual query. To adapt this capability to localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via the attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms existing training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
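The two fusion mechanisms can be pictured with a minimal NumPy sketch. Note that the function names and the exact formulations (subtracting a background-query map, then reweighting per-frame maps by a temporally pooled video-level map) are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def contrastive_object_background_fusion(attn_obj, attn_bg):
    """Suppress background activations by subtracting an attention map
    obtained for a background (negative) query from the object query's
    map. Hypothetical formulation; the paper's operation may differ."""
    return np.clip(attn_obj - attn_bg, 0.0, None)

def complementary_video_frame_fusion(attn_video, attn_frames):
    """Reweight per-frame maps (T, H, W) by a video-level map (H, W)
    so each frame keeps globally consistent object cues."""
    return attn_frames * attn_video[None, :, :]

# Toy example: 2 frames of 4x4 attention maps.
rng = np.random.default_rng(0)
attn_obj = rng.random((2, 4, 4))   # attention for the object query
attn_bg = rng.random((2, 4, 4))    # attention for a background query
fused = contrastive_object_background_fusion(attn_obj, attn_bg)
video_level = fused.mean(axis=0)   # temporally pooled video-level map
final = complementary_video_frame_fusion(video_level, fused)
masks = final > final.mean()       # coarse binary masks per frame
```

The thresholding at the end stands in for whatever binarization the paper applies when converting refined attention into coarse masks.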

Community

Paper author · Paper submitter

We introduce DecAF (Decomposed Attention Fusion), a training-free framework that converts MLLMs' attention maps into video segmentation masks. DecAF refines noisy attention through two fusion mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion.
Additionally, with our attention-guided SAM2 prompting strategy, DecAF obtains fine-grained masks and achieves performance comparable to training-based methods, all without retraining.
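One way to picture attention-guided SAM2 prompting is to take the peaks of the refined attention map as foreground point prompts. The sketch below assumes simple top-k peak selection (the actual point-selection strategy in DecAF may differ), and the SAM2 call itself is omitted:

```python
import numpy as np

def attention_to_point_prompts(attn, top_k=3):
    """Pick the top-k peaks of a (H, W) attention map as (x, y) point
    prompts for a promptable segmenter such as SAM2. Illustrative only:
    DecAF's actual prompting strategy may select points differently."""
    h, w = attn.shape
    flat_idx = np.argsort(attn.ravel())[::-1][:top_k]  # strongest first
    ys, xs = np.unravel_index(flat_idx, (h, w))
    points = np.stack([xs, ys], axis=1)      # SAM-style (x, y) order
    labels = np.ones(top_k, dtype=np.int64)  # 1 marks a foreground point
    return points, labels

# Two synthetic peaks in an 8x8 map.
attn = np.zeros((8, 8))
attn[2, 3] = 0.9
attn[5, 6] = 0.7
points, labels = attention_to_point_prompts(attn, top_k=2)
# These (points, labels) would then be passed to a SAM2 predictor as
# point prompts on the corresponding frame to obtain fine-grained masks.
```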

(Figure: DecAF teaser)


