arxiv:2509.20891

AIBA: Attention-based Instrument Band Alignment for Text-to-Audio Diffusion

Published on Sep 25

Authors:

Junyoung Koh ,

Abstract

AIBA quantifies attention in text-to-audio diffusion models on the time-frequency plane using cross-attention hooks and interpretable metrics, revealing instrument-dependent trends.

AI-generated summary

We present AIBA (Attention-In-Band Alignment), a lightweight, training-free pipeline to quantify where text-to-audio diffusion models attend on the time-frequency (T-F) plane. AIBA (i) hooks cross-attention at inference to record attention probabilities without modifying weights; (ii) projects them to fixed-size mel grids that are directly comparable to audio energy; and (iii) scores agreement with instrument-band ground truth via interpretable metrics (T-F IoU/AP, frequency-profile correlation, and a pointing game). On Slakh2100 with an AudioLDM2 backbone, AIBA reveals consistent instrument-dependent trends (e.g., bass favoring low bands) and achieves high precision with moderate recall.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.20891 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.20891 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.20891 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.