ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Abstract
ViSAudio, an end-to-end framework using conditional flow matching, generates high-quality binaural audio from silent video, providing spatial immersion and consistency across various acoustic conditions.
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
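The abstract describes the method only at a high level (conditional flow matching with a dual-branch architecture over the two audio channels). As a rough illustration of what a conditional flow matching training objective of this kind looks like, below is a minimal sketch of video-conditioned flow matching over two-channel audio latents. All module names, tensor shapes, and the pooled video embedding are hypothetical stand-ins and are not taken from the ViSAudio implementation.

```python
# Minimal conditional flow matching training-step sketch for video-conditioned
# binaural latent generation. Illustrative only: DualBranchVelocityNet, the
# latent/video dimensions, and the pooled video embedding are hypothetical and
# do not correspond to the released ViSAudio code.
import torch
import torch.nn as nn

class DualBranchVelocityNet(nn.Module):
    """Predicts the flow velocity for left/right audio latents with two output
    branches on top of a shared video-conditioned backbone (a stand-in for the
    paper's dual-branch architecture with a conditional spacetime module)."""
    def __init__(self, latent_dim=64, video_dim=512, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(video_dim, hidden)
        self.shared = nn.Sequential(
            nn.Linear(2 * latent_dim + hidden + 1, hidden), nn.SiLU()
        )
        self.left_head = nn.Linear(hidden, latent_dim)
        self.right_head = nn.Linear(hidden, latent_dim)

    def forward(self, x_t, t, video_feats):
        # x_t: (B, 2, latent_dim) noisy binaural latents; t: (B, 1); video_feats: (B, video_dim)
        cond = self.cond_proj(video_feats)
        h = self.shared(torch.cat([x_t.flatten(1), cond, t], dim=-1))
        return torch.stack([self.left_head(h), self.right_head(h)], dim=1)

def flow_matching_step(model, x1, video_feats):
    """One conditional flow matching step: sample t, interpolate between noise x0
    and data x1 along a straight path, and regress the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                                  # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)            # per-example time
    x_t = (1 - t.unsqueeze(-1)) * x0 + t.unsqueeze(-1) * x1    # linear probability path
    target_v = x1 - x0                                         # straight-path velocity
    pred_v = model(x_t, t, video_feats)
    return ((pred_v - target_v) ** 2).mean()                   # flow matching loss

# Usage: binaural (left/right) audio latents conditioned on a pooled video feature.
model = DualBranchVelocityNet()
x1 = torch.randn(4, 2, 64)          # (batch, channels, latent_dim) target audio latents
video_feats = torch.randn(4, 512)   # pooled per-clip video embedding (hypothetical)
loss = flow_matching_step(model, x1, video_feats)
loss.backward()
```

At inference time, such a model would start from Gaussian noise and integrate the predicted velocity field from t = 0 to t = 1 with an ODE solver, conditioned on the video features, to obtain the binaural audio latents.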
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- StereoSync: Spatially-Aware Stereo Audio Generation from Video (2025)
- ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation (2025)
- Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation (2025)
- 3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation (2025)
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions (2025)
- Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video (2025)
- Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information (2025)