Abstract
VibeVoice synthesizes long-form, multi-speaker speech using next-token diffusion and a highly compressed continuous speech tokenizer; the authors report that it surpasses open-source and proprietary dialogue models.
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, compared with the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. As a result, VibeVoice can synthesize speech of up to 90 minutes (within a 64K context window) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
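The figures in the abstract imply a rough token budget, which the following back-of-the-envelope sketch works out. It assumes "64K" means 65,536 context positions and that the whole window is available for speech (in practice some of it would hold text, so this is an upper bound). The function names and the baseline rate of 600 tokens per second are hypothetical placeholders for illustration, not values stated in the abstract.

```python
# Back-of-the-envelope token budget implied by the abstract's numbers.
# Assumption: "64K" means 64 * 1024 context positions, all usable for speech.
CONTEXT_TOKENS = 64 * 1024
MAX_MINUTES = 90
COMPRESSION_FACTOR = 80  # "improves data compression by 80 times"


def max_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on the speech token rate that fits the audio in the window."""
    return context_tokens / (minutes * 60)


def compressed_rate(baseline_rate: float, factor: float = COMPRESSION_FACTOR) -> float:
    """Token rate after compressing a baseline tokenizer's rate by `factor`."""
    return baseline_rate / factor


budget = max_tokens_per_second(CONTEXT_TOKENS, MAX_MINUTES)

# Hypothetical baseline rate, chosen only to illustrate the arithmetic.
HYPOTHETICAL_BASELINE_RATE = 600.0
rate = compressed_rate(HYPOTHETICAL_BASELINE_RATE)

print(f"budget: {budget:.2f} tokens/s")
print(f"compressed rate: {rate:.2f} tokens/s, fits: {rate <= budget}")
```

Under these assumptions, 90 minutes of audio fits in the window only if the tokenizer emits at most about 12 tokens per second, which is why the 80-fold compression matters for long-form synthesis.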
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling (2025)
- DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis (2025)
- Marco-Voice Technical Report (2025)
- Step-Audio 2 Technical Report (2025)
- Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations (2025)
- ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching (2025)
- SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech (2025)
VibeVoice sounds like a real breakthrough in scaling speech synthesis—especially the 80x compression without losing fidelity. The ability to handle multi-speaker, long-form audio up to 90 minutes is impressive and could open new possibilities for interactive podcasts, audiobooks, or even website integrations where natural, dynamic voices enhance user experience. I’m curious to see how this tokenizer approach might influence future multimodal models. What applications do you think will benefit most from this leap?
The world's recognition ... of the Palestinian state!
A Saudi investment in the Al-Aqsa Flood.
Intelligence beats courage.
Models citing this paper: 26
Datasets citing this paper: 0