OmniVoice 🌍
OmniVoice is a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel discrete non-autoregressive (NAR) architecture in the style of diffusion language models, which maps text directly to multi-codebook acoustic tokens.
By leveraging a 581k-hour multilingual dataset and initialization from a pre-trained LLM, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks.
Paper | GitHub | Demo | Project Page
Key Features
- 600+ Languages Supported: The broadest language coverage among zero-shot TTS models.
- Voice Cloning: State-of-the-art voice cloning quality from short reference audio.
- Voice Design: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- Fast Inference: RTF as low as 0.025, i.e. about 40x faster than real time (see the timing sketch after this list).
- Diffusion Language Model Architecture: A clean, streamlined, and scalable design that delivers both quality and speed.
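For reference, the real-time factor (RTF) is the time spent generating divided by the duration of the audio produced. Below is a minimal timing sketch, assuming the from_pretrained/generate API and the 24 kHz output rate shown in the Python API examples further down; actual numbers depend on hardware, text length, and batching.

# Minimal RTF measurement sketch. The loading/generation calls and the 24 kHz
# output rate are taken from the Python API examples below.
import time

import torch
from omnivoice import OmniVoice

model = OmniVoice.from_pretrained("k2-fsa/OmniVoice", device_map="cuda:0", dtype=torch.float16)

start = time.perf_counter()
audio = model.generate(
    text="A short sentence used only to time the synthesis speed.",
    instruct="female, low pitch, british accent",
)
elapsed = time.perf_counter() - start

duration_s = audio[0].shape[-1] / 24000  # generated speech length in seconds
print(f"RTF = {elapsed / duration_s:.3f}")  # 0.025 corresponds to ~40x real time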
Installation
pip
# Install PyTorch and Torchaudio first (refer to official site for CUDA versions)
pip install torch torchaudio
# Install OmniVoice
pip install omnivoice
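To verify the installation, a quick import check can be run; the omnivoice module name is taken from the import used in the examples below.

# Quick sanity check after installation: the module name matches the pip
# package above, and CUDA availability is reported by PyTorch.
import torch
import torchaudio
import omnivoice

print(torch.__version__, torchaudio.__version__, torch.cuda.is_available())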
Python API
Voice Cloning
Clone a voice from a short reference audio clip by providing ref_audio and ref_text:
from omnivoice import OmniVoice
import torch
import torchaudio
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # audio is a list of `torch.Tensor` with shape (1, T) at 24 kHz.
torchaudio.save("out.wav", audio[0], 24000)
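If the reference clip is long, multi-channel, or recorded at a different sample rate, it can be normalized up front. The following is an optional sketch using only torchaudio; whether OmniVoice resamples references internally is not stated here, and passing a tensor directly to ref_audio is not documented, so the sketch writes the cleaned clip back to a wav file.

# Optional reference preprocessing sketch: downmix to mono, resample to 24 kHz
# (the output rate documented above; internal handling of other rates is an
# assumption), and keep the first 10 seconds before passing the file path as ref_audio.
import torchaudio
import torchaudio.functional as AF

wav, sr = torchaudio.load("ref_long.wav")        # hypothetical input file
wav = wav.mean(dim=0, keepdim=True)              # (1, T) mono
wav = AF.resample(wav, orig_freq=sr, new_freq=24000)
torchaudio.save("ref.wav", wav[:, : 10 * 24000], 24000)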
Voice Design
Describe the desired voice with speaker attributes; no reference audio is needed. Supported attributes include gender (male/female), age (child to elderly), pitch (very low to very high), style (e.g., whisper), English accent (American, British, etc.), and Chinese dialect (e.g., Sichuanese 四川话, Shaanxi dialect 陕西话).
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
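The return value has the same format as in the cloning example, so it is saved the same way. Below is a second sketch using one of the Chinese dialect attributes listed above; how freely the attribute strings combine is an assumption.

# Hedged example: comma-separated attributes including a Chinese dialect value
# from the list above; the exact attribute grammar is assumed, not documented here.
audio = model.generate(
    text="今天天气真好，我们出去散散步吧。",  # "The weather is great today, let's go for a walk."
    instruct="male, elderly, 四川话",
)
torchaudio.save("design_out.wav", audio[0], 24000)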
Expressive Control
OmniVoice supports inline non-verbal symbols within the input text:
audio = model.generate(text="[laughter] You really got me. I didn't see that coming at all.")
Supported tags: [laughter], [confirmation-en], [question-en], [surprise-ah], [sniff], [sigh], and more.
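Whether these tags can be combined with ref_audio or instruct is not stated here; the sketch below assumes they compose with the voice design interface.

# Hedged sketch: non-verbal tags inside the text of an instructed generation.
# Tag-plus-instruct composition is an assumption; plain tagged text is what is
# documented above.
audio = model.generate(
    text="[sigh] It has been a very long day. [laughter] But we finally finished.",
    instruct="female, low pitch, british accent",
)
torchaudio.save("expressive_out.wav", audio[0], 24000)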
Citation
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}