UltraVoice Logo

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

arXiv | Project Page | GitHub | HuggingFace Dataset | License


📝 Abstract

Spoken dialogue models currently lack fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities such as reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech style controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis.

🎯 Overview

UltraVoice Dataset Overview

Overview of the UltraVoice Dataset Construction and Stylistic Coverage. The figure illustrates the complete pipeline and capabilities of UltraVoice: (1) The upper left section presents our four-step construction process: text corpus curation, style injection & response generation, stylized speech synthesis, and quality control & filtering. (2) The ring chart on the right visualizes the dataset's hierarchical control structure, with six main control dimensions in the inner ring (Emotion, Speed, Volume, Accent, Language, Composite) and their finer-grained sub-dimensions in the outer ring. (3) The lower panel showcases representative examples from each speech style dimension, demonstrating UltraVoice's rich stylistic coverage and multi-dimensional controllability, including emotion (e.g., angry, happy), speed (e.g., fast, slow), volume (e.g., high, low), language (e.g., Chinese, Japanese, Korean), accent (e.g., AU, CA, GB, IN, SG, ZA), and composite styles that combine multiple control attributes.
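
To make the record structure concrete, below is a minimal sketch of what one UltraVoice-style training sample could look like after the four construction steps above. The field names and values are illustrative assumptions for prototyping, not the released schema.

```python
# Illustrative UltraVoice-style record (field names are assumptions, not the released schema).
example_record = {
    "instruction": "Answer the question in a cheerful, happy tone.",  # style injection (step 2)
    "user_text": "What's a good way to start the morning?",
    "response_text": "Oh, I love this question! A short walk in the sunshine works wonders.",
    "style": {
        "dimension": "emotion",  # one of: emotion, speed, volume, accent, language, composite
        "value": "happy",
    },
    "response_audio": "responses/000001.wav",  # stylized speech synthesized in step 3
    "passed_quality_control": True,            # step 4: e.g. ASR round-trip and naturalness checks
}

def is_composite(record: dict) -> bool:
    """Composite samples combine multiple control attributes (e.g. happy + fast)."""
    return record["style"]["dimension"] == "composite"

print(is_composite(example_record))  # False
```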


🤖 Available Models

This repository contains four fine-tuned models based on the SLAM-Omni and VocalNet architectures:

| Model Name | Speech Encoder | LLM Backbone | Model Size | Speech Decoder |
|---|---|---|---|---|
| SLAM-Omni-0.5B | Whisper-small-v3 | Qwen2 | 0.5B | CosyVoice1 |
| VocalNet-1B | Whisper-large-v3 | LLaMA3.2 | 1B | CosyVoice2 |
| VocalNet-7B | Whisper-large-v3 | Qwen2.5 | 7B | CosyVoice2 |
| VocalNet-8B | Whisper-large-v3 | LLaMA3.1 | 8B | CosyVoice2 |

All model checkpoints are available in the repository directories.

For training and inference, please refer to the resources available at SLAM-Omni and VocalNet.
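
To fetch a single checkpoint directory rather than the whole repository, a sketch along the following lines should work with huggingface_hub; the repo id, directory pattern, and local path below are placeholders you will need to replace with the actual values from this repository.

```python
# Hedged sketch: download one fine-tuned checkpoint directory from the Hub.
# "your-org/UltraVoice-SFT" and "VocalNet-7B/*" are placeholder names (assumptions).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/UltraVoice-SFT",  # replace with the actual model repo id
    allow_patterns=["VocalNet-7B/*"],   # restrict the download to one checkpoint directory
    local_dir="./checkpoints",
)
print("Checkpoint files downloaded to:", local_path)
```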


📊 Performance

1. Performance on Fine-Grained Speech Style Control

We evaluated our fine-tuned models on the UltraVoice internal test sets using two key metrics: Instruction Following Rate (IFR), which measures instruction compliance, and Mean Opinion Score (MOS), which assesses subjective naturalness.
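
For readers reproducing the evaluation, the aggregation behind both metrics is simple averaging; the sketch below assumes you already have per-sample judgments (a boolean compliance flag for IFR and a 1-5 naturalness rating for MOS) from an automatic or human judge, which is outside the scope of this snippet.

```python
# Minimal metric-aggregation sketch (judgment collection is assumed to happen elsewhere).

def instruction_following_rate(followed: list[bool]) -> float:
    """IFR: percentage of responses judged to comply with the style instruction."""
    return 100.0 * sum(followed) / len(followed)

def mean_opinion_score(ratings: list[float]) -> float:
    """MOS: average naturalness rating on a 1-5 scale."""
    return sum(ratings) / len(ratings)

print(instruction_following_rate([True, True, False, True]))  # 75.0
print(mean_opinion_score([4.0, 3.5, 4.5]))                    # 4.0
```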

IFR Results

IFR (%) results across six fine-grained speech style control dimensions for each model. Each radar chart contrasts the base model (Blue) and its SFT variant (Red), with GPT-4o (Gray) used as an upper-bound reference. Fine-tuning with UltraVoice significantly boosts instruction-following capability, with IFR gains ranging from 14.61 to 40.09 percentage points. This improvement is particularly pronounced for smaller models with weaker baseline performance. For instance, the IFR of SLAM-Omni-0.5B surged from 28.30% to 68.39%, while VocalNet-1B's score increased from 36.28% to 55.91%.

MOS Results

MOS results across six fine-grained speech style control dimensions for each model. The third row of each group shows the relative gain (%) achieved by SFT. All models exhibit significant improvements in MOS after being fine-tuned with UltraVoice. The relative gains range from 29.12% to 42.33%, with the Emotion and Accent dimensions showing particularly remarkable improvements. For instance, the overall MOS for VocalNet-7B increased from 2.73 to 3.59, while VocalNet-8B's score rose from 2.85 to 3.68. These results indicate that our fine-tuning process enhances the models' ability to render the specified styles with high naturalness, demonstrating that improved instruction control does not come at the cost of audio quality.
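
As a quick sanity check, the relative gain reported in the third row of each group is the percentage change of the SFT score over the base score; the one-liner below reproduces the VocalNet-7B figure quoted above.

```python
# Relative MOS gain: (MOS_sft - MOS_base) / MOS_base * 100.
def relative_gain(base_mos: float, sft_mos: float) -> float:
    return 100.0 * (sft_mos - base_mos) / base_mos

print(round(relative_gain(2.73, 3.59), 1))  # 31.5 -- within the reported 29.12-42.33% range
```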

2. General Conversational Ability

To verify that fine-tuning on UltraVoice enhances rather than compromises general conversational skills, we evaluated our models on URO-Bench, a comprehensive benchmark for spoken dialogue models.

URO-Bench Results

Evaluation of our SFT models (upper part) and existing strong baselines (lower part) on URO-Bench (EN). Und.: Understanding. Conv.: Oral Conversation. Our results confirm that fine-tuning spoken dialogue models on UltraVoice enhances, rather than compromises, general conversational skills. All models showed substantial gains across Understanding, Reasoning, and Oral Conversation, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Notably, the VocalNet-7B SFT model achieves state-of-the-art performance, outperforming strong baselines such as Qwen2.5-Omni-7B and GLM4-Voice-9B, highlighting UltraVoice's practical value beyond style control.


📄 License

These models are licensed under the MIT License. See the LICENSE file for details.


📖 Citation

If you use these models in your research, please consider citing:

@article{tu2025ultravoice,
  title={UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models},
  author={Tu, Wenming and Yang, Guanrou and Yan, Ruiqi and Chen, Wenxi and Ma, Ziyang and Kang, Yipeng and Yu, Kai and Chen, Xie and Zheng, Zilong},
  journal={arXiv preprint arXiv:2510.22588},
  year={2025},
}

🙏 Acknowledgements

This work builds upon several outstanding projects and research contributions:

  • SLAM-LLM: We are grateful to the SLAM-LLM framework for providing a robust toolkit for speech and audio processing with large language models, which served as a foundation for our model training infrastructure.

  • SLAM-Omni: We acknowledge the SLAM-Omni work for pioneering timbre-controllable voice interaction systems and demonstrating effective single-stage training approaches.

  • VocalNet: We thank the VocalNet team for their innovative multi-token prediction approach for speech LLMs, which inspired our baseline model selection and evaluation.

  • EmoVoice: We appreciate the EmoVoice project for advancing emotional text-to-speech synthesis with LLM-based approaches, which informed our controllable TTS validation experiments.

  • URO-Bench: We are grateful for the URO-Bench benchmark, which provided a comprehensive evaluation framework for assessing the general conversational abilities of our fine-tuned spoken dialogue models.

We also thank the open-source community for their valuable tools and datasets that made this research possible.


📧 Contact & Support

For questions, issues, or feedback, please open an issue on the GitHub repository.


🔗 Related Resources


⭐ If you find UltraVoice-SFT models useful, please consider giving us a star on GitHub! ⭐

🎀 Try our models and share your feedback!
