# MGM-Omni-TTS-2B-0927
## Introduction
MGM-Omni is an omni-chatbot that accepts text, image, video, and speech inputs and generates both text and speech responses. It supports long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English. MGM-Omni-TTS-2B-0927 is the SpeechLM component of MGM-Omni used for speech generation. Compared with MGM-Omni-TTS-2B, it is more robust in corner cases such as reading mathematical formulas and URLs. For the MLLM part, please refer to MGM-Omni.
## Main Properties
- Omni-modality support: MGM-Omni accepts audio, video, image, and text inputs, understands long contexts, and can generate both text and speech outputs, making it a truly versatile multi-modal AI assistant.
- Long-form Speech Understanding: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni can handle hour-long speech inputs while delivering superior overall and fine-grained understanding.
- Long-form Speech Generation: With a treasure trove of training data and chunk-based decoding, MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling (a sketch of the chunking idea follows this list).
- Streaming Generation: Thanks to parallel decoding of speech tokens, MGM-Omni produces efficient, smooth streaming audio, making it suitable for live conversations.
- Zero-shot Voice Cloning: Thanks to MGM-Omni's extensive and diverse audio training, you can create a customized voice clone by simply recording a short clip (around 10 seconds) as a reference.
- Fully Open-source: All the code, models, and training data will be released.
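The chunk-based decoding mentioned above can be sketched as follows: split long text at sentence boundaries, then decode chunk by chunk, conditioning each call on the previous chunk so prosody stays continuous across boundaries. The function names, context handling, and chunk budget below are illustrative assumptions, not the actual MGM-Omni implementation:

```python
# Minimal sketch of chunk-based decoding for long-form TTS.
# `synthesize` and MAX_CHARS are hypothetical stand-ins; the real
# chunking/conditioning logic lives in the MGM-Omni repo.
import re

MAX_CHARS = 300  # assumed per-chunk character budget

def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks under the budget."""
    sentences = re.split(r"(?<=[.!?。！？])\s*", text.strip())
    chunks, current = [], ""
    for sent in filter(None, sentences):
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

def generate_long_form(text: str, synthesize) -> list:
    """Decode chunk by chunk, conditioning each call on the previous
    chunk's audio so the voice stays consistent across boundaries."""
    audio_segments, context = [], None
    for chunk in split_into_chunks(text):
        segment = synthesize(chunk, context=context)  # hypothetical call
        audio_segments.append(segment)
        context = segment  # roll the conditioning window forward
    return audio_segments
```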
## Sample Usage
### Zero-Shot Voice Cloning
Generate audio that sounds similar to the provided reference audio.
```bash
python -m mgm.serve.cli_tts \
    --model wcy1122/MGM-Omni-TTS-2B-0927 \
    --ref-audio assets/ref_audio/Man_EN.wav
```
Add --ref-audio-text to supply an exact transcript of the reference audio; otherwise, Whisper-large-v3 is used for automatic transcription.
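For reference, here is a minimal sketch of what such a Whisper-large-v3 fallback looks like with the Hugging Face pipeline API; the repo's own transcription code may differ:

```python
# Transcribe the reference clip when no --ref-audio-text is given.
# Illustrative only; assumes the Hugging Face transformers library.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
)
ref_text = asr("assets/ref_audio/Man_EN.wav")["text"]
print(ref_text)  # could be passed as --ref-audio-text for better accuracy
```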
## Evaluation
### Speech and Audio Understanding
| Model | Date | LS-clean↓ | LS-other↓ | CM-EN↓ | CM-ZH↓ | AISHELL↓ |
|---|---|---|---|---|---|---|
| Mini-Omni2 | 2024-11 | 4.7 | 9.4 | - | - | - |
| Lyra | 2024-12 | 2.0 | 4.0 | - | - | - |
| VITA-1.5 | 2025-01 | 3.4 | 7.5 | - | - | 2.2 |
| Qwen2.5-Omni | 2025-03 | 1.6 | 3.5 | 7.6 | 5.2 | - |
| Ola | 2025-06 | 1.9 | 4.3 | - | - | - |
| MGM-Omni-7B | 2025-08 | 1.7 | 3.6 | 8.8 | 4.5 | 1.9 |
| MGM-Omni-32B | 2025-08 | 1.5 | 3.2 | 8.0 | 4.0 | 1.8 |
This table presents WER and CER results (lower is better) on speech understanding benchmarks. Here LS refers to LibriSpeech and CM refers to Common Voice.
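WER is the word-level edit distance between the hypothesis and reference transcripts, normalized by reference length; CER is the same computed over characters. A minimal reference implementation (the benchmarks above apply their own text normalization first):

```python
# Word error rate via Levenshtein distance over words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# CER: apply the same procedure to character lists instead of words.
```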
| Model | Date | Speech↑ | Sound↑ | Music↑ | Mix↑ | Average↑ |
|---|---|---|---|---|---|---|
| LLaMA-Omni | 2024-08 | 5.2 | 5.3 | 4.3 | 4.0 | 4.7 |
| Mini-Omni2 | 2024-11 | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 |
| IXC2.5-OmniLive | 2024-12 | 1.6 | 1.8 | 1.7 | 1.6 | 1.7 |
| VITA-1.5 | 2025-01 | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 |
| Qwen2.5-Omni | 2025-03 | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 |
| Ola | 2025-06 | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 |
| MGM-Omni-7B | 2025-08 | 7.3 | 6.5 | 6.3 | 6.1 | 6.5 |
| MGM-Omni-32B | 2025-08 | 7.1 | 6.5 | 6.2 | 6.2 | 6.5 |
This table presents evaluation results on AIR-Bench Chat (speech, sound, music, and mixed audio); higher scores are better.
### Speech Generation
| Model | Date | Model Size | CER↓ | SS(ZH)↑ | WER↓ | SS(EN)↑ |
|---|---|---|---|---|---|---|
| CosyVoice2 | 2024-12 | 0.5B | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen2.5-Omni-3B | 2025-03 | 0.5B | 1.58 | 0.744 | 2.51 | 0.635 |
| Qwen2.5-Omni-7B | 2025-03 | 2B | 1.42 | 0.754 | 2.33 | 0.641 |
| MOSS-TTSD-v0 | 2025-06 | 2B | 2.18 | 0.594 | 2.46 | 0.476 |
| HiggsAudio-v2 | 2025-07 | 6B | 1.66 | 0.743 | 2.44 | 0.677 |
| MGM-Omni | 2025-08 | 0.6B | 1.42 | 0.750 | 2.48 | 0.670 |
| MGM-Omni | 2025-08 | 2B | 1.28 | 0.755 | 2.28 | 0.684 |
| MGM-Omni | 2025-08 | 4B | 1.18 | 0.758 | 2.22 | 0.686 |
This table presents evaluation results for speech generation on seed-tts-eval. For Qwen2.5-Omni, model size refers to the size of the talker.
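SS (speaker similarity) is conventionally the cosine similarity between speaker embeddings of the reference and the generated audio, extracted by a speaker verification model. A minimal sketch, assuming a hypothetical `embed(wav)` speaker encoder that returns a fixed-size vector:

```python
# Cosine similarity between two speaker embeddings.
# `embed` is a hypothetical speaker encoder, not part of this repo.
import numpy as np

def speaker_similarity(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    return float(
        np.dot(ref_emb, gen_emb)
        / (np.linalg.norm(ref_emb) * np.linalg.norm(gen_emb))
    )

# Usage sketch: speaker_similarity(embed(ref_wav), embed(gen_wav))
```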
| Model | Date | Model Size | EN WER↓ | ZH CER↓ | EN-hard WER↓ | ZH-hard WER↓ |
|---|---|---|---|---|---|---|
| CosyVoice2 (chunk) | 2024-12 | 0.5B | 14.80 | 5.27 | 42.48 | 32.76 |
| MOSS-TTSD-v0.5 | 2025-06 | 6B | 8.69 | 6.82 | 62.61 | 62.97 |
| HiggsAudio-v2 | 2025-07 | 6B | 27.09 | 31.39 | 98.61 | 98.85 |
| MGM-Omni | 2025-08 | 2B | 4.98 | 5.58 | 26.26 | 23.58 |
This table presents evaluation results for long-form and hard-case speech generation on long-tts-eval.
## Citation
If you find this repo useful for your research, we would appreciate it if you could cite our work:
```bibtex
@article{wang2025mgm,
  title={MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech},
  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Gui, Haokun and Xia, Bin and Li, Jingyao and Yu, Bei and Jia, Jiaya},
  journal={arXiv preprint arXiv:2509.25131},
  year={2025}
}

@inproceedings{zhong2025lyra,
  title={Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition},
  author={Zhong, Zhisheng and Wang, Chengyao and Liu, Yuqi and Yang, Senqiao and Tang, Longxiang and Zhang, Yuechen and Li, Jingyao and Qu, Tianyuan and Li, Yanwei and Chen, Yukang and Yu, Shaozuo and Wu, Sitong and Lo, Eric and Liu, Shu and Jia, Jiaya},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

@article{li2024mgm,
  title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
  author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
  journal={arXiv preprint arXiv:2403.18814},
  year={2024}
}
```