---
license: other
license_name: license-term-of-unified-audio-schema
language:
- en
- zh
tags:
- audio
- speech
- sound
- music
- audio-understanding
- ASR
- audio-captioning
- TTS
- audio-language-model
- audio-llm
- speech-to-text
- text-to-speech
- multimodal
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: audio-text-to-text
---

# Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

**Unified Audio Schema** is a novel holistic framework for audio supervision that disentangles and restructures supervision across **transcription**, **paralinguistics**, and **non-linguistic events**.

📄 [Paper](https://arxiv.org/abs/2604.12506) | 💻 [GitHub](https://github.com/Tencent/Unified_Audio_Schema)

This repository provides our model checkpoints trained using **Unified Audio Schema**. For the complete codebase, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Unified_Audio_Schema).

## Model Details

| Attribute | Value |
|:----------|:------|
| Input Modality | Text and audio |
| Output Modality | Text and audio |
| Base LLM | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
| Audio Encoder | AuT encoder |
| Input Audio Representation Frame Rate | 12.5 Hz |
| Output Audio Token Codebook Size | 8,192 |
| Output Audio Token Frame Rate | 25 Hz |

Notes:
- The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
- Speech waveform reconstruction for generated audio tokens relies on the [StableToken](https://huggingface.co/tencent/StableToken) decoder.
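The frame rates in the table above translate directly into sequence lengths. The following is a minimal sketch of that arithmetic, not code from the repository: the helper names are hypothetical, and rounding up partial frames is an assumption (the actual encoder and tokenizer may pad or truncate differently).

```python
import math

# Rates taken from the Model Details table.
INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frame rate
OUTPUT_TOKEN_RATE_HZ = 25.0  # output audio token frame rate


def input_frames(duration_s: float) -> int:
    """Approximate number of input audio frames for a clip (hypothetical helper)."""
    # Assumption: a partial trailing frame is rounded up.
    return math.ceil(duration_s * INPUT_FRAME_RATE_HZ)


def output_audio_tokens(duration_s: float) -> int:
    """Approximate number of audio tokens needed to generate speech of this length."""
    return math.ceil(duration_s * OUTPUT_TOKEN_RATE_HZ)


def tokens_to_seconds(num_tokens: int) -> float:
    """Duration of speech represented by a number of generated audio tokens."""
    return num_tokens / OUTPUT_TOKEN_RATE_HZ


if __name__ == "__main__":
    print(input_frames(10.0))         # frames for a 10-second input clip
    print(output_audio_tokens(10.0))  # tokens for 10 seconds of generated speech
    print(tokens_to_seconds(250))     # seconds covered by 250 audio tokens
```

Under these assumptions, a 10-second input clip occupies about 125 input frames, while 10 seconds of generated speech takes about 250 audio tokens.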
## Quick Start

### Installation

```bash
git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt
```

### Download Checkpoints

```bash
# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
```

## Inference

```python
import torch
import torchaudio

from src.model import UASAudio

# Load the model together with the StableToken decoder used for waveform reconstruction.
model = UASAudio(
    model_path="checkpoints/Unified_Audio_Schema",
    audio_decoder_path="checkpoints/StableToken/decoder",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# System prompt for speech-input conversation with interleaved text/audio output.
dialogue_system_prompt = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)

messages = [
    {"role": "system", "content": dialogue_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
        ],
    },
    {"role": "assistant", "content": None},
]

# Sampling settings for generation.
generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True,
}

_, text, audio_tokens = model(messages, **generation_config)
print(text)

# Decode the generated audio tokens to a waveform and save it, if any were produced.
if len(audio_tokens) > 0:
    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
    torchaudio.save("response.wav", audio_array, sampling_rate)
```

## Supported Scenarios

Our model can be applied to a wide range of audio understanding and generation tasks, including:

- Text-input conversation
- Speech-input conversation
- Automatic Speech Recognition (ASR)
- Audio captioning
- Text-to-Speech (TTS)

For more runnable examples, please refer to [`example_usage.ipynb`](https://github.com/Tencent/Unified_Audio_Schema/blob/main/example_usage.ipynb) in the GitHub repository.
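The inference system prompt asks for 13 text tokens followed by 52 audio tokens per interleaved chunk. As a rough guide to how much speech a given `max_new_tokens` allows, the sketch below works through the arithmetic. It is an estimate under stated assumptions, not behavior confirmed by the repository: it assumes every chunk is complete, that no special or end-of-sequence tokens count against the limit, and that the model follows the 13/52 pattern exactly. The function name `interleave_budget` is hypothetical.

```python
TEXT_TOKENS_PER_CHUNK = 13    # from the system prompt in the Inference example
AUDIO_TOKENS_PER_CHUNK = 52   # from the system prompt in the Inference example
OUTPUT_TOKEN_RATE_HZ = 25.0   # output audio token frame rate from Model Details


def interleave_budget(max_new_tokens: int) -> dict:
    """Estimate how many full text+audio chunks fit into a generation budget."""
    chunk_size = TEXT_TOKENS_PER_CHUNK + AUDIO_TOKENS_PER_CHUNK
    full_chunks = max_new_tokens // chunk_size  # incomplete trailing chunk is ignored
    audio_tokens = full_chunks * AUDIO_TOKENS_PER_CHUNK
    return {
        "full_chunks": full_chunks,
        "text_tokens": full_chunks * TEXT_TOKENS_PER_CHUNK,
        "audio_tokens": audio_tokens,
        "audio_seconds": audio_tokens / OUTPUT_TOKEN_RATE_HZ,
    }


if __name__ == "__main__":
    budget = interleave_budget(4096)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

With `max_new_tokens=4096`, this estimate gives 63 full chunks and roughly 131 seconds of speech; each chunk of 52 audio tokens corresponds to about 2.08 seconds at 25 Hz.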
## Evaluation Highlights

UAS-Audio demonstrates strong performance on audio understanding, ASR, and TTS benchmarks.

### Audio Understanding

| **Model** | MMSU (Percep.) | MMSU (Reason.) | **MMSU (Overall)** | MMAR (Speech) | MMAR (Sound) | MMAR (Music) | **MMAR (Overall)** | MMAU (Speech) | MMAU (Sound) | MMAU (Music) | **MMAU (Overall)** | **Avg.** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Kimi-Audio](https://github.com/MoonshotAI/Kimi-Audio) | 44.8 | 75.7 | 59.8 | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 |
| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | 42.7 | **77.6** | 58.1 | 59.9 | **58.8** | 40.8 | 56.7 | **70.6** | 78.1 | 65.9 | 71.5 | 62.1 |
| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 42.9 | 73.2 | 57.6 | 61.2 | 54.6 | 42.2 | 56.8 | 68.2 | **79.3** | 68.4 | **72.7** | 61.9 |
| **Ours** | **55.7** | 77.4 | **66.2** | **66.0** | **58.8** | **45.2** | **60.1** | 67.0 | 70.0 | **71.3** | 69.4 | **65.2** |

### ASR & TTS

| Model | ASR (LS-clean) | ASR (AISHELL-1) | TTS (SeedTTS-en) | TTS (SeedTTS-zh) |
| :--- | :---: | :---: | :---: | :---: |
| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | - | - | 2.3 | 1.4 |
| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 1.9 | 1.0 | 2.1 | 3.2 |
| [MiMo-Audio](https://github.com/XiaomiMiMo/MiMo-Audio) | 3.8 | 1.8 | 5.4 | 2.0 |
| **Ours** | 2.2 | 2.3 | 1.7 | 1.4 |

## Citation

If you find Unified Audio Schema or our model useful for your research, please cite:

```bibtex
@misc{zhang2026transcriptionunifiedaudioschema,
  title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs},
  author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
  year={2026},
  eprint={2604.12506},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
  author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=17DNmdQ9aU}
}
```

## License

This project is licensed under the [License Term of Unified_Audio_Schema](LICENSE).