---
license: other
license_name: license-term-of-unified-audio-schema
language:
- en
- zh
tags:
- audio
- speech
- sound
- music
- audio-understanding
- ASR
- audio-captioning
- TTS
- audio-language-model
- audio-llm
- speech-to-text
- text-to-speech
- multimodal
base_model:
- Qwen/Qwen2.5-7B
pipeline_tag: audio-text-to-text
---

# Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

**Unified Audio Schema** is a novel holistic framework for audio supervision that disentangles and restructures supervision across **transcription**, **paralinguistics**, and **non-linguistic events**.

📄 [Paper](https://arxiv.org/abs/2604.12506) | 💻 [GitHub](https://github.com/Tencent/Unified_Audio_Schema)

This repository provides our model checkpoints trained using **Unified Audio Schema**. For the complete codebase, please refer to the corresponding [GitHub repository](https://github.com/Tencent/Unified_Audio_Schema).

## Model Details

| Attribute | Value |
|:----------|:------|
| Input Modality | Text and audio |
| Output Modality | Text and audio |
| Base LLM | [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
| Audio Encoder | AuT encoder |
| Input Audio Representation Frame Rate | 12.5 Hz |
| Output Audio Token Codebook Size | 8,192 |
| Output Audio Token Frame Rate | 25 Hz |

Notes:
- The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
- Speech waveform reconstruction for generated audio tokens relies on the [StableToken](https://huggingface.co/tencent/StableToken) decoder.

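The frame rates and codebook size in the table translate directly into sequence lengths and token bitrates. The sketch below is only back-of-the-envelope arithmetic on those three numbers; the function names are hypothetical and rounding up partial frames is an assumption, not documented behavior of the model.

```python
import math

# Values copied from the Model Details table.
INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frames per second
OUTPUT_FRAME_RATE_HZ = 25.0  # output audio tokens per second
CODEBOOK_SIZE = 8192         # distinct output audio tokens


def input_frames(duration_s: float) -> int:
    """Approximate encoder frames for an input clip (ceil is an assumption)."""
    return math.ceil(duration_s * INPUT_FRAME_RATE_HZ)


def output_tokens(duration_s: float) -> int:
    """Approximate audio tokens needed to generate a clip (ceil is an assumption)."""
    return math.ceil(duration_s * OUTPUT_FRAME_RATE_HZ)


def bits_per_token() -> float:
    """Information carried by one token drawn from the codebook."""
    return math.log2(CODEBOOK_SIZE)


def output_bitrate_bps() -> float:
    """Upper bound on the bitrate of the output token stream."""
    return OUTPUT_FRAME_RATE_HZ * bits_per_token()


print(input_frames(10.0), output_tokens(10.0), bits_per_token(), output_bitrate_bps())
```

Under these assumptions, a 10-second clip occupies about 125 input frames, and generating 10 seconds of speech takes about 250 audio tokens at 13 bits each.
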
## Quick Start

### Installation

```bash
git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt
```

### Download Checkpoints

```bash
# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
```

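The same downloads can be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed; `CHECKPOINTS` and `download_checkpoints` are hypothetical names introduced here, and the repository IDs and target directories are the ones used in the commands above.

```python
from pathlib import Path

# Repo ID -> local directory, mirroring the CLI commands above.
CHECKPOINTS = {
    "tencent/Unified_Audio_Schema": Path("checkpoints/Unified_Audio_Schema"),
    "tencent/StableToken": Path("checkpoints/StableToken"),
}


def download_checkpoints(checkpoints=CHECKPOINTS):
    """Download every repo in `checkpoints` and return the local paths."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    local_paths = []
    for repo_id, local_dir in checkpoints.items():
        local_paths.append(
            snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
        )
    return local_paths
```

Calling `download_checkpoints()` fetches both repositories into the `checkpoints/` layout that the inference example expects.
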
## Inference

```python
import torch
import torchaudio
from src.model import UASAudio

# Load the model together with the StableToken decoder used for waveform reconstruction.
model = UASAudio(
    model_path="checkpoints/Unified_Audio_Schema",
    audio_decoder_path="checkpoints/StableToken/decoder",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# System prompt that requests interleaved text/audio output for spoken dialogue.
dialogue_system_prompt = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)

# The user turn carries the spoken instruction; the assistant turn is left empty for generation.
messages = [
    {"role": "system", "content": dialogue_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
        ],
    },
    {"role": "assistant", "content": None},
]

# Sampling settings for generation.
generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True,
}

# The call returns the generated text and audio tokens; the first value is unused here.
_, text, audio_tokens = model(messages, **generation_config)
print(text)

# Decode the audio tokens to a waveform only if the model produced any.
if len(audio_tokens) > 0:
    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
    torchaudio.save("response.wav", audio_array, sampling_rate)
```

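The system prompt above asks for 13 text tokens followed by 52 audio tokens per chunk. Combined with the 25 Hz output token rate listed in Model Details, this fixes how much speech each chunk covers. The sketch below only illustrates that arithmetic; the helper names are hypothetical, and how the model handles a final partial chunk is not specified here.

```python
from typing import List, Tuple

TEXT_TOKENS_PER_CHUNK = 13    # from the system prompt
AUDIO_TOKENS_PER_CHUNK = 52   # from the system prompt
OUTPUT_FRAME_RATE_HZ = 25.0   # from the Model Details table


def audio_seconds(num_audio_tokens: int) -> float:
    """Duration of speech represented by a number of output audio tokens."""
    return num_audio_tokens / OUTPUT_FRAME_RATE_HZ


def interleave_layout(num_chunks: int) -> List[Tuple[str, int]]:
    """Nominal (modality, token count) sequence for `num_chunks` full chunks."""
    layout = []
    for _ in range(num_chunks):
        layout.append(("text", TEXT_TOKENS_PER_CHUNK))
        layout.append(("audio", AUDIO_TOKENS_PER_CHUNK))
    return layout


print(audio_seconds(AUDIO_TOKENS_PER_CHUNK))
print(interleave_layout(2))
```

Each full chunk therefore corresponds to roughly 2.08 seconds of speech, with four audio tokens emitted per text token.
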
## Supported Scenarios

Our model can be applied to a wide range of audio understanding and generation tasks, including:

- Text-input conversation
- Speech-input conversation
- Automatic Speech Recognition (ASR)
- Audio captioning
- Text-to-Speech (TTS)

For more runnable examples, please refer to [`example_usage.ipynb`](https://github.com/Tencent/Unified_Audio_Schema/blob/main/example_usage.ipynb) in the GitHub repository.

## Evaluation Highlights

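UAS-Audio achieves the highest average score (65.2) among the compared models across the three audio understanding benchmarks below and remains competitive on ASR and TTS, where the reported numbers are error rates (lower is better). In the audio understanding table, bold marks the best score in each column and underline marks the second best.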

### Audio Understanding

| **Model** | MMSU<br>(Percep.) | MMSU<br>(Reason.) | **MMSU<br>(Overall)** | MMAR<br>(Speech) | MMAR<br>(Sound) | MMAR<br>(Music) | **MMAR<br>(Overall)** | MMAU<br>(Speech) | MMAU<br>(Sound) | MMAU<br>(Music) | **MMAU<br>(Overall)** | **Avg.** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Kimi-Audio](https://github.com/MoonshotAI/Kimi-Audio) | <u>44.8</u> | 75.7 | <u>59.8</u> | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 |
| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | 42.7 | **77.6** | 58.1 | 59.9 | **58.8** | 40.8 | 56.7 | **70.6** | <u>78.1</u> | 65.9 | <u>71.5</u> | <u>62.1</u> |
| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 42.9 | 73.2 | 57.6 | <u>61.2</u> | 54.6 | <u>42.2</u> | <u>56.8</u> | <u>68.2</u> | **79.3** | <u>68.4</u> | **72.7** | 61.9 |
| **Ours** | **55.7** | <u>77.4</u> | **66.2** | **66.0** | **58.8** | **45.2** | **60.1** | 67.0 | 70.0 | **71.3** | 69.4 | **65.2** |

### ASR & TTS

| Model | ASR<br>(LS-clean, WER %) | ASR<br>(AISHELL-1, CER %) | TTS<br>(SeedTTS-en, WER %) | TTS<br>(SeedTTS-zh, CER %) |
| :--- | :---: | :---: | :---: | :---: |
| [Qwen2.5-Omni](https://github.com/QwenLM/Qwen2.5-Omni) | - | - | 2.3 | 1.4 |
| [Step-Audio2](https://github.com/stepfun-ai/Step-Audio2) | 1.9 | 1.0 | 2.1 | 3.2 |
| [MiMo-Audio](https://github.com/XiaomiMiMo/MiMo-Audio) | 3.8 | 1.8 | 5.4 | 2.0 |
| **Ours** | 2.2 | 2.3 | 1.7 | 1.4 |

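For readers unfamiliar with the metric, the sketch below shows how a word error rate is conventionally computed. It assumes the English columns above report WER in percent, as is standard for these benchmarks; it is not the evaluation script used to produce the table, and it omits the text normalization that real evaluations apply. For Chinese, the same edit distance is typically computed over characters rather than words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(100 * word_error_rate("the great wall of china", "the great wall china"))
```
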
## Citation

If you find Unified Audio Schema or our model useful for your research, please cite:

```bibtex
@misc{zhang2026transcriptionunifiedaudioschema,
	title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs}, 
	author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
	year={2026},
	eprint={2604.12506},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
	title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
	author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
	booktitle={The Fourteenth International Conference on Learning Representations},
	year={2026},
	url={https://openreview.net/forum?id=17DNmdQ9aU}
}
```

## License

This project is licensed under the [License Term of Unified_Audio_Schema](LICENSE).