---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen2-Audio-7B-Instruct
---

# NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, Andre Freitas, Qifan Wang, Zenglin Xu, Rongjunchen Zhang, Yong Dai

Corresponding authors: daiyongya@outlook.com, zhangrongjunchen@myhexin.com

📖 Paper | 🤗 Model | 🤗 Training Data (Coming Soon)

NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding into a single modular framework. Human perception integrates sight, sound, and language — NEXUS-O aims to replicate this ability for intelligent agents across real-world scenarios such as ASR, Speech-to-Speech Chat, and Multimodal Reasoning.
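A minimal inference sketch is shown below. The repository ID, message schema, and processor signature are placeholders modeled on the Qwen-style chat interface of the base models, not the official NEXUS-O API; check the released checkpoint files for the exact entry points.

```python
# Hypothetical quick-start sketch, NOT the official API: the repo ID below is a
# placeholder, and the processor/message interface is assumed to follow the
# Qwen-style chat template used by the base models.
import torch
import librosa
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "ORG/NEXUS-O"  # placeholder; replace with the actual HuggingFace repo

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# One user turn mixing vision, audio, and text (Qwen-style content list).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "scene.jpg"},
        {"type": "audio", "audio": "speech.wav"},
        {"type": "text", "text": "Describe the image, then transcribe the speech."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("scene.jpg")
audio, _ = librosa.load("speech.wav", sr=16000)  # 16 kHz mono, as Whisper-style encoders expect

inputs = processor(text=[prompt], images=[image], audios=[audio],
                   sampling_rate=16000, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```

For speech-to-speech chat, the released audio decoder would additionally turn generated speech tokens back into a waveform; that entry point is not sketched here.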

*Architecture of NEXUS-O*
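As a mental model of the figure above, the following minimal PyTorch sketch shows how per-modality encoders project into a shared LLM embedding space and an audio head maps hidden states back to speech-codec tokens. Module names and dimensions are assumptions (Qwen-sized defaults), not the released implementation.

```python
# Illustrative sketch only: module names and shapes are assumptions, not the
# released NEXUS-O code. It shows the modular encoder-LLM-decoder pattern this
# card describes: modality features are projected into the LLM's token space,
# and an audio head turns LLM hidden states into speech-codec token logits.
import torch
import torch.nn as nn

class OmniModel(nn.Module):
    def __init__(self, llm: nn.Module, d_model: int = 3584,
                 d_vision: int = 1280, d_audio: int = 1280, n_codec: int = 4096):
        super().__init__()
        self.llm = llm                                   # language backbone (assumed HF-style)
        self.vision_proj = nn.Linear(d_vision, d_model)  # vision features -> LLM space
        self.audio_proj = nn.Linear(d_audio, d_model)    # audio features -> LLM space
        self.audio_head = nn.Linear(d_model, n_codec)    # LLM space -> codec token logits

    def forward(self, text_emb, vision_feats=None, audio_feats=None):
        # Prepend whichever modalities are present; absent ones are simply
        # skipped, which is what "flexible modality combinations" amounts to here.
        parts = [text_emb]
        if vision_feats is not None:
            parts.insert(0, self.vision_proj(vision_feats))
        if audio_feats is not None:
            parts.insert(0, self.audio_proj(audio_feats))
        hidden = self.llm(inputs_embeds=torch.cat(parts, dim=1)).last_hidden_state
        return self.audio_head(hidden)
```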

*Training Stages*

## 📢 News

- 🚀 [08/01/2025] Our paper has been accepted to ACM MM 2025.

## 💡 Highlights

- 🧩 **Modular End-to-End Framework.** A highly configurable encoder–LLM–decoder architecture supporting flexible modality combinations and rapid iteration for industry applications.
- 💡 **Lightweight Alignment Strategy.** Efficient audio–language pre-training built upon the state-of-the-art Qwen2.5-VL model, eliminating the need for costly vision pre-training while retaining strong tri-modal performance.
- 🎧 **Synthetic Audio Data Pipeline.** A scalable audio synthesis system that generates diverse, high-fidelity audio-text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.

## TODO

* [x] Release NEXUS-O full model weights on HuggingFace
* [ ] Release Audio Encoder Training Data
* [ ] Release Audio Decoder Training Data

## ✒️ Citation

```
@article{liu2025nexus,
  title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision},
  author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others},
  journal={arXiv preprint arXiv:2503.01879},
  year={2025}
}
```

## 📄 License

![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)
![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)

**Usage and License Notices**: The data and code are intended and licensed for research use only. The data are released under CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International) and must also be used in accordance with OpenAI's terms of use: https://openai.com/policies/terms-of-use

## 💖 Acknowledgement