metadata
license: apache-2.0
pipeline_tag: any-to-any
library_name: transformers
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Omni-Diffusion is the first any-to-any multimodal language model built entirely on a mask-based discrete diffusion model. It unifies understanding and generation across text, speech, and images by modeling a joint distribution over discrete multimodal tokens.
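To make the idea of mask-based discrete diffusion concrete, here is a toy sketch of iterative unmasking over a sequence of discrete tokens. It is not the Omni-Diffusion sampler: the `MASK` id, the `toy_denoiser` stand-in (which returns random predictions instead of running a trained network), and the confidence-based unmasking schedule are all assumptions chosen for illustration.

```python
import random

MASK = -1  # hypothetical mask token id, chosen for this sketch only


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained network (assumption).

    Returns one (predicted_token, confidence) pair per position.
    A real model would condition on the unmasked tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_diffusion_sample(length, vocab_size, steps, seed=0):
    """Start from an all-mask sequence and reveal tokens over `steps` rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Commit the k masked positions with the highest confidence.
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            tokens[i] = preds[i][0]
    return tokens


sample = masked_diffusion_sample(length=10, vocab_size=16, steps=4)
```

In a multimodal setting, the same loop would run over a shared vocabulary of text, speech, and image tokens; how Omni-Diffusion schedules unmasking in practice is described in the paper and repository, not here.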
- Paper: Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
- Project Page: https://omni-diffusion.github.io
- Repository: https://github.com/VITA-MLLM/Omni-Diffusion
Usage
Because the model uses a custom architecture, load it through the transformers library with trust_remote_code=True, which allows the model code shipped in the model repository to run:
from transformers import AutoModel
model = AutoModel.from_pretrained("lijiang/Omni-Diffusion", trust_remote_code=True)
For detailed inference instructions and environment setup (including required image and audio tokenizers), please refer to the official GitHub repository.
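The loading call above can be wrapped in a small helper, sketched below. The names `MODEL_ID`, `load_kwargs`, and `load_model` are hypothetical, and the `model.eval()` call assumes the remote class behaves like a standard PyTorch module; neither is confirmed by the model card.

```python
MODEL_ID = "lijiang/Omni-Diffusion"


def load_kwargs():
    """Keyword arguments passed to from_pretrained.

    trust_remote_code=True is needed because the architecture is custom.
    """
    return {"trust_remote_code": True}


def load_model(model_id=MODEL_ID):
    """Download and instantiate the model (hypothetical helper)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel

    model = AutoModel.from_pretrained(model_id, **load_kwargs())
    # Assumption: the returned object is a torch.nn.Module, so eval() is available.
    model.eval()
    return model
```

Calling `load_model()` downloads the weights and executes Python code from the model repository, so it is worth reviewing that code before running it. The image and audio tokenizers still have to be set up separately, as described in the GitHub repository.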
Citation
If you find this work helpful for your research, please consider citing:
@article{li2026omni,
title={Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion},
author={Li, Lijiang and Long, Zuwei and Shen, Yunhang and Gao, Heting and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran and Fu, Chaoyou},
journal={arXiv preprint arXiv:2603.06577},
year={2026}
}