|
|
--- |
|
|
pipeline_tag: text-to-video |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Wan-AI/Wan2.1-T2V-14B |
|
|
tags: |
|
|
- Audio-driven |
|
|
--- |
|
|
<div align="center"> |
|
|
<h1>OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation</h1> |
|
|
|
|
|
|
|
|
[Qijun Gan](https://agnjason.github.io/) · [Ruizi Yang](https://github.com/ZiziAmy/) · [Jianke Zhu](https://person.zju.edu.cn/en/jkzhu) · Shaofei Xue · [Steven Hoi](https://scholar.google.com/citations?user=JoLjflYAAAAJ)
|
|
|
|
|
Zhejiang University, Alibaba Group |
|
|
|
|
|
<div align="center"> |
|
|
<a href="https://omni-avatar.github.io/"><img src="https://img.shields.io/badge/Project-OmniAvatar-blue.svg"></a>   |
|
|
<a href="http://arxiv.org/abs/2506.18866"><img src="https://img.shields.io/badge/Arxiv-2506.18866-b31b1b.svg?logo=arXiv"></a>   |
|
|
<a href="https://huggingface.co/OmniAvatar/OmniAvatar-14B"><img src="https://img.shields.io/badge/🤗-OmniAvatar-red.svg"></a> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
 |
|
|
|
|
|
## 🔥 Latest News!! |
|
|
* June 24, 2025: We released the inference code and model weights!
|
|
|
|
|
|
|
|
## Quickstart |
|
|
### 🛠️Installation |
|
|
|
|
|
Clone the repo: |
|
|
|
|
|
```sh
|
|
git clone [email protected]:Omni-Avatar/OmniAvatar.git |
|
|
cd OmniAvatar |
|
|
``` |
|
|
|
|
|
Install dependencies: |
|
|
```sh
|
|
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124 |
|
|
pip install -r requirements.txt |
|
|
# Optional: install flash_attn to accelerate attention computation
|
|
pip install flash_attn |
|
|
``` |
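
Optionally, run a quick sanity check that PyTorch can see your GPU (a generic check, not specific to OmniAvatar):

```sh
# Prints the installed torch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```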
|
|
|
|
|
### 🧱Model Download |
|
|
| Models | Download Link | Notes |
|----------------------|---------------------------------------------------------------------|--------------------------------------|
| Wan2.1-T2V-14B | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) | Base model for 14B |
| OmniAvatar model 14B | 🤗 [Huggingface](https://huggingface.co/OmniAvatar/OmniAvatar-14B) | Our LoRA and audio condition weights |
| Wav2Vec | 🤗 [Huggingface](https://huggingface.co/facebook/wav2vec2-base-960h) | Audio encoder |
|
|
|
|
|
Download models using huggingface-cli: |
|
|
``` sh |
|
|
mkdir pretrained_models |
|
|
pip install "huggingface_hub[cli]" |
|
|
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B |
|
|
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h |
|
|
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B |
|
|
``` |
|
|
|
|
|
#### File structure (example for 14B)
|
|
```shell |
|
|
OmniAvatar |
|
|
├── pretrained_models |
|
|
│ ├── Wan2.1-T2V-14B |
|
|
│ │ ├── ... |
|
|
│ ├── OmniAvatar-14B |
|
|
│ │ ├── config.json |
|
|
│ │ └── pytorch_model.pt |
|
|
│ └── wav2vec2-base-960h |
|
|
│ ├── ... |
|
|
``` |
|
|
|
|
|
### 🔑 Inference |
|
|
|
|
|
|
|
|
``` sh |
|
|
# 480p only for now |
|
|
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt |
|
|
``` |
|
|
|
|
|
#### 💡Tips |
|
|
- You can control the character's behavior through the prompt in `examples/infer_samples.txt`; each line follows the format `[prompt]@@[img_path]@@[audio_path]` (see the sample line below). **The recommended range for the prompt and audio CFG is [4-6]. You can increase the audio CFG to achieve more consistent lip-sync.**
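
A minimal sketch of one input line; the prompt, image path, and audio path here are hypothetical placeholders:

```
A woman speaks to the camera, gesturing naturally@@examples/images/woman.png@@examples/audios/speech.wav
```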
|
|
|
|
|
- To control the prompt guidance and audio guidance separately, set `audio_scale` (e.g. `audio_scale=3`); in that case `guidance_scale` only controls the prompt (see the sketch below).
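
For instance, a single-GPU run that sets the two scales separately might look like this (a sketch, assuming `audio_scale` is passed as an `--hp` override like the other hyperparameters further below):

```sh
# guidance_scale steers the prompt; audio_scale steers lip-sync strength
torchrun --standalone --nproc_per_node=1 scripts/inference.py \
    --config configs/inference.yaml --input_file examples/infer_samples.txt \
    --hp=guidance_scale=4.5,audio_scale=3
```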
|
|
|
|
|
- To speed up inference, you can lower `num_steps`; the recommended range is [20-50], and more steps bring higher quality. For multi-GPU inference, just set `sp_size=$GPU_NUM`. To use [TeaCache](https://github.com/ali-vilab/TeaCache), you can set `tea_cache_l1_thresh=0.14`; the recommended range is [0.05-0.15]. A combined sketch follows this bullet.
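
As a sketch, a 4-GPU run with TeaCache enabled could look like this (values picked from the recommended ranges above):

```sh
# sp_size matches the number of GPUs; tea_cache_l1_thresh enables TeaCache
torchrun --standalone --nproc_per_node=4 scripts/inference.py \
    --config configs/inference.yaml --input_file examples/infer_samples.txt \
    --hp=sp_size=4,num_steps=30,tea_cache_l1_thresh=0.14
```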
|
|
- To reduce GPU memory usage, you can set `use_fsdp=True` and `num_persistent_param_in_dit`. An example command is as follows:
|
|
```bash |
|
|
torchrun --standalone --nproc_per_node=8 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt --hp=sp_size=8,max_tokens=30000,guidance_scale=4.5,overlap_frame=13,num_steps=25,use_fsdp=True,tea_cache_l1_thresh=0.14,num_persistent_param_in_dit=7000000000 |
|
|
``` |
|
|
|
|
|
The table below gives detailed measurements; the model is tested on an A800 GPU.
|
|
|
|
|
|`model_size`|`torch_dtype`|`GPU_NUM`|`use_fsdp`|`num_persistent_param_in_dit`|Speed|Required VRAM|
|-|-|-|-|-|-|-|
|14B|torch.bfloat16|1|False|None (unlimited)|16.0s/it|36G|
|14B|torch.bfloat16|1|False|7*10**9 (7B)|19.4s/it|21G|
|14B|torch.bfloat16|1|False|0|22.1s/it|8G|
|14B|torch.bfloat16|4|True|None (unlimited)|4.8s/it|14.3G|
|
|
|
|
|
We trained the 14B model with a `30000`-token budget for `480p` videos. We found that using more tokens at inference can also give good results; you can try `60000` or `80000`. The overlap `overlap_frame` can be set to `1` or `13`; `13` gives more coherent generation, but error propagation is more severe. A sketch combining these options follows.
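
For instance, to generate with a longer token budget and the more coherent overlap, you might run (a sketch; the flags follow the `--hp` override syntax above):

```sh
# Larger max_tokens than training; overlap_frame=13 favors coherence
torchrun --standalone --nproc_per_node=1 scripts/inference.py \
    --config configs/inference.yaml --input_file examples/infer_samples.txt \
    --hp=max_tokens=60000,overlap_frame=13,num_steps=25
```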
|
|
|
|
|
- ❕Prompts are also very important. A recommended structure is `[Description of first frame]` - `[Description of human behavior]` - `[Description of background (optional)]`; an illustrative example follows.
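
A purely illustrative prompt following that structure:

```
A young woman stands in a sunlit studio, facing the camera - she speaks enthusiastically, gesturing with both hands - soft bokeh lights in the background
```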
|
|
|
|
|
## 🧩 Community Works |
|
|
We ❤️ contributions from the open-source community! If your work has improved OmniAvatar, please let us know, or e-mail [[email protected]](mailto:[email protected]) directly. We are happy to reference your project for everyone's convenience. **🥸Have Fun!**
|
|
|
|
|
## 🔗Citation |
|
|
If you find this repository useful, please consider giving it a star ⭐ and a citation:
|
|
``` |
|
|
@misc{gan2025omniavatar, |
|
|
title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation}, |
|
|
author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi}, |
|
|
year={2025}, |
|
|
eprint={2506.18866}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2506.18866}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
Thanks to [Wan2.1](https://github.com/Wan-Video/Wan2.1), [FantasyTalking](https://github.com/Fantasy-AMAP/fantasy-talking) and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated. |