---
base_model:
- alibaba-pai/Wan2.1-Fun-1.3B-InP
license: apache-2.0
pipeline_tag: video-to-video
library_name: diffusers
---
# ROSE: Remove Objects with Side Effects in Videos
This repository contains the finetuned WanTransformer3D weights for **ROSE**, a model for removing objects with side effects in videos.
[Paper](https://huggingface.co/papers/2508.18633) | [Project Page](https://rose2025-inpaint.github.io/) | [Code](https://github.com/Kunbyte-AI/ROSE) | [Demo](https://huggingface.co/spaces/Kunbyte/ROSE)
## Abstract
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, short for Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirrors. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on the various kinds of side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.
## Dependencies and Installation
1. **Clone Repo**
```bash
git clone https://github.com/Kunbyte-AI/ROSE.git
```
2. **Create Conda Environment and Install Dependencies**
```bash
# create new anaconda env
conda create -n rose python=3.12 -y
conda activate rose
# install python dependencies
pip3 install -r requirements.txt
```
- CUDA = 12.4
- PyTorch = 2.6.0
- Torchvision = 0.21.0
- Other required packages in `requirements.txt`
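A quick way to confirm your environment matches these versions is a minimal sanity-check sketch like the following (the expected version strings are the ones listed above):
```python
# Sanity check: verify that torch/torchvision and CUDA match the versions above.
import torch
import torchvision

print("torch:", torch.__version__)              # expected: 2.6.0
print("torchvision:", torchvision.__version__)  # expected: 0.21.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime:", torch.version.cuda)      # expected: 12.4
```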
## Usage (Quick Test)
To get started, you need to prepare the pretrained models first.
1. **Prepare pretrained models**
We use the pretrained [`Wan2.1-Fun-1.3B-InP`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) as our base model. During training, we only train the WanTransformer3D part and keep the other parts frozen. You can download the ROSE Transformer3D weights from this [`link`](https://huggingface.co/Kunbyte/ROSE).
For local inference, the `weights` directory should be arranged like this:
```
weights
└── transformer
    ├── config.json
    └── diffusion_pytorch_model.safetensors
```
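If you prefer fetching the weights programmatically, here is a minimal sketch using `huggingface_hub` (it assumes the Hub repository's file layout matches the `weights` tree above; adjust `local_dir` if it differs):
```python
# Sketch: download the ROSE Transformer3D weights from the Hugging Face Hub.
# Assumes the repo layout matches the `weights` tree shown above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Kunbyte/ROSE", local_dir="weights")
```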
You also need to place the base model in the `models` directory. You can download the Wan2.1-Fun-1.3B-InP base model from this [`link`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP).
The `models` directory should be arranged like this:
```
models
└── Wan2.1-Fun-1.3B-InP
    ├── google
    │   └── umt5-xxl
    │       ├── spiece.model
    │       ├── special_tokens_map.json
    │       └── ...
    ├── xlm-roberta-large
    │   ├── sentencepiece.bpe.model
    │   ├── tokenizer_config.json
    │   └── ...
    ├── config.json
    ├── configuration.json
    ├── diffusion_pytorch_model.safetensors
    ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
    ├── models_t5_umt5-xxl-enc-bf16.pth
    └── Wan2.1_VAE.pth
```
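The base model can be fetched the same way (a hedged sketch; the target directory simply mirrors the `models` layout above):
```python
# Sketch: download the Wan2.1-Fun-1.3B-InP base model into models/.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-1.3B-InP",
    local_dir="models/Wan2.1-Fun-1.3B-InP",
)
```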
2. **Run Inference**
We provide some examples in the [`data/eval`](https://github.com/Kunbyte-AI/ROSE/tree/main/data/eval) folder. Run the following command to try it out:
```shell
python inference.py \
--validation_videos "path/to/your/video.mp4" \
--validation_masks "path/to/your/mask.mp4" \
--validation_prompts "" \
--output_dir "./output" \
--video_length 17 \
--sample_size 480 720
```
For more options, refer to the usage information in the GitHub repository:
```
Usage:
python inference.py [options]
Options:
--validation_videos Path(s) to input videos
--validation_masks Path(s) to mask videos
--validation_prompts Text prompts (default: [""])
--output_dir Output directory
--video_length Number of frames per video (must be of the form 16n+1, e.g., 17 or 49)
--sample_size Frame size: height width (default: 480 720)
```
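To process a folder of paired videos and masks, a small driver loop over the documented flags can help. This is a hedged sketch: the `videos/` and `masks/` directory names and the same-filename pairing convention are our assumptions, not part of the repository:
```python
# Hypothetical batch driver: runs inference.py once per (video, mask) pair.
# Assumes masks/ holds a mask video with the same filename as each input video.
import subprocess
from pathlib import Path

video_dir = Path("videos")
mask_dir = Path("masks")

for video in sorted(video_dir.glob("*.mp4")):
    mask = mask_dir / video.name
    if not mask.exists():
        print(f"skipping {video.name}: no matching mask")
        continue
    subprocess.run([
        "python", "inference.py",
        "--validation_videos", str(video),
        "--validation_masks", str(mask),
        "--validation_prompts", "",
        "--output_dir", "./output",
        "--video_length", "17",   # must be 16n+1
        "--sample_size", "480", "720",
    ], check=True)
```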
An interactive demo is also available on [Hugging Face Spaces](https://huggingface.co/spaces/Kunbyte/ROSE).
## Results
### Shadow
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/output.gif" width="100%"> </td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
### Reflection
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/output.gif" width="100%"></td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
### Common
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/output.gif" width="100%"></td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
### Light Source
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/output.gif" width="100%"></td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
### Translucent
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/output.gif" width="100%"></td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
### Mirror
<table>
<thead>
<tr>
<th>Masked Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/output.gif" width="100%"></td>
</tr>
<tr>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/masked.gif" width="100%"></td>
<td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/output.gif" width="100%"></td>
</tr>
</tbody>
</table>
## Overview

## Citation
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@article{miao2025rose,
  title={ROSE: Remove Objects with Side Effects in Videos},
  author={Miao, Chenxuan and Feng, Yutong and Zeng, Jianshu and Gao, Zixiang and Liu, Hantang and Yan, Yunfeng and Qi, Donglian and Chen, Xi and Wang, Bin and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2508.18633},
  year={2025}
}
```
## Acknowledgement
This code is based on [Wan2.1-Fun-1.3B-Inpaint](https://github.com/aigc-apps/VideoX-Fun), and some code is borrowed from [ProPainter](https://github.com/sczhou/ProPainter). Thanks for their awesome work!