---
base_model:
- alibaba-pai/Wan2.1-Fun-1.3B-InP
license: apache-2.0
pipeline_tag: video-to-video
library_name: diffusers
---

# ROSE: Remove Objects with Side Effects in Videos

This repository contains the finetuned WanTransformer3D weights for **ROSE**, a model that removes objects, along with their side effects (e.g., shadows and reflections), from videos.

[![Paper](https://img.shields.io/badge/Paper-arXiv:2508.18633-b31b1b?logo=arxiv)](https://huggingface.co/papers/2508.18633)
[![Project Page](https://img.shields.io/badge/Project%20Page-ROSE-1f6feb?logo=githubpages)](https://rose2025-inpaint.github.io/)
[![Code](https://img.shields.io/badge/Code-GitHub-181717?logo=github)](https://github.com/Kunbyte-AI/ROSE)
[![Demo](https://img.shields.io/badge/Demo-HuggingFace-FFD21E?logo=huggingface)](https://huggingface.co/spaces/Kunbyte/ROSE)
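
Since this repository ships the transformer in the standard `diffusers` layout, the weights can be loaded directly. A minimal sketch, assuming a recent `diffusers` release with Wan 2.1 support and that the Hub repo exposes the `transformer/` subfolder shown under Usage below; the rest of the pipeline (VAE, text encoders) comes from the frozen base model:

```python
import torch
from diffusers import WanTransformer3DModel

# Load the finetuned ROSE transformer from this repository.
# `subfolder` matches the weight layout described in the Usage section.
transformer = WanTransformer3DModel.from_pretrained(
    "Kunbyte/ROSE",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
print(sum(p.numel() for p in transformer.parameters()) / 1e9, "B parameters")
```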

## Abstract
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirrors. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model's performance on various kinds of side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.

## Dependencies and Installation

1.  **Clone Repo**
    ```bash
    git clone https://github.com/Kunbyte-AI/ROSE.git
    ```

2.  **Create Conda Environment and Install Dependencies**
    ```bash
    # create new anaconda env
    conda create -n rose python=3.12 -y
    conda activate rose

    # install python dependencies
    pip3 install -r requirements.txt
    ```
    -   CUDA = 12.4
    -   PyTorch = 2.6.0
    -   Torchvision = 0.21.0
    -   Other required packages in `requirements.txt`
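
    A quick sanity check that the installed versions match the pins above (a small snippet, not part of the repo):
    ```python
    import torch
    import torchvision

    # Confirm the environment matches the pinned versions above.
    print("torch:", torch.__version__)              # expected: 2.6.0
    print("torchvision:", torchvision.__version__)  # expected: 0.21.0
    print("CUDA runtime:", torch.version.cuda)      # expected: 12.4
    print("GPU available:", torch.cuda.is_available())
    ```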

## Usage (Quick Test)

To get started, first prepare the pretrained models.

1.  **Prepare pretrained models**
    We use the pretrained [`Wan2.1-Fun-1.3B-InP`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) as our base model. During training, we only train the WanTransformer3D part and keep the other parts frozen. You can download the finetuned Transformer3D weights of ROSE from this [`link`](https://huggingface.co/Kunbyte/ROSE).

    For local inference, the `weights` directory should be arranged like this:
    ```
    weights
     ├── transformer
       ├── config.json
       ├── diffusion_pytorch_model.safetensors
    ```

    You also need to place the base model in the `models` directory. Download the Wan2.1-Fun-1.3B-InP base model from this [`link`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP); a download sketch follows the layout below.

    The `models` directory should be arranged like this:
    ```
    models
     ├── Wan2.1-Fun-1.3B-InP
       ├── google
         ├── umt5-xxl
           ├── spiece.model
           ├── special_tokens_map.json
               ...
       ├── xlm-roberta-large
         ├── sentencepiece.bpe.model
         ├── tokenizer_config.json
             ...
       ├── config.json
       ├── configuration.json
       ├── diffusion_pytorch_model.safetensors
       ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
       ├── models_t5_umt5-xxl-enc-bf16.pth
       ├── Wan2.1_VAE.pth
    ```
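
    Both repositories can be fetched into the layouts above with `huggingface_hub` (a sketch; the `local_dir` values are assumptions matching the trees shown):
    ```python
    from huggingface_hub import snapshot_download

    # Finetuned ROSE WanTransformer3D weights -> ./weights
    snapshot_download(repo_id="Kunbyte/ROSE", local_dir="weights")

    # Frozen Wan2.1-Fun-1.3B-InP base model -> ./models/Wan2.1-Fun-1.3B-InP
    snapshot_download(
        repo_id="alibaba-pai/Wan2.1-Fun-1.3B-InP",
        local_dir="models/Wan2.1-Fun-1.3B-InP",
    )
    ```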

2.  **Run Inference**
    We provide some examples in the [`data/eval`](https://github.com/Kunbyte-AI/ROSE/tree/main/data/eval) folder. Run the following command to try it out:
    ```shell
    python inference.py \
      --validation_videos "path/to/your/video.mp4" \
      --validation_masks "path/to/your/mask.mp4" \
      --validation_prompts "" \
      --output_dir "./output" \
      --video_length 17 \
      --sample_size 480 720
    ```
    For more options, refer to the usage information in the GitHub repository:
    ```
    Usage:

    python inference.py [options]

    Options:
      --validation_videos  Path(s) to input videos
      --validation_masks   Path(s) to mask videos
      --validation_prompts Text prompts (default: [""])
      --output_dir         Output directory
      --video_length       Number of frames per video (must be of the form 16n+1)
      --sample_size        Frame size: height width (default: 480 720)

    ```
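
    Since `--video_length` must be of the form 16n+1, here is a small hypothetical helper (not part of the repo) that snaps an arbitrary frame count to the nearest valid value:
    ```python
    def valid_video_length(n_frames: int) -> int:
        """Round n_frames down to the nearest 16n+1 length (minimum 17)."""
        return max(17, ((n_frames - 1) // 16) * 16 + 1)

    print(valid_video_length(100))  # 97
    print(valid_video_length(17))   # 17
    ```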
    An interactive demo is also available on [Hugging Face Spaces](https://huggingface.co/spaces/Kunbyte/ROSE).

## Results

### Shadow
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/output.gif" width="100%"> </td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

### Reflection
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/output.gif" width="100%"></td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

### Common
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/output.gif" width="100%"></td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

### Light Source
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/output.gif" width="100%"></td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

### Translucent
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/output.gif" width="100%"></td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

### Mirror
<table>
  <thead>
    <tr>
      <th>Masked Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/output.gif" width="100%"></td>
    </tr>
    <tr>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/masked.gif" width="100%"></td>
      <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/output.gif" width="100%"></td>
    </tr>
  </tbody>
</table>

## Overview
![overall_structure](https://github.com/Kunbyte-AI/ROSE/raw/main/assets/rose_pipeline.png)

## Citation

If you find our repo useful for your research, please consider citing our paper:

```bibtex
@article{miao2025rose,
   title={ROSE: Remove Objects with Side Effects in Videos}, 
   author={Miao, Chenxuan and Feng, Yutong and Zeng, Jianshu and Gao, Zixiang and Liu, Hantang and Yan, Yunfeng and Qi, Donglian and Chen, Xi and Wang, Bin and Zhao, Hengshuang},
   journal={arXiv preprint arXiv:2508.18633},
   year={2025}
}
```

## Acknowledgement

This code is based on [Wan2.1-Fun-1.3B-Inpaint](https://github.com/aigc-apps/VideoX-Fun), and some code is borrowed from [ProPainter](https://github.com/sczhou/ProPainter). Thanks for their awesome work!