---
tags:
- text-to-video
- motion-consistency
- distillation
- lora
- diffusers
- template:diffusion-lora
widget:
- text: "Pancakes with chocolate syrup nuts and bananas stack of whole flapjack tasty breakfast\t"
output:
url: images/ezgif-3bd98eb13984e5.gif
base_model: ali-vilab/text-to-video-ms-1.7b
instance_prompt: null
license: apache-2.0
---
# MCM-Simplified
<Gallery />
## Model description
This model is a distilled version of the [Motion Consistency Model](https://github.com/yhZhai/mcm), trained on a subset of WebVid 2M with additional filtered image-caption pairs from the LAION aesthetic dataset.
## Sample Generated Videos
*Sample clips compare the teacher (ModelScope, 50 DDIM steps) against the Setup 1 and Setup 2 students at 4 inference steps, for two captions: "Worker slicing a piece of meat." and "Pancakes with chocolate syrup, nuts, and bananas." (See the gallery above.)*
## Training Details
- **Dataset:** 3,022 video-caption pairs from WebVid 2M
- **Image Pairs** (resolution filter sketched below):
  - **Setup 1:** 20K filtered LAION aesthetic images (min. resolution 450×450)
  - **Setup 2:** 7.5K filtered LAION aesthetic images (min. resolution 1024×1024)
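The resolution filtering of the LAION image-caption pairs can be reproduced with a simple dimension check. The directory layout and helper below are hypothetical, a minimal sketch rather than the actual preprocessing script:

```python
from pathlib import Path
from PIL import Image

def meets_min_resolution(path: Path, min_side: int) -> bool:
    """True if both image dimensions are at least min_side pixels."""
    with Image.open(path) as img:
        width, height = img.size
    return width >= min_side and height >= min_side

# Hypothetical local LAION aesthetic dump; use 450 for Setup 1, 1024 for Setup 2.
setup_2_images = [p for p in Path("laion_aesthetic").glob("*.jpg")
                  if meets_min_resolution(p, min_side=1024)]
```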
### Training Configurations
#### Setup 1
- LR: 5e-6, gradient accumulation: 4, max grad norm: 10
- Discriminator LR: 5e-5, discriminator loss weight: 1, R1 penalty λ: 1e-5
- EMA decay: 0.95, epochs: 7, steps: ~5,100

#### Setup 2 (Modified)
- LR: 2e-6, gradient accumulation: 16, max grad norm: 5
- Discriminator LR: 1e-6, discriminator loss weight: 0.5, R1 penalty λ: 1e-4
- EMA decay: 0.98, LR warmup: 300 steps, epochs: 10
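For reference, the two runs can be summarized as plain configuration dicts. The key names below are illustrative (they mirror common diffusers/accelerate training flags), not the exact arguments of the training script:

```python
# Hyperparameters as reported above; key names are assumptions, not the real flags.
setup_1 = {
    "learning_rate": 5e-6,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 10.0,
    "disc_learning_rate": 5e-5,
    "disc_loss_weight": 1.0,
    "lambda_r1": 1e-5,
    "ema_decay": 0.95,
    "num_epochs": 7,  # ~5,100 optimizer steps reported
}

setup_2 = {
    "learning_rate": 2e-6,
    "gradient_accumulation_steps": 16,
    "max_grad_norm": 5.0,
    "disc_learning_rate": 1e-6,
    "disc_loss_weight": 0.5,
    "lambda_r1": 1e-4,
    "ema_decay": 0.98,
    "lr_warmup_steps": 300,
    "num_epochs": 10,
}
```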
## Evaluation
### Fréchet Video Distance (FVD, lower is better)
| Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
|-------|--------|---------|---------|---------|
| **Teacher (50 DDIM Steps)** | 2954.77 | – | – | – |
| **Student - Setup 1** | 2598.15 | 2684.24 | 3082.84 | 3914.78 |
| **Student - Setup 2** | 2589.01 | 3053.35 | 3284.69 | 3930.07 |

The teacher always samples with 50 DDIM steps; its single score is listed in the first column for reference.
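FVD is the Fréchet distance between Gaussian fits of video embeddings, typically from an I3D network. Given two precomputed feature matrices, the statistic itself reduces to a few lines; the sketch below assumes the I3D feature extraction happens upstream and is not the exact evaluation code used here:

```python
import numpy as np
from scipy import linalg

def fvd(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (num_videos, dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```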
### CLIP Similarity (×100, higher is better)
| Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
|-------|--------|---------|---------|---------|
| **Teacher (50 DDIM Steps)** | 27.88 | – | – | – |
| **Student - Setup 1** | 22.55 | 25.62 | 26.86 | 27.01 |
| **Student - Setup 2** | 20.13 | 23.41 | 25.31 | 24.62 |
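CLIP similarity for video is commonly computed as the cosine similarity between the caption embedding and per-frame image embeddings, averaged over frames and scaled by 100. A minimal sketch with the standard transformers CLIP API follows; the checkpoint choice and frame-averaging scheme are assumptions about this card's metric, not confirmed details:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the card does not state which CLIP variant was used.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(frames, caption: str) -> float:
    """Mean cosine similarity between a caption and a list of PIL frames, ×100."""
    inputs = processor(text=[caption], images=frames,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).mean() * 100)
```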
## Conclusion
Setup 2 was modified to stabilize training and prevent the discriminator from overpowering the generator. These changes gave a slightly better 1-step FVD (2589.01 vs. 2598.15), while multi-step FVD regressed relative to Setup 1. Within each setup, CLIP similarity rises as the number of inference steps grows (with a small dip at 8 steps for Setup 2), indicating better text-to-video alignment at higher step counts.
## References
Original Implementation: [Motion Consistency Model](https://github.com/yhZhai/mcm)
## Download model
Weights for this model are available in Safetensors format.
[Download](/SepehrNoey/MCM-Simplified/tree/main) them in the Files & versions tab.
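A minimal inference sketch with diffusers is below. It assumes the repository's Safetensors file is a LoRA that `pipe.load_lora_weights` can consume, and that an LCM-style scheduler suits the consistency-distilled student; both are assumptions, not details confirmed by this card:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load the ModelScope teacher backbone, then apply the distilled LoRA on top.
pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")
# Few-step consistency sampling (scheduler choice is an assumption).
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("SepehrNoey/MCM-Simplified")

result = pipe(
    "Pancakes with chocolate syrup, nuts, and bananas.",
    num_inference_steps=4,  # the 4-step student setting evaluated above
    num_frames=16,
)
frames = result.frames
```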