---
tags:
- text-to-video
- motion-consistency
- distillation
- lora
- diffusers
- template:diffusion-lora
widget:
- text: "Pancakes with chocolate syrup nuts and bananas stack of whole flapjack tasty breakfast\t"
output:
url: images/ezgif-3bd98eb13984e5.gif
base_model: ali-vilab/text-to-video-ms-1.7b
instance_prompt: null
license: apache-2.0
---
# MCM-Simplified
<Gallery />
## Model description
This model is a distilled version of the [Motion Consistency Model](https://github.com/yhZhai/mcm), trained on a subset of WebVid 2M with additional filtered image-caption pairs from the LAION aesthetic dataset.
## Sample Generated Videos
*Sample clips compare the teacher (ModelScope, 50 DDIM steps) against the Setup 1 and Setup 2 students at 4 inference steps, for two captions: "Worker slicing a piece of meat." and "Pancakes with chocolate syrup, nuts, and bananas." (See the gallery above.)*
## Training Details
- **Dataset:** 3,022 video-caption pairs from WebVid 2M
- **Image Pairs** (resolution filter sketched below):
  - **Setup 1:** 20K filtered LAION aesthetic images (min. resolution 450×450)
  - **Setup 2:** 7.5K filtered LAION aesthetic images (min. resolution 1024×1024)
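The resolution filtering of the LAION image-caption pairs can be reproduced with a simple dimension check. The directory layout and helper below are hypothetical, a minimal sketch rather than the actual preprocessing script:

```python
from pathlib import Path
from PIL import Image

def meets_min_resolution(path: Path, min_side: int) -> bool:
    """True if both image dimensions are at least min_side pixels."""
    with Image.open(path) as img:
        width, height = img.size
    return width >= min_side and height >= min_side

# Hypothetical local LAION aesthetic dump; use 450 for Setup 1, 1024 for Setup 2.
setup_2_images = [p for p in Path("laion_aesthetic").glob("*.jpg")
                  if meets_min_resolution(p, min_side=1024)]
```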
### Training Configurations
#### Setup 1
- LR: 5e-6, gradient accumulation: 4, max grad norm: 10
- Discriminator LR: 5e-5, discriminator loss weight: 1, R1 penalty λ: 1e-5
- EMA decay: 0.95, epochs: 7, steps: ~5,100

#### Setup 2 (Modified)
- LR: 2e-6, gradient accumulation: 16, max grad norm: 5
- Discriminator LR: 1e-6, discriminator loss weight: 0.5, R1 penalty λ: 1e-4
- EMA decay: 0.98, LR warmup: 300 steps, epochs: 10
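For reference, the two runs can be summarized as plain configuration dicts. The key names below are illustrative (they mirror common diffusers/accelerate training flags), not the exact arguments of the training script:

```python
# Hyperparameters as reported above; key names are assumptions, not the real flags.
setup_1 = {
    "learning_rate": 5e-6,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 10.0,
    "disc_learning_rate": 5e-5,
    "disc_loss_weight": 1.0,
    "lambda_r1": 1e-5,
    "ema_decay": 0.95,
    "num_epochs": 7,  # ~5,100 optimizer steps reported
}

setup_2 = {
    "learning_rate": 2e-6,
    "gradient_accumulation_steps": 16,
    "max_grad_norm": 5.0,
    "disc_learning_rate": 1e-6,
    "disc_loss_weight": 0.5,
    "lambda_r1": 1e-4,
    "ema_decay": 0.98,
    "lr_warmup_steps": 300,
    "num_epochs": 10,
}
```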
## Evaluation
### Fréchet Video Distance (FVD, lower is better)
| Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
|-------|--------|---------|---------|---------|
| **Teacher (50 DDIM Steps)** | 2954.77 | – | – | – |
| **Student - Setup 1** | 2598.15 | 2684.24 | 3082.84 | 3914.78 |
| **Student - Setup 2** | 2589.01 | 3053.35 | 3284.69 | 3930.07 |

The teacher always samples with 50 DDIM steps; its single score is listed in the first column for reference.
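FVD is the Fréchet distance between Gaussian fits of video embeddings, typically from an I3D network. Given two precomputed feature matrices, the statistic itself reduces to a few lines; the sketch below assumes the I3D feature extraction happens upstream and is not the exact evaluation code used here:

```python
import numpy as np
from scipy import linalg

def fvd(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (num_videos, dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # introduced by numerical error.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```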
### CLIP Similarity (×100, higher is better)
| Model | 1 Step | 2 Steps | 4 Steps | 8 Steps |
|-------|--------|---------|---------|---------|
| **Teacher (50 DDIM Steps)** | 27.88 | – | – | – |
| **Student - Setup 1** | 22.55 | 25.62 | 26.86 | 27.01 |
| **Student - Setup 2** | 20.13 | 23.41 | 25.31 | 24.62 |
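CLIP similarity for video is commonly computed as the cosine similarity between the caption embedding and per-frame image embeddings, averaged over frames and scaled by 100. A minimal sketch with the standard transformers CLIP API follows; the checkpoint choice and frame-averaging scheme are assumptions about this card's metric, not confirmed details:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the card does not state which CLIP variant was used.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(frames, caption: str) -> float:
    """Mean cosine similarity between a caption and a list of PIL frames, ×100."""
    inputs = processor(text=[caption], images=frames,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).mean() * 100)
```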
## Conclusion
Setup 2 was modified to stabilize training and prevent the discriminator from overpowering the generator. These changes gave a slightly better 1-step FVD (2589.01 vs. 2598.15), while multi-step FVD regressed relative to Setup 1. Within each setup, CLIP similarity rises as the number of inference steps grows (with a small dip at 8 steps for Setup 2), indicating better text-to-video alignment at higher step counts.
## References
Original Implementation: [Motion Consistency Model](https://github.com/yhZhai/mcm)
## Download model
Weights for this model are available in Safetensors format.
[Download](/SepehrNoey/MCM-Simplified/tree/main) them in the Files & versions tab.
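A minimal inference sketch with diffusers is below. It assumes the repository's Safetensors file is a LoRA that `pipe.load_lora_weights` can consume, and that an LCM-style scheduler suits the consistency-distilled student; both are assumptions, not details confirmed by this card:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load the ModelScope teacher backbone, then apply the distilled LoRA on top.
pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")
# Few-step consistency sampling (scheduler choice is an assumption).
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("SepehrNoey/MCM-Simplified")

result = pipe(
    "Pancakes with chocolate syrup, nuts, and bananas.",
    num_inference_steps=4,  # the 4-step student setting evaluated above
    num_frames=16,
)
frames = result.frames
```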