---
license: mit
pipeline_tag: text-to-video
library_name: diffusers
---

<div align="center">

# Pulp Motion: Framing-aware multimodal camera and human motion generation

<a href="https://robincourant.github.io/info/"><strong>Robin Courant</strong></a><a href="https://triocrossing.github.io/"><strong>Xi Wang</strong></a><a href="https://davidlapous.github.io/"><strong>David Loiseaux</strong></a><a href="http://people.irisa.fr/Marc.Christie/"><strong>Marc Christie</strong></a><a href="https://vicky.kalogeiton.info/"><strong>Vicky Kalogeiton</strong></a>

[![License](https://img.shields.io/badge/License-MIT-green.svg)]()

</div>

This model was presented in the paper [Pulp Motion: Framing-aware multimodal camera and human motion generation](https://huggingface.co/papers/2510.05097).

## Abstract
Treating human motion and camera trajectory generation separately overlooks a core principle of cinematography: the tight interplay between actor performance and camera work in the screen space. In this paper, we are the first to cast this task as a text-conditioned joint generation, aiming to maintain consistent on-screen framing while producing two heterogeneous, yet intrinsically linked, modalities: human motion and camera trajectories. We propose a simple, model-agnostic framework that enforces multimodal coherence via an auxiliary modality: the on-screen framing induced by projecting human joints onto the camera. This on-screen framing provides a natural and effective bridge between modalities, promoting consistency and leading to a more precise joint distribution. We first design a joint autoencoder that learns a shared latent space, together with a lightweight linear transform from the human and camera latents to a framing latent. We then introduce auxiliary sampling, which exploits this linear transform to steer generation toward a coherent framing modality. To support this task, we also introduce the PulpMotion dataset, a human-motion and camera-trajectory dataset with rich captions and high-quality human motions. Extensive experiments across DiT- and MAR-based architectures show the generality and effectiveness of our method in generating on-frame coherent human-camera motions, while also achieving gains on textual alignment for both modalities. Our qualitative results yield more cinematographically meaningful framings, setting the new state of the art for this task.
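
The core mechanism described above, a shared latent space plus a lightweight linear map from the human and camera latents to a framing latent that is then used to steer sampling, can be sketched roughly as follows. This is an illustrative sketch, not the released implementation: the latent sizes and the names `FramingProjector` and `auxiliary_guidance_step` are assumptions, and the paper's actual sampler differs in detail.

```
import torch
import torch.nn as nn

# Assumed latent dimensions for illustration only.
D_HUMAN, D_CAMERA, D_FRAMING = 256, 64, 128


class FramingProjector(nn.Module):
    """Lightweight linear map from concatenated human/camera latents
    to a framing latent (hypothetical stand-in for the paper's transform)."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(D_HUMAN + D_CAMERA, D_FRAMING, bias=False)

    def forward(self, z_human, z_camera):
        return self.linear(torch.cat([z_human, z_camera], dim=-1))


def auxiliary_guidance_step(z_human, z_camera, z_framing_ref, projector, lr=0.1):
    """One generic guidance-style correction: nudge the human and camera
    latents so that their induced framing latent moves toward a reference
    framing latent. This is a plain gradient step, not the exact sampler."""
    z_human = z_human.detach().requires_grad_(True)
    z_camera = z_camera.detach().requires_grad_(True)
    loss = nn.functional.mse_loss(projector(z_human, z_camera), z_framing_ref)
    g_h, g_c = torch.autograd.grad(loss, (z_human, z_camera))
    return z_human - lr * g_h, z_camera - lr * g_c


# Toy usage with random tensors.
proj = FramingProjector()
z_h, z_c = torch.randn(1, D_HUMAN), torch.randn(1, D_CAMERA)
z_f_ref = torch.randn(1, D_FRAMING)
z_h, z_c = auxiliary_guidance_step(z_h, z_c, z_f_ref, proj)
```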

<div align="center">
    <a href="https://www.lix.polytechnique.fr/vista/projects/2025_pulpmotion_courant/" class="button"><b>[Webpage]</b></a> &nbsp;&nbsp;&nbsp;&nbsp;
    <a href="https://github.com/robincourant/pulp-motion" class="button"><b>[Code]</b></a> &nbsp;&nbsp;&nbsp;&nbsp;
</div>

<br/>

![Teaser](./assets/teaser.png)

---

# Setup

First, install `git lfs` by following the instructions [here](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage).


To get the data, run:
```
git clone https://huggingface.co/datasets/robin-courant/pulpmotion-models
```
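
Alternatively, the same repository can be fetched programmatically with `huggingface_hub` (a sketch; assumes `pip install huggingface_hub` and that the URL above points to a dataset repo):
```
from huggingface_hub import snapshot_download

# Download the repository referenced above into ./pulpmotion-models.
snapshot_download(
    repo_id="robin-courant/pulpmotion-models",
    repo_type="dataset",
    local_dir="pulpmotion-models",
)
```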


Prepare the dataset (untar archives):
```
cd pulpmotion-models
sh download_smpl
```