---
tags:
- unified multimodal model
- camera-centric
- generation
- understanding
- spatial intelligence
- 3D vision
pipeline_tag: text-to-3d
license: other
---
# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
## Paper
This model was presented in the paper [Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation](https://huggingface.co/papers/2510.08673).
## Abstract
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.
## Links
* **Project Page**: [https://kangliao929.github.io/projects/puffin](https://kangliao929.github.io/projects/puffin)
* **GitHub Repository**: [https://github.com/KangLiao929/Puffin](https://github.com/KangLiao929/Puffin)
* **Hugging Face Space**: [https://huggingface.co/spaces/KangLiao/Puffin](https://huggingface.co/spaces/KangLiao/Puffin)
* **Hugging Face Dataset**: [https://huggingface.co/datasets/KangLiao/Puffin-4M](https://huggingface.co/datasets/KangLiao/Puffin-4M)
## Model Details
Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation in **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
---
### Direct Use
- **Camera-centric understanding and generation** from a single image, or from a text prompt paired with a camera configuration; supports the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines a scene description based on an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters (see the sketch below).
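For the AR/VR use case, the calibrated parameters can be turned into a standard pinhole camera model. Below is a minimal sketch, not part of the Puffin codebase: square pixels, a centered principal point, and the axis conventions are all assumptions on our side.

```python
import numpy as np

def camera_from_params(roll, pitch, vfov, width, height):
    """Hypothetical helper: build pinhole intrinsics K and a rotation R
    from roll / pitch / vertical FoV (all in radians), e.g. as estimated
    by Puffin's understanding mode."""
    # Focal length in pixels from the vertical field of view.
    fy = height / (2.0 * np.tan(vfov / 2.0))
    fx = fy  # square-pixel assumption
    K = np.array([[fx, 0.0, width / 2.0],
                  [0.0, fy, height / 2.0],
                  [0.0, 0.0, 1.0]])

    # Illustrative convention: pitch about the x-axis, then roll about the z-axis.
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0,  cp, -sp],
                        [0.0,  sp,  cp]])
    R_roll = np.array([[cr, -sr, 0.0],
                       [sr,  cr, 0.0],
                       [0.0, 0.0, 1.0]])
    return K, R_roll @ R_pitch

K, R = camera_from_params(roll=-0.3939, pitch=0.0277, vfov=0.7595,
                          width=512, height=512)
```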
## Sample Usage
This section demonstrates how to generate images with camera control using Puffin-Base, based on the examples provided in the [GitHub repository](https://github.com/KangLiao929/Puffin).
First, download the model checkpoints from 🤗 [KangLiao/Puffin](https://huggingface.co/KangLiao/Puffin) and organize them in a `checkpoints` directory, for example:
```text
Puffin/
└── checkpoints/
    ├── Puffin-Align.pth      # provided for customized SFT
    ├── Puffin-Base.pth
    ├── Puffin-Thinking.pth
    └── Puffin-Instruct.pth
```
You can use `huggingface-cli` to download the checkpoints:
```bash
# pip install -U "huggingface_hub[cli]"
huggingface-cli download KangLiao/Puffin --local-dir checkpoints --repo-type model
```
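Alternatively, the same download can be done from Python with the `huggingface_hub` API:

```python
from huggingface_hub import snapshot_download

# Downloads the full KangLiao/Puffin repository into ./checkpoints,
# mirroring the huggingface-cli command above.
snapshot_download(repo_id="KangLiao/Puffin", local_dir="checkpoints")
```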
To run the camera-controllable image generation:
```shell
export PYTHONPATH=./:$PYTHONPATH
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
--checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
-r -0.3939 -p 0.0277 -f 0.7595
```
This command generates an image based on the provided text prompt and camera parameters (roll: `-r`, pitch: `-p`, vertical field-of-view: `-f`, all in radians). The output image will be saved as `generation_result.jpg`.
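The example flags correspond to roughly -22.6° roll, 1.6° pitch, and a 43.5° vertical field of view. If you prefer to think in degrees, a small helper (ours, not part of the repository) converts them to the radian values the flags expect:

```python
import math

def to_radians(roll_deg, pitch_deg, vfov_deg):
    """Convert degree-valued camera parameters to the radians
    expected by the -r / -p / -f flags of generation.py."""
    return (math.radians(roll_deg),
            math.radians(pitch_deg),
            math.radians(vfov_deg))

r, p, f = to_radians(-22.57, 1.59, 43.52)
print(f"-r {r:.4f} -p {p:.4f} -f {f:.4f}")  # approximately the example flags above
```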
To enable the thinking mode for image generation, switch to the thinking config and checkpoint and append the `--thinking` flag:
```shell
python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
--checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
-r -0.3939 -p 0.0277 -f 0.7595 \
--thinking
```
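To see how a single camera parameter shapes the output, you can sweep it in a loop. A minimal sketch follows; the sweep values and output names are illustrative, while the flags are the documented ones above:

```python
import os
import subprocess

PROMPT = ("A streetlamp casts light on an outdoor mural with intricate "
          "floral designs and text, set against a building wall.")

# Make the repository importable, as `export PYTHONPATH=./:$PYTHONPATH` does above.
env = dict(os.environ, PYTHONPATH="./" + os.pathsep + os.environ.get("PYTHONPATH", ""))

# Sweep the roll angle while keeping pitch and vertical FoV fixed.
for i, roll in enumerate([-0.6, -0.3, 0.0, 0.3, 0.6]):
    subprocess.run([
        "python", "scripts/demo/generation.py",
        "configs/pipelines/stage_2_base.py",
        "--checkpoint", "checkpoints/Puffin-Base.pth",
        "--output", f"roll_sweep_{i}.jpg",
        "--prompt", PROMPT,
        "-r", str(roll), "-p", "0.0277", "-f", "0.7595",
    ], check=True, env=env)
```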
### Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
```bibtex
@article{liao2025puffin,
  title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
  author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
  journal={arXiv preprint arXiv:2510.08673},
  year={2025}
}
```
### License
This project is licensed under [NTU S-Lab License 1.0](LICENSE).