---
tags:
- unified multimodal model
- camera-centric
- generation
- understanding
- spatial intelligence
- 3D vision
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
<p align="center">
     &nbsp;&nbsp; 📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📑 <a href="https://arxiv.org/abs/2506.18903v1">Paper</a> &nbsp;&nbsp;
</p>

## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks within **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats the camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |

---

### Direct Use
- **Camera-centric understanding and generation**: works from a single image or from a text-and-camera pair, and supports the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines the scene description from an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists the insertion of virtual 3D objects into in-the-wild images by calibrating camera parameters.
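
To illustrate why calibrated camera parameters matter for virtual object insertion, here is a minimal sketch of how an estimated camera could be used to project a 3D point into an image. It is not Puffin's API. It assumes the camera is described by roll, pitch, and vertical field of view, that pixels are square with the principal point at the image center, and that rotation follows `R = Rz(roll) · Rx(pitch)`; sign and axis conventions differ between toolkits. All function names are hypothetical.

```python
import math


def intrinsics_from_vfov(vfov_deg, width, height):
    """Return (fx, fy, cx, cy) for a pinhole camera.

    Assumes square pixels and a principal point at the image center.
    """
    f = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    return f, f, width / 2.0, height / 2.0


def _matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_roll_pitch(roll_deg, pitch_deg):
    """Assumed convention: R = Rz(roll) @ Rx(pitch).

    Maps a point from a gravity-aligned frame into the camera frame
    (x right, y down, z forward).
    """
    r, p = math.radians(roll_deg), math.radians(pitch_deg)
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(p), -math.sin(p)],
          [0.0, math.sin(p), math.cos(p)]]
    rz = [[math.cos(r), -math.sin(r), 0.0],
          [math.sin(r), math.cos(r), 0.0],
          [0.0, 0.0, 1.0]]
    return _matmul(rz, rx)


def project_point(point, roll_deg, pitch_deg, vfov_deg, width, height):
    """Project a 3D point (gravity-aligned frame) to pixel coordinates."""
    fx, fy, cx, cy = intrinsics_from_vfov(vfov_deg, width, height)
    rot = rotation_from_roll_pitch(roll_deg, pitch_deg)
    x, y, z = (sum(rot[i][j] * point[j] for j in range(3)) for i in range(3))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy


if __name__ == "__main__":
    # Hypothetical calibration values for a 640x480 image.
    print(project_point((0.0, 0.0, 5.0), roll_deg=0.0, pitch_deg=0.0,
                        vfov_deg=90.0, width=640, height=480))
```

With zero roll and pitch, a point on the optical axis lands at the image center. Errors in the estimated roll, pitch, or field of view shift and distort every projected vertex, which is why accurate calibration is a prerequisite for plausible insertion.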


### Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

```bibtex
  @article{liao2025puffin,
    title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
    author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
    journal={arXiv preprint arXiv:2510.18903},
    year={2025}
  }
```

### License 
This project is licensed under [NTU S-Lab License 1.0](LICENSE).