---
tags:
- unified multimodal model
- camera-centric
- generation
- understanding
- spatial intelligence
- 3D vision
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
📖 Project Page | 🖥️ GitHub | 🤗 Hugging Face | 📑 Paper
## Model Details
Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks within **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats the camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
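To make the "camera as language" idea concrete, the sketch below maps numeric camera parameters to coarse photographic terms. It is purely illustrative: the `Camera` fields (roll, pitch, vertical field of view), the thresholds, and the vocabulary are assumptions chosen for this example and are not taken from Puffin's implementation or training data.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical camera description; all angles are in degrees."""
    roll_deg: float   # rotation about the optical axis
    pitch_deg: float  # positive tilts the camera up, negative tilts it down
    vfov_deg: float   # vertical field of view


def describe_camera(cam: Camera) -> str:
    """Translate numeric parameters into photographic terms.

    The thresholds below are illustrative assumptions, not Puffin's.
    """
    roll = "level horizon" if abs(cam.roll_deg) < 5 else "Dutch angle"

    if cam.pitch_deg > 10:
        pitch = "tilt-up shot"
    elif cam.pitch_deg < -10:
        pitch = "tilt-down shot"
    else:
        pitch = "eye-level shot"

    if cam.vfov_deg < 30:
        fov = "telephoto view"
    elif cam.vfov_deg > 70:
        fov = "wide-angle view"
    else:
        fov = "standard view"

    return f"{roll}, {pitch}, {fov}"


if __name__ == "__main__":
    print(describe_camera(Camera(roll_deg=20.0, pitch_deg=-30.0, vfov_deg=90.0)))
```

A text description of this kind can sit alongside an image caption in a prompt, which is the sense in which camera parameters become part of the language stream.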
| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal models (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |
---
### Direct Use
- **Camera-centric understanding and generation** from a single image or a text and camera pair, with support for the thinking mode.
- **World exploration**: performs cross-view generation from a given initial view and target camera configuration.
- **Spatial imagination**: imagines the scene description based on an initial view and target camera configuration.
- **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters.
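
For the object-insertion use case, a renderer usually needs a pinhole intrinsics matrix rather than a field-of-view angle. The sketch below shows the standard conversion. It assumes the calibrated output includes a vertical field of view (this card does not specify the output format), square pixels, and a principal point at the image center. The function name `intrinsics_from_vfov` is hypothetical and not part of Puffin's code.

```python
import math
from typing import List


def intrinsics_from_vfov(vfov_deg: float, width: int, height: int) -> List[List[float]]:
    """Build a 3x3 pinhole intrinsics matrix from a vertical field of view.

    Assumes square pixels and a principal point at the image center.
    """
    # Focal length in pixels: half the image height over tan(half the vertical FoV).
    f = (height / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    return [
        [f, 0.0, width / 2.0],
        [0.0, f, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    for row in intrinsics_from_vfov(vfov_deg=60.0, width=640, height=480):
        print(row)
```

A narrower field of view yields a longer focal length in pixels, so an inserted object rendered with these intrinsics follows the same perspective as the photograph.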
### Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
```bibtex
@article{liao2025puffin,
title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
journal={arXiv preprint arXiv:2510.18903},
year={2025}
}
```
### License
This project is licensed under [NTU S-Lab License 1.0](LICENSE).