---
tags:
- unified multimodal model
- camera-centric
- generation
- understanding
- spatial intelligence
- 3D vision
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**

📖 Project Page | 🖥️ GitHub | 🤗 Hugging Face | 📑 Paper
## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns **camera-centric** understanding and generation tasks within **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats the camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.

| | |
|---|---|
| **Developed by** | Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy |
| **Affiliation** | S-Lab, Nanyang Technological University |
| **First released** | arXiv pre-print, 2025 |
| **Model type** | Unified multimodal model (diffusion / autoregressive modelling with camera-centric understanding and generation) |
| **Modality** | Image → Text+Camera; Text+Camera → Image; Image+Camera → Image; Image+Camera → Text |

---

### Direct Use

- **Camera-centric understanding and generation** from a single image or a text-camera pair, with support for the thinking mode.
- **World exploration**: generates a cross-view image from a given initial view and a target camera configuration.
- **Spatial imagination**: imagines the scene description from an initial view and a target camera configuration.
- **3D virtual object insertion** in AR/VR: assists the insertion of virtual 3D objects into in-the-wild images by calibrating camera parameters.

### Citation

If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:

```bibtex
@article{liao2025puffin,
  title={Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation},
  author={Liao, Kang and Wu, Size and Wu, Zhonghua and Jin, Linyi and Wang, Chao and Wang, Yikai and Wang, Fei and Li, Wei and Loy, Chen Change},
  journal={arXiv preprint arXiv:2510.18903},
  year={2025}
}
```

### License

This project is licensed under the [NTU S-Lab License 1.0](LICENSE).