Improve model card: add pipeline tag, license, abstract, and sample usage
This PR enhances the model card for [Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation](https://huggingface.co/papers/2510.08673) by:
- Adding the `pipeline_tag: text-to-3d` to the metadata, improving discoverability on the Hugging Face Hub.
- Adding the `license: other` to the metadata, with the specific "NTU S-Lab License 1.0" detailed in the content.
- Updating the paper link to the official Hugging Face Papers page: https://huggingface.co/papers/2510.08673.
- Including the paper's abstract for a more comprehensive overview.
- Adding a "Sample Usage" section with a code snippet for camera-controllable image generation, directly sourced from the project's GitHub repository.
- Adding the main banner image from the GitHub repository for better visual context.
- Consolidating and clarifying the project links.
|
@@ -6,12 +6,28 @@ tags:
|
|
| 6 |
- understanding
|
| 7 |
- spatial intelligence
|
| 8 |
- 3D vision
|
|
|
|
|
|
|
| 9 |
---
|
| 10 |
|
| 11 |
# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
|
|
|
|
| 12 |
<p align="center">
|
| 13 |
-
|
| 14 |
-
|
| 15 |
## Model Details
|
| 16 |
|
| 17 |
Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns the **camera-centric** understanding and generation tasks in **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
|
|
@@ -27,11 +43,50 @@ Puffin is a unified camera-centric multimodal model that extends spatial awarene
|
|
| 27 |
---
|
| 28 |
|
| 29 |
### Direct Use
|
| 30 |
-
-
|
| 31 |
-
-
|
| 32 |
-
-
|
| 33 |
-
-
|
|
|
|
|
|
|
| 34 |
| 35 |
|
| 36 |
### Citation
|
| 37 |
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
|
|
|
|
| 6 |
- understanding
|
| 7 |
- spatial intelligence
|
| 8 |
- 3D vision
|
| 9 |
+
pipeline_tag: text-to-3d
|
| 10 |
+
license: other
|
| 11 |
---
|
| 12 |
|
| 13 |
# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
|
| 14 |
+
|
| 15 |
<p align="center">
|
| 16 |
+
<img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/tesear_horizon.png?raw=true" alt="Thinking with Camera" width="100%">
|
| 17 |
+
</p>
|
| 18 |
+
|
| 19 |
+
## Paper
|
| 20 |
+
This model was presented in the paper [Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation](https://huggingface.co/papers/2510.08673).
|
| 21 |
+
|
| 22 |
+
## Abstract
|
| 23 |
+
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.
|
| 24 |
+
|
| 25 |
+
## Links
|
| 26 |
+
* **Project Page**: [https://kangliao929.github.io/projects/puffin](https://kangliao929.github.io/projects/puffin)
|
| 27 |
+
* **GitHub Repository**: [https://github.com/KangLiao929/Puffin](https://github.com/KangLiao929/Puffin)
|
| 28 |
+
* **Hugging Face Space**: [https://huggingface.co/spaces/KangLiao/Puffin](https://huggingface.co/spaces/KangLiao/Puffin)
|
| 29 |
+
* **Hugging Face Dataset**: [https://huggingface.co/datasets/KangLiao/Puffin-4M](https://huggingface.co/datasets/KangLiao/Puffin-4M)
|
| 30 |
+
|
| 31 |
## Model Details
|
| 32 |
|
| 33 |
Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns the **camera-centric** understanding and generation tasks in **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
|
|
|
|
| 43 |
---
|
| 44 |
|
| 45 |
### Direct Use
|
| 46 |
+
- **Camera-centric understanding and generation** from a single image or from a text prompt paired with camera parameters; the thinking mode is supported.
|
| 47 |
+
- **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
|
| 48 |
+
- **Spatial imagination**: imagines the scene description based on an initial view and target camera configuration.
|
| 49 |
+
- **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters.
|
| 50 |
+
|
| 51 |
+
## Sample Usage
|
| 52 |
|
| 53 |
+
This section demonstrates how to generate images with camera control using Puffin-Base, based on the examples provided in the [GitHub repository](https://github.com/KangLiao929/Puffin).
|
| 54 |
+
|
| 55 |
+
First, download the model checkpoints from 🤗 [KangLiao/Puffin](https://huggingface.co/KangLiao/Puffin) and organize them in a `checkpoints` directory, for example:
|
| 56 |
+
```text
|
| 57 |
+
Puffin/
|
| 58 |
+
├── checkpoints
|
| 59 |
+
    ├── Puffin-Align.pth # provided for customized SFT
|
| 60 |
+
    ├── Puffin-Base.pth
|
| 61 |
+
    ├── Puffin-Thinking.pth
|
| 62 |
+
    ├── Puffin-Instruct.pth
|
| 63 |
+
```
|
| 64 |
+
You can use `huggingface-cli` to download the checkpoints:
|
| 65 |
+
```bash
|
| 66 |
+
# pip install -U "huggingface_hub[cli]"
|
| 67 |
+
huggingface-cli download KangLiao/Puffin --local-dir checkpoints --repo-type model
|
| 68 |
+
```
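The same download can also be done from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the four `.pth` files from the layout above end up directly inside `checkpoints/`. The helper names `download_checkpoints` and `missing_checkpoints` are illustrative and are not part of the Puffin repository.

```python
from pathlib import Path

# Checkpoint file names as listed in the directory layout above.
EXPECTED_CHECKPOINTS = (
    "Puffin-Align.pth",
    "Puffin-Base.pth",
    "Puffin-Thinking.pth",
    "Puffin-Instruct.pth",
)


def download_checkpoints(local_dir: str = "checkpoints") -> str:
    """Download the KangLiao/Puffin model repo into local_dir (needs network access)."""
    # Imported lazily so the layout check below works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="KangLiao/Puffin",
        repo_type="model",
        local_dir=local_dir,
    )


def missing_checkpoints(local_dir: str = "checkpoints") -> list[str]:
    """Return the expected checkpoint files that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]
```

Calling `download_checkpoints()` followed by `missing_checkpoints()` gives a quick check that the layout matches before running the demo scripts; an empty list means all four files were found.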
|
| 69 |
+
|
| 70 |
+
To run the camera-controllable image generation:
|
| 71 |
+
|
| 72 |
+
```shell
|
| 73 |
+
export PYTHONPATH=./:$PYTHONPATH
|
| 74 |
+
python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
|
| 75 |
+
--checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
|
| 76 |
+
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
|
| 77 |
+
-r -0.3939 -p 0.0277 -f 0.7595
|
| 78 |
+
```
|
| 79 |
+
This command generates an image based on the provided text prompt and camera parameters (roll: `-r`, pitch: `-p`, vertical field-of-view: `-f`, all in radians). The output image will be saved as `generation_result.jpg`.
|
| 80 |
+
|
| 81 |
+
To enable the thinking mode for image generation, switch to the thinking config and checkpoint and append the `--thinking` flag:
|
| 82 |
+
|
| 83 |
+
```shell
|
| 84 |
+
python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
|
| 85 |
+
--checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
|
| 86 |
+
--prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
|
| 87 |
+
-r -0.3939 -p 0.0277 -f 0.7595 \
|
| 88 |
+
--thinking
|
| 89 |
+
```
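Because `-r`, `-p`, and `-f` are given in radians, it can be convenient to specify angles in degrees and convert them when assembling the command. The sketch below is a hypothetical helper, `build_generation_command`, that is not part of the Puffin repository. It uses only the script path, config files, checkpoints, and flags that appear in the two commands above, and it assumes the script is launched from the repository root with `PYTHONPATH` set as shown.

```python
import math
import shlex


def build_generation_command(prompt, roll_deg, pitch_deg, vfov_deg,
                             output="generation_result.jpg", thinking=False):
    """Assemble the argv list for scripts/demo/generation.py from angles in degrees."""
    # Config and checkpoint pairs are copied from the two commands above.
    if thinking:
        config = "configs/pipelines/stage_3_thinking.py"
        checkpoint = "checkpoints/Puffin-Thinking.pth"
    else:
        config = "configs/pipelines/stage_2_base.py"
        checkpoint = "checkpoints/Puffin-Base.pth"

    cmd = [
        "python", "scripts/demo/generation.py", config,
        "--checkpoint", checkpoint,
        "--output", output,
        "--prompt", prompt,
        # The demo script expects radians; four decimals matches the examples above.
        "-r", f"{math.radians(roll_deg):.4f}",
        "-p", f"{math.radians(pitch_deg):.4f}",
        "-f", f"{math.radians(vfov_deg):.4f}",
    ]
    if thinking:
        cmd.append("--thinking")
    return cmd


def as_shell(cmd):
    """Render the argv list as a single shell-quoted command line."""
    return shlex.join(cmd)
```

For reference, the example values above correspond to a roll of about -22.6 degrees, a pitch of about 1.6 degrees, and a vertical field-of-view of about 43.5 degrees. The returned list can be passed to `subprocess.run(cmd, check=True)`, or printed with `as_shell(cmd)` and pasted into a terminal.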
|
| 90 |
|
| 91 |
### Citation
|
| 92 |
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX:
|