nielsr (HF Staff) committed · verified
Commit 4b394de · Parent: 883d0a8

Improve model card: add pipeline tag, license, abstract, and sample usage


This PR enhances the model card for [Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation](https://huggingface.co/papers/2510.08673) by:

- Adding `pipeline_tag: text-to-3d` to the metadata, improving discoverability on the Hugging Face Hub.
- Adding `license: other` to the metadata, with the specific "NTU S-Lab License 1.0" detailed in the content.
- Updating the paper link to the official Hugging Face Papers page: https://huggingface.co/papers/2510.08673.
- Including the paper's abstract for a more comprehensive overview.
- Adding a "Sample Usage" section with a code snippet for camera-controllable image generation, directly sourced from the project's GitHub repository.
- Adding the main banner image from the GitHub repository for better visual context.
- Consolidating and clarifying the project links.

Files changed (1): README.md (+61 -6)

README.md CHANGED
@@ -6,12 +6,28 @@ tags:
- understanding
- spatial intelligence
- 3D vision
+ pipeline_tag: text-to-3d
+ license: other
---

# **Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation**
+
<p align="center">
- &nbsp&nbsp 📖 <a href="https://kangliao929.github.io/projects/puffin">Project Page</a>&nbsp&nbsp| &nbsp&nbsp 🖥️ <a href="https://github.com/KangLiao929/Puffin">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/spaces/KangLiao/Puffin">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2510.08673">Paper </a> &nbsp&nbsp
- <br>
+ <img src="https://github.com/KangLiao929/Puffin/blob/main/assets/website/tesear_horizon.png?raw=true" alt="Thinking with Camera" width="100%">
+ </p>
+
+ ## Paper
+ This model was presented in the paper [Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation](https://huggingface.co/papers/2510.08673).
+
+ ## Abstract
+ Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance.
+
+ ## Links
+ * **Project Page**: [https://kangliao929.github.io/projects/puffin](https://kangliao929.github.io/projects/puffin)
+ * **GitHub Repository**: [https://github.com/KangLiao929/Puffin](https://github.com/KangLiao929/Puffin)
+ * **Hugging Face Space**: [https://huggingface.co/spaces/KangLiao/Puffin](https://huggingface.co/spaces/KangLiao/Puffin)
+ * **Hugging Face Dataset**: [https://huggingface.co/datasets/KangLiao/Puffin-4M](https://huggingface.co/datasets/KangLiao/Puffin-4M)
+
## Model Details

Puffin is a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. It learns the **camera-centric** understanding and generation tasks in **a unified multimodal framework**. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling **thinking with camera**. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context.
@@ -27,11 +43,50 @@ Puffin is a unified camera-centric multimodal model that extends spatial awarene
---

### Direct Use
- - **Camera-centric understanding and generation** from a single image or a pair of text and camera, supports the thinking mode.
- - **World exploration**: performs the cross-view generation from a given initial view and target camera configuration.
- - **Spatial imagination**: imagines the scene description based on an initial view and target camera configuration.
- - **3D virtual object insertion** in AR/VR: assits the virtual 3D object insertion into in-the-wild images by calibrating camera parameters
+ - **Camera-centric understanding and generation** from a single image or a text-camera pair; supports the thinking mode.
+ - **World exploration**: performs cross-view generation from a given initial view and a target camera configuration.
+ - **Spatial imagination**: imagines the scene description based on an initial view and a target camera configuration.
+ - **3D virtual object insertion** in AR/VR: assists virtual 3D object insertion into in-the-wild images by calibrating camera parameters.
+
+ ## Sample Usage

+ This section demonstrates how to generate images with camera control using Puffin-Base, based on the examples provided in the [GitHub repository](https://github.com/KangLiao929/Puffin).
+
+ First, download the model checkpoints from 🤗 [KangLiao/Puffin](https://huggingface.co/KangLiao/Puffin) and organize them in a `checkpoints` directory, for example:
+ ```text
+ Puffin/
+ └── checkpoints
+     ├── Puffin-Align.pth  # provided for customized SFT
+     ├── Puffin-Base.pth
+     ├── Puffin-Thinking.pth
+     └── Puffin-Instruct.pth
+ ```
+ You can use `huggingface-cli` to download the checkpoints:
+ ```bash
+ # pip install -U "huggingface_hub[cli]"
+ huggingface-cli download KangLiao/Puffin --local-dir checkpoints --repo-type model
+ ```
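
If you prefer Python over the CLI, the same checkpoints can be fetched with the `huggingface_hub` API. A minimal sketch, assuming only that `huggingface_hub` is installed; the target directory mirrors the CLI call above:

```python
# Minimal sketch: fetch the Puffin checkpoints with the huggingface_hub
# Python API, equivalent to the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="KangLiao/Puffin",
    repo_type="model",
    local_dir="checkpoints",
)
```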
+
+ To run camera-controllable image generation:
+
+ ```shell
+ export PYTHONPATH=./:$PYTHONPATH
+ python scripts/demo/generation.py configs/pipelines/stage_2_base.py \
+   --checkpoint checkpoints/Puffin-Base.pth --output generation_result.jpg \
+   --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
+   -r -0.3939 -p 0.0277 -f 0.7595
+ ```
+ This command generates an image from the provided text prompt and camera parameters (roll: `-r`, pitch: `-p`, vertical field-of-view: `-f`, all in radians). The output image is saved as `generation_result.jpg`.
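
The flags take radians directly; if it is easier to reason in degrees, a standard-library conversion gets you there. The degree values below are illustrative choices that approximate the flags used above, not values from the repository:

```python
import math

# Illustrative degree values that roughly reproduce the radian flags above.
roll_deg, pitch_deg, vfov_deg = -22.6, 1.6, 43.5

flags = "-r {:.4f} -p {:.4f} -f {:.4f}".format(
    math.radians(roll_deg),   # roll
    math.radians(pitch_deg),  # pitch
    math.radians(vfov_deg),   # vertical field-of-view
)
print(flags)  # -r -0.3944 -p 0.0279 -f 0.7592
```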
+
+ To enable the thinking mode for image generation, switch to the thinking config and checkpoint and append the `--thinking` flag:
+
+ ```shell
+ python scripts/demo/generation.py configs/pipelines/stage_3_thinking.py \
+   --checkpoint checkpoints/Puffin-Thinking.pth --output generation_result_thinking.jpg \
+   --prompt "A streetlamp casts light on an outdoor mural with intricate floral designs and text, set against a building wall." \
+   -r -0.3939 -p 0.0277 -f 0.7595 \
+   --thinking
+ ```
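
To render the same prompt under several camera configurations, the demo script can be driven from a short loop. A sketch under the assumption that you run it from the repository root with the `checkpoints/` layout above; each invocation reloads the model, so this is convenient rather than fast:

```python
import os
import subprocess

# Sweep the roll angle while keeping pitch and vertical FoV fixed (radians).
prompt = ("A streetlamp casts light on an outdoor mural with intricate "
          "floral designs and text, set against a building wall.")

for i, roll in enumerate([-0.4, 0.0, 0.4]):
    subprocess.run(
        [
            "python", "scripts/demo/generation.py",
            "configs/pipelines/stage_2_base.py",
            "--checkpoint", "checkpoints/Puffin-Base.pth",
            "--output", f"generation_roll_{i}.jpg",
            "--prompt", prompt,
            "-r", str(roll), "-p", "0.0277", "-f", "0.7595",
        ],
        check=True,
        env={**os.environ, "PYTHONPATH": "./"},  # same as the export above
    )
```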

### Citation
If you find Puffin useful for your research or applications, please cite our paper using the following BibTeX: