Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
Abstract
Click2Graph is an interactive framework for panoptic video scene graph generation that combines user cues with dynamic interaction discovery and semantic classification for precise and controllable scene understanding.
State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
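To make the pipeline concrete, the sketch below outlines how the stages described in the abstract could be composed. Only the module names (Dynamic Interaction Discovery, Semantic Classification Head) come from the paper; the class, method signatures, and data fields are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Click2Graph flow: prompt -> subject masks ->
# subject-conditioned object discovery -> <subject, object, predicate> triplets.
# All names and signatures below are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UserPrompt:
    frame_idx: int                                   # frame the user interacted with
    point: Optional[Tuple[int, int]] = None          # (x, y) click, if any
    box: Optional[Tuple[int, int, int, int]] = None  # bounding-box prompt, if any


@dataclass
class Triplet:
    subject: str                 # e.g. "person"
    predicate: str               # e.g. "holding"
    object: str                  # e.g. "cup"
    frame_span: Tuple[int, int]  # temporal extent of the relation


class Click2GraphPipeline:
    """Conceptual composition of the three stages described in the abstract."""

    def segment_and_track_subject(self, video, prompt: UserPrompt):
        # Promptable segmentation and tracking (SAM2-style) would return one
        # mask per frame for the prompted subject.
        raise NotImplementedError

    def discover_interactions(self, video, subject_masks):
        # Dynamic Interaction Discovery Module: generates subject-conditioned
        # object prompts and returns candidate object mask tracks.
        raise NotImplementedError

    def classify_semantics(self, subject_masks, object_tracks) -> List[Triplet]:
        # Semantic Classification Head: joint entity and predicate reasoning
        # over subject/object features.
        raise NotImplementedError

    def __call__(self, video, prompt: UserPrompt) -> List[Triplet]:
        subject_masks = self.segment_and_track_subject(video, prompt)
        object_tracks = self.discover_interactions(video, subject_masks)
        return self.classify_semantics(subject_masks, object_tracks)
```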
Community
Click2Graph introduces the first interactive approach to Panoptic Video Scene Graph Generation (PVSG). Current video understanding systems are typically fully automated, so users cannot correct errors or steer the model's focus, while interactive tools such as SAM2 handle segmentation but lack semantic understanding (knowing what an object is or how it interacts).
This method bridges that gap by allowing a user to provide a single visual cue, such as a click on a subject. The system then segments and tracks that subject across the video, automatically discovers other objects it is interacting with, and generates a structured scene graph describing the relationship (e.g., "person holding cup").
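As a concrete illustration of that output, the snippet below shows one way such a triplet could be serialized and printed. The field names, frame indices, and dictionary layout are invented for illustration and do not reflect the OpenPVSG format or the authors' code.

```python
# Purely illustrative: a possible serialization of the relation produced for
# the "person holding cup" example above (frame values are made up).
scene_graph = [
    {"subject": "person", "predicate": "holding", "object": "cup",
     "frames": (0, 120)},
]

for rel in scene_graph:
    start, end = rel["frames"]
    print(f'{rel["subject"]} --{rel["predicate"]}--> {rel["object"]} '
          f'(frames {start}-{end})')
```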
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models (2025)
- Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning (2025)
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation (2025)
- Plan-X: Instruct Video Generation via Semantic Planning (2025)
- SNAP: Towards Segmenting Anything in Any Point Cloud (2025)
- MATRIX: Mask Track Alignment for Interaction-aware Video Generation (2025)
- PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity (2025)