arxiv:2511.15948

Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Published on Nov 20 · Submitted by Awsaf on Dec 3

Abstract

Click2Graph is an interactive framework for panoptic video scene graph generation that combines user cues with dynamic interaction discovery and semantic classification for precise and controllable scene understanding.

AI-generated summary

State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
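
To make the pipeline concrete, below is a minimal Python sketch of how a single click could flow through the stages the abstract describes. Every name here (Triplet, click2graph, the segmenter/discovery/classifier interfaces) is an illustrative assumption for exposition, not the authors' code or SAM2's actual API.

from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str    # entity class of the prompted subject
    obj: str        # entity class of a discovered interacting object
    predicate: str  # relation between them, e.g. "holding"
    frame: int      # frame index, giving the graph its temporal grounding

def click2graph(video, click_xy, segmenter, discovery, classifier):
    """Turn one user click into a temporally consistent scene graph.

    segmenter  -- promptable segmentation/tracking model (SAM2-like interface, assumed)
    discovery  -- stands in for the Dynamic Interaction Discovery Module:
                  (frame, subject mask) -> subject-conditioned object prompts
    classifier -- stands in for the Semantic Classification Head:
                  (frame, subject mask, object mask) -> entity classes + predicate
    """
    graph = []
    # 1. Segment and track the prompted subject across the whole clip.
    subject_masks = segmenter.track(video, prompt=click_xy)  # one mask per frame
    for t, subj_mask in enumerate(subject_masks):
        # 2. Discover objects the subject interacts with in this frame.
        for obj_prompt in discovery(video[t], subj_mask):
            obj_mask = segmenter.segment(video[t], prompt=obj_prompt)
            # 3. Jointly classify both entities and their predicate.
            subj_cls, obj_cls, pred = classifier(video[t], subj_mask, obj_mask)
            graph.append(Triplet(subj_cls, obj_cls, pred, frame=t))
    return graph

The design point the abstract emphasizes is that only the subject prompt comes from the user; object prompts are generated automatically, conditioned on the tracked subject.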

Community

Comment from the paper author and submitter:

Click2Graph is a framework that introduces the first interactive approach to Panoptic Video Scene Graph Generation (PVSG). Current video understanding systems are typically fully automated, so users cannot correct errors or steer the model's focus, while interactive tools such as SAM2 handle segmentation but lack semantic understanding (what an object is or how it interacts).

This method bridges that gap by letting a user provide a single visual cue, such as a click on a subject. The system then segments and tracks that subject across the video, automatically discovers the other objects it interacts with, and generates a structured scene graph describing each relationship (e.g., "person holding cup").
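
For a sense of the output, the resulting graph is a list of <subject, object, predicate> triplets grounded to frames. The values below are invented purely to illustrate the shape of the result, not actual model output or OpenPVSG labels.

# Hypothetical graph produced by one click on the person:
# the same tracked subject appears in every triplet, with one
# entry per discovered interaction per frame.
example_graph = [
    {"frame": 0, "subject": "person", "predicate": "holding", "object": "cup"},
    {"frame": 1, "subject": "person", "predicate": "holding", "object": "cup"},
    {"frame": 1, "subject": "person", "predicate": "standing on", "object": "floor"},
]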
