SAM3 Agent available through Transformers?

#26
by pnaderi4

Just wondering if we are able to use the SAM3 Agent for complex prompting on videos through Transformers, or do I have to clone the original repo?

AI at Meta org

Hello @pnaderi4! The model card also has a few Transformers snippets here. Let us know if you need additional details on any tasks 🤗

AI at Meta org

The SAM 3 repo only has the SAM 3 Agent for images. There is no SAM 3 Agent demo for video yet.

AI at Meta org

Hey @pnaderi4, Transformers focuses on providing barebones modeling files (SAM3 video is a bit of an exception in that sense), so the SAM3 Agent is not implemented in Transformers, as it's more of a complex pipeline wrapping different model calls with an MLLM.

That said, it would be a cool project to try replicating the SAM3 Agent using Transformers models, if that's something that interests you.

@pcuenq @pzzhang @yonigozlan, gotcha. I am trying to do face segmentation using SAM3 and text prompts (essentially trying to blur everything except the head/face). Here is my current implementation (my first time using a SAM model btw, not sure if this is correct?):

Step 1 — Load & Downscale Video

  • Load the input video.
  • Downscale it to 360p to dramatically speed up SAM3 inference (≈6× faster).
  • Compute the resize ratio so output masks can later be upscaled correctly.
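A minimal sketch of this step with OpenCV (the input path and 360p target are placeholders):

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
orig_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
orig_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Resize ratio is kept so the masks can be upscaled back to full resolution later.
target_h = 360
scale = target_h / orig_h
small_w, small_h = round(orig_w * scale), target_h
```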

Step 2 — Parallel Frame Loading (CPU)

  • Use a threaded loader to read all frames concurrently.
  • Resize frames to 360p during loading.
  • Convert BGR → RGB for SAM3 compatibility.
  • Store resized frames in memory.
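Roughly like this, continuing the sketch above. Note that in this version the decoding itself stays sequential, since a single `VideoCapture` isn't safe to read from multiple threads; the thread pool only parallelizes the resize + color conversion:

```python
from concurrent.futures import ThreadPoolExecutor

import cv2

def preprocess(frame):
    # Resize to 360p and convert BGR -> RGB for SAM3.
    frame = cv2.resize(frame, (small_w, small_h), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

raw_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    raw_frames.append(frame)
cap.release()

with ThreadPoolExecutor(max_workers=8) as pool:
    frames_rgb = list(pool.map(preprocess, raw_frames))
```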

Step 3 — Start SAM3 Video Session

  • Save the downscaled frames into a temporary low-resolution video file.
  • Initialize SAM3 Video Predictor:
    -- Mixed precision (BF16/FP16 depending on GPU).
    -- Create a new session with the low-res temp video.
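A sketch of the setup. The session call is left as a commented placeholder because the exact SAM3 video predictor API differs between the original repo and the Transformers port, so substitute whichever one you are using:

```python
import os
import tempfile

import cv2
import torch

# Pick the mixed-precision dtype based on what the GPU supports.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# Dump the downscaled frames to a temporary low-res video for the session.
tmp_path = os.path.join(tempfile.mkdtemp(), "lowres.mp4")
writer = cv2.VideoWriter(tmp_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (small_w, small_h))
for f in frames_rgb:
    writer.write(cv2.cvtColor(f, cv2.COLOR_RGB2BGR))  # VideoWriter expects BGR
writer.release()

# Placeholder, not a real signature -- initialize your SAM3 video predictor here:
# session = predictor.init_session(video_path=tmp_path)
```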

Step 4 — Handle the Prompt (Simple vs Complex)

  • Check the API parameter is_complex.
  • If simple (is_complex = false)
    -- Directly add a text prompt to the Video Predictor (e.g., “Head”, “Face”, “Person”).
  • If complex (is_complex = true)
    -- Use SAM3 Agent on the first frame only to interpret the natural-language description (e.g., “person wearing red shirt”).
    -- Agent returns segmentation masks.
    -- Convert each mask into a centroid point prompt.
    -- Add those point prompts to the Video Predictor session.
  • Fallback
    -- If Agent fails → revert to simple text prompt.
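The centroid conversion is the only part of this step that doesn't depend on the Agent/predictor APIs, so here is a small sketch of just that piece. `agent_masks` is a placeholder for whatever masks the SAM3 Agent returns on the first frame:

```python
import numpy as np

def mask_to_centroid(mask: np.ndarray):
    """Turn a boolean (H, W) mask into a single (x, y) point prompt."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # empty mask -> no prompt
    return float(xs.mean()), float(ys.mean())

agent_masks = []  # placeholder: masks returned by the SAM3 Agent for frame 0
point_prompts = [p for p in (mask_to_centroid(m) for m in agent_masks) if p is not None]
```

One caveat: the centroid of a non-convex mask can land outside the object (e.g. a person bent into a C shape), so snapping to the mask pixel nearest the centroid is a slightly safer point prompt.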

Step 5 — SAM3 Video Inference & Mask Tracking

  • Run the Predictor in streaming mode.
  • Process every 4th frame to speed up inference (~4×).
  • The Predictor tracks the object across frames (see the sketch after Step 6 below).
  • Collect masks for each processed frame.

Step 6 — Interpolate & Upscale Masks

  • Interpolate missing masks for skipped frames.
  • Upscale all masks from 360p back to original video resolution.
  • Convert all masks to boolean GPU tensors.
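A sketch covering the frame stride from Step 5 plus this interpolation/upscaling step. It assumes `masks_by_frame` maps each processed frame index to an (H, W) mask from the predictor, and uses a simple hold-nearest-keyframe scheme (a temporal blend between neighbouring keyframes would look smoother):

```python
import torch
import torch.nn.functional as F

stride = 4
masks_by_frame = {}  # placeholder: populate from the Step 5 propagation loop

# Fill skipped frames with the nearest processed mask.
full_masks, last = [], None
for idx in range(len(frames_rgb)):
    m = masks_by_frame.get((idx // stride) * stride)
    if m is None:
        m = last
    last = m
    full_masks.append(m)

# Upscale from 360p back to the original resolution, as boolean GPU tensors.
masks = torch.stack([torch.as_tensor(m, dtype=torch.float32) for m in full_masks])  # [N, H, W]
masks = F.interpolate(masks.unsqueeze(1), size=(orig_h, orig_w), mode="nearest")     # [N, 1, H, W]
masks = masks.bool().cuda()
```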

Step 7 — Load Original Video Frames to GPU

  • Load all full-resolution video frames into a GPU tensor:
    -- Shape: [num_frames, 3, H, W]
    -- Converted to float32, normalized to [0,1].
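For example (assuming `orig_frames_rgb` is the list of full-resolution RGB frames, loaded the same way as in Step 2 but without the resize):

```python
import numpy as np
import torch

video = torch.from_numpy(np.stack(orig_frames_rgb))            # [N, H, W, 3], uint8
video = video.permute(0, 3, 1, 2).float().div_(255.0).cuda()   # [N, 3, H, W], float32 in [0, 1]
```

Keeping every full-resolution frame on the GPU at once can exhaust memory on long clips, so chunking this the same way as the blending batches would be safer.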

Step 8 — Precompute Full-Video Gaussian Blur

  • Apply a heavy Gaussian blur to the entire batch of frames:
    -- Kernel: 75 × 75
    -- Sigma: 12.5
  • Store blurred version of the video on GPU.
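One way to do this with torchvision's functional API, blurring in chunks so the intermediate buffers stay bounded (`video` is the tensor from Step 7):

```python
import torch
import torchvision.transforms.functional as TF

blurred = torch.empty_like(video)
for i in range(0, video.shape[0], 64):
    blurred[i:i + 64] = TF.gaussian_blur(video[i:i + 64], kernel_size=[75, 75], sigma=[12.5, 12.5])
```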

Step 9 — Mask-Based Blending (Batch = 64)

  • For each frame batch:
    -- sharp_regions = original * mask
    -- blurred_regions = blurred * (1 - mask)
    -- final = sharp_regions + blurred_regions
  • Result:
    -- subject stays sharp, background is blurred.
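With `video`, `blurred`, and the boolean `masks` from the earlier steps, the blend itself is just broadcasting:

```python
import torch

out = torch.empty_like(video)
for i in range(0, video.shape[0], 64):                # batch = 64
    m = masks[i:i + 64].float()                       # [B, 1, H, W] broadcasts over the 3 channels
    out[i:i + 64] = video[i:i + 64] * m + blurred[i:i + 64] * (1.0 - m)
```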

Step 10 — Write Final Video

  • Move processed frames back to CPU as numpy arrays.
  • Write frames into an output video using original FPS.
  • Cleanup temporary files and free GPU memory.
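For instance with OpenCV's writer (the output file name is a placeholder):

```python
import cv2

writer = cv2.VideoWriter("blurred_no_audio.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                         fps, (orig_w, orig_h))
for frame in out:
    # [3, H, W] float in [0, 1] -> uint8 HWC, then RGB -> BGR for OpenCV
    rgb = (frame.clamp(0, 1) * 255).byte().permute(1, 2, 0).contiguous().cpu().numpy()
    writer.write(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
writer.release()
```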

Step 11 — Preserve Audio & Final Assembly

  • Extract the original audio track (ffmpeg).
  • After video processing completes, reattach the untouched audio.
  • Output the final, fully synchronized video-with-audio.
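Something like this with ffmpeg via subprocess (paths are placeholders; `-acodec copy` assumes the source audio codec fits the output container, otherwise re-encode instead of copying):

```python
import subprocess

# Extract the untouched audio track from the original video.
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-acodec", "copy", "audio.m4a"],
               check=True)

# Mux it back onto the processed (silent) video without re-encoding either stream.
subprocess.run(["ffmpeg", "-y", "-i", "blurred_no_audio.mp4", "-i", "audio.m4a",
                "-c:v", "copy", "-c:a", "copy",
                "-map", "0:v:0", "-map", "1:a:0", "-shortest",
                "final_with_audio.mp4"], check=True)
```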
