SAM3 Agent available through Transformers?
Just wondering if we're able to use the SAM3 Agent for complex prompting on videos through Transformers, or do I have to clone the original repo?
The SAM 3 repo only has the SAM 3 Agent for images. There's no SAM 3 Agent demo for video yet.
Hey @pnaderi4, transformers focuses on providing barebones modeling files (sam3 video is a bit of an exception in that sense), so Sam3 Agent is not implemented in Transformers, as it's more of a complex pipeline wrapping different model calls with an MLLM.
That said, it would be a cool project to try replicating Sam3 Agent using transformers models, if that's something that would interest you.
@pcuenq @pzzhang @yonigozlan, gotcha. I'm trying to do face segmentation using SAM3 and text prompts (essentially trying to blur everything except the head/face). Here is my current implementation (my first time using a SAM model btw, not sure if this is correct?):
Step 1 — Load & Downscale Video
- Load the input video.
- Downscale it to 360p to dramatically speed up SAM3 inference (≈6× faster).
- Compute the resize ratio so output masks can later be upscaled correctly.
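A minimal sketch of this step's bookkeeping, assuming OpenCV is used for decoding; the file name and the 360p target are illustrative:

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
orig_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
orig_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

target_h = 360
scale = target_h / orig_h                      # resize ratio, reused later to upscale masks
target_w = int(round(orig_w * scale))
```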
Step 2 — Parallel Frame Loading (CPU)
- Use a threaded loader to read all frames concurrently.
- Resize frames to 360p during loading.
- Convert BGR → RGB for SAM3 compatibility.
- Store resized frames in memory.
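A sketch of the loader, assuming OpenCV plus a thread pool. Since `cv2.VideoCapture` decoding is sequential, only the resize and BGR→RGB conversion are parallelized here (this reuses `target_w`/`target_h` from the previous sketch):

```python
from concurrent.futures import ThreadPoolExecutor
import cv2

def preprocess(frame):
    frame = cv2.resize(frame, (target_w, target_h), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # OpenCV reads BGR, SAM3 expects RGB

raw = []
cap = cv2.VideoCapture("input.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    raw.append(frame)
cap.release()

with ThreadPoolExecutor(max_workers=8) as pool:      # decode stays sequential; convert in parallel
    frames = list(pool.map(preprocess, raw))
```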
Step 3 — Start SAM3 Video Session
- Save the downscaled frames into a temporary low-resolution video file.
- Initialize SAM3 Video Predictor:
-- Mixed precision (BF16/FP16 depending on GPU).
-- Create a new session with the low-res temp video.
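A sketch of the parts of this step that don't depend on the exact SAM3 API: pick a mixed-precision dtype and write the downscaled frames to a temporary clip. The predictor/session construction differs between the original repo and transformers, so it is only indicated in comments rather than spelled out:

```python
import cv2
import torch

dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

tmp_path = "lowres_tmp.mp4"                                    # temporary low-res video
writer = cv2.VideoWriter(tmp_path, cv2.VideoWriter_fourcc(*"mp4v"), fps,
                         (target_w, target_h))
for f in frames:                                               # RGB frames from Step 2
    writer.write(cv2.cvtColor(f, cv2.COLOR_RGB2BGR))           # VideoWriter expects BGR
writer.release()

# with torch.autocast("cuda", dtype=dtype):
#     predictor = <build the SAM3 video predictor here, per the API you're using>
#     session   = <start a new session / init state on tmp_path>
```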
Step 4 — Handle the Prompt (Simple vs Complex)
- Check the API parameter is_complex.
- If simple (is_complex = false)
-- Directly add a text prompt to the Video Predictor (e.g., “Head”, “Face”, “Person”).
- If complex (is_complex = true)
-- Use SAM3 Agent on the first frame only to interpret the natural-language description (e.g., “person wearing red shirt”).
-- Agent returns segmentation masks.
-- Convert each mask into a centroid point prompt.
-- Add those point prompts to the Video Predictor session.
- Fallback
-- If Agent fails → revert to simple text prompt.
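The mask → centroid-point conversion in the complex branch is independent of the SAM3 API, so here is a small sketch of it (the `agent_masks` name is an assumption for whatever masks the Agent returns):

```python
import numpy as np

def mask_to_centroid(mask: np.ndarray):
    """mask: boolean HxW array -> (x, y) centroid of the foreground pixels, or None if empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return float(xs.mean()), float(ys.mean())

# points = [p for m in agent_masks if (p := mask_to_centroid(m)) is not None]
# each point is then added to the video session as a positive click on frame 0
```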
Step 5 — SAM3 Video Inference & Mask Tracking
- Run the Predictor in streaming mode.
- Process every 4th frame to speed up inference (~4×).
- The Predictor tracks the object across frames.
- Collect masks for each processed frame.
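A sketch of the frame-skipping bookkeeping only; the actual SAM3 streaming/propagation call depends on which predictor API is used and is left as a comment stub:

```python
stride = 4                                                  # process every 4th frame (~4x faster)
processed_indices = list(range(0, len(frames), stride))

masks_by_frame = {}
for idx in processed_indices:
    # masks_by_frame[idx] = <boolean mask from the SAM3 video predictor for frame idx>
    pass
```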
Step 6 — Interpolate & Upscale Masks
- Interpolate missing masks for skipped frames.
- Upscale all masks from 360p back to original video resolution.
- Convert all masks to boolean GPU tensors.
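A sketch of this step, assuming the "interpolation" is a nearest-processed-frame fill (the simplest choice; any other scheme would slot in the same place), followed by a nearest-neighbor upscale to the original resolution:

```python
import cv2
import numpy as np
import torch

full_masks = []
for idx in range(len(frames)):
    key = min(masks_by_frame, key=lambda k: abs(k - idx))        # nearest processed frame
    m = masks_by_frame[key].astype(np.uint8)
    m = cv2.resize(m, (orig_w, orig_h), interpolation=cv2.INTER_NEAREST)
    full_masks.append(torch.from_numpy(m).bool().cuda())         # boolean GPU tensor
```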
Step 7 — Load Original Video Frames to GPU
- Load all full-resolution video frames into a GPU tensor:
-- Shape: [num_frames, 3, H, W]
-- Converted to float32, normalized to [0,1].
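A sketch of building that tensor, assuming `original_frames` holds the full-resolution RGB frames as uint8 arrays; keeping the whole video in VRAM only works if it fits, otherwise this would need chunking:

```python
import numpy as np
import torch

frames_np = np.stack(original_frames)                         # [N, H, W, 3], uint8 RGB
video = torch.from_numpy(frames_np).cuda()
video = video.permute(0, 3, 1, 2).float().div_(255.0)         # -> [N, 3, H, W], float32 in [0, 1]
```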
Step 8 — Precompute Full-Video Gaussian Blur
- Apply a heavy Gaussian blur to the entire batch of frames:
-- Kernel: 75 × 75
-- Sigma: 12.5
- Store the blurred version of the video on GPU.
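A sketch using torchvision's `GaussianBlur` with the kernel/sigma from above; blurring in batches keeps peak GPU memory in check:

```python
import torch
from torchvision.transforms import GaussianBlur

blur = GaussianBlur(kernel_size=75, sigma=12.5)

blurred = torch.empty_like(video)
for start in range(0, video.shape[0], 64):
    blurred[start:start + 64] = blur(video[start:start + 64])
```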
Step 9 — Mask-Based Blending (Batch = 64)
- For each frame batch:
-- sharp_regions = original * mask
-- blurred_regions = blurred * (1 - mask)
-- final = sharp_regions + blurred_regions
- Result:
-- subject stays sharp, background is blurred.
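A sketch of the blend; the mask is expanded to `[B, 1, H, W]` so it broadcasts over the 3 color channels (`full_masks`, `video`, and `blurred` come from the earlier sketches):

```python
import torch

mask_stack = torch.stack(full_masks)                           # [N, H, W], bool
batch = 64
out = torch.empty_like(video)

for s in range(0, video.shape[0], batch):
    m = mask_stack[s:s + batch].unsqueeze(1).float()           # [B, 1, H, W]
    out[s:s + batch] = video[s:s + batch] * m + blurred[s:s + batch] * (1.0 - m)
```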
Step 10 — Write Final Video
- Move processed frames back to CPU as numpy arrays.
- Write frames into an output video using original FPS.
- Cleanup temporary files and free GPU memory.
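A sketch of the writer, assuming OpenCV again and reusing `fps`, `orig_w`, `orig_h`, and `out` from the earlier sketches:

```python
import cv2
import torch

writer = cv2.VideoWriter("output_noaudio.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (orig_w, orig_h))
for frame in out:
    rgb = (frame.clamp(0, 1) * 255).byte().permute(1, 2, 0).cpu().numpy()
    writer.write(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))          # OpenCV expects BGR
writer.release()

torch.cuda.empty_cache()                                        # free GPU memory
```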
Step 11 — Preserve Audio & Final Assembly
- Extract the original audio track (ffmpeg).
- After video processing completes, reattach the untouched audio.
- Output the final, fully synchronized video-with-audio.
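A sketch of the final mux, copying the video stream from the processed file and the audio stream from the original without re-encoding (file names are placeholders):

```python
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "output_noaudio.mp4",     # processed video (no audio)
    "-i", "input.mp4",              # original file, used only for its audio track
    "-map", "0:v:0", "-map", "1:a:0",
    "-c:v", "copy", "-c:a", "copy",
    "final_output.mp4",
], check=True)
```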