SAM3 Agent available through Transformers?

#26
by pnaderi4

Just wondering if we are able to use the SAM3 Agent for complex prompting on videos through Transformers, or do I have to clone the original repo?

AI at Meta org

Hello @pnaderi4! The model card also has a few Transformers snippets here. Let us know if you need additional details on any tasks 🤗

AI at Meta org

The SAM 3 repo only has the SAM 3 Agent for images. There is no SAM 3 Agent demo for video yet.

AI at Meta org

Hey @pnaderi4, Transformers focuses on providing barebones modeling files (SAM3 video is a bit of an exception in that sense), so the SAM3 Agent is not implemented in Transformers, as it's more of a complex pipeline wrapping different model calls with an MLLM.

That said, it would be a cool project to try replicating the SAM3 Agent using Transformers models, if that's something that interests you.

@pcuenq @pzzhang @yonigozlan, gotcha. I am trying to do face segmentation using SAM3 and text prompts (essentially trying to blur everything except the head/face). Here is my current implementation (my first time using a SAM model btw, not sure if this is correct?):

Step 1 — Load & Downscale Video

  • Load the input video.
  • Downscale it to 360p to dramatically speed up SAM3 inference (≈6× faster).
  • Compute the resize ratio so output masks can later be upscaled correctly.
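A minimal sketch of this step with OpenCV (the input path and 360p target are placeholders):

```python
import cv2

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
orig_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
orig_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Resize ratio is kept so the masks can be upscaled back to full resolution later.
target_h = 360
scale = target_h / orig_h
small_w, small_h = round(orig_w * scale), target_h
```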

Step 2 — Parallel Frame Loading (CPU)

  • Use a threaded loader to read all frames concurrently.
  • Resize frames to 360p during loading.
  • Convert BGR → RGB for SAM3 compatibility.
  • Store resized frames in memory.
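Roughly like this, continuing the sketch above. Note that in this version the decoding itself stays sequential, since a single `VideoCapture` isn't safe to read from multiple threads; the thread pool only parallelizes the resize + color conversion:

```python
from concurrent.futures import ThreadPoolExecutor

import cv2

def preprocess(frame):
    # Resize to 360p and convert BGR -> RGB for SAM3.
    frame = cv2.resize(frame, (small_w, small_h), interpolation=cv2.INTER_AREA)
    return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

raw_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    raw_frames.append(frame)
cap.release()

with ThreadPoolExecutor(max_workers=8) as pool:
    frames_rgb = list(pool.map(preprocess, raw_frames))
```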

Step 3 — Start SAM3 Video Session

  • Save the downscaled frames into a temporary low-resolution video file.
  • Initialize SAM3 Video Predictor:
    -- Mixed precision (BF16/FP16 depending on GPU).
    -- Create a new session with the low-res temp video.
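A sketch of the setup. The session call is left as a commented placeholder because the exact SAM3 video predictor API differs between the original repo and the Transformers port, so substitute whichever one you are using:

```python
import os
import tempfile

import cv2
import torch

# Pick the mixed-precision dtype based on what the GPU supports.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

# Dump the downscaled frames to a temporary low-res video for the session.
tmp_path = os.path.join(tempfile.mkdtemp(), "lowres.mp4")
writer = cv2.VideoWriter(tmp_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (small_w, small_h))
for f in frames_rgb:
    writer.write(cv2.cvtColor(f, cv2.COLOR_RGB2BGR))  # VideoWriter expects BGR
writer.release()

# Placeholder, not a real signature -- initialize your SAM3 video predictor here:
# session = predictor.init_session(video_path=tmp_path)
```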

Step 4 — Handle the Prompt (Simple vs Complex)

  • Check the API parameter is_complex.
  • If simple (is_complex = false)
    -- Directly add a text prompt to the Video Predictor (e.g., “Head”, “Face”, “Person”).
  • If complex (is_complex = true)
    -- Use SAM3 Agent on the first frame only to interpret the natural-language description (e.g., “person wearing red shirt”).
    -- Agent returns segmentation masks.
    -- Convert each mask into a centroid point prompt.
    -- Add those point prompts to the Video Predictor session.
  • Fallback
    -- If Agent fails → revert to simple text prompt.
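The centroid conversion is the only part of this step that doesn't depend on the Agent/predictor APIs, so here is a small sketch of just that piece. `agent_masks` is a placeholder for whatever masks the SAM3 Agent returns on the first frame:

```python
import numpy as np

def mask_to_centroid(mask: np.ndarray):
    """Turn a boolean (H, W) mask into a single (x, y) point prompt."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # empty mask -> no prompt
    return float(xs.mean()), float(ys.mean())

agent_masks = []  # placeholder: masks returned by the SAM3 Agent for frame 0
point_prompts = [p for p in (mask_to_centroid(m) for m in agent_masks) if p is not None]
```

One caveat: the centroid of a non-convex mask can land outside the object (e.g. a person bent into a C shape), so snapping to the mask pixel nearest the centroid is a slightly safer point prompt.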

Step 5 — SAM3 Video Inference & Mask Tracking

  • Run the Predictor in streaming mode.
  • Process every 4th frame to speed up inference (~4×).
  • The Predictor tracks the object across frames (see the sketch after Step 6 below).
  • Collect masks for each processed frame.

Step 6 — Interpolate & Upscale Masks

  • Interpolate missing masks for skipped frames.
  • Upscale all masks from 360p back to original video resolution.
  • Convert all masks to boolean GPU tensors.
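A sketch covering the frame stride from Step 5 plus this interpolation/upscaling step. It assumes `masks_by_frame` maps each processed frame index to an (H, W) mask from the predictor, and uses a simple hold-nearest-keyframe scheme (a temporal blend between neighbouring keyframes would look smoother):

```python
import torch
import torch.nn.functional as F

stride = 4
masks_by_frame = {}  # placeholder: populate from the Step 5 propagation loop

# Fill skipped frames with the nearest processed mask.
full_masks, last = [], None
for idx in range(len(frames_rgb)):
    m = masks_by_frame.get((idx // stride) * stride)
    if m is None:
        m = last
    last = m
    full_masks.append(m)

# Upscale from 360p back to the original resolution, as boolean GPU tensors.
masks = torch.stack([torch.as_tensor(m, dtype=torch.float32) for m in full_masks])  # [N, H, W]
masks = F.interpolate(masks.unsqueeze(1), size=(orig_h, orig_w), mode="nearest")     # [N, 1, H, W]
masks = masks.bool().cuda()
```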

Step 7 — Load Original Video Frames to GPU

  • Load all full-resolution video frames into a GPU tensor:
    -- Shape: [num_frames, 3, H, W]
    -- Converted to float32, normalized to [0,1].
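For example (assuming `orig_frames_rgb` is the list of full-resolution RGB frames, loaded the same way as in Step 2 but without the resize):

```python
import numpy as np
import torch

video = torch.from_numpy(np.stack(orig_frames_rgb))            # [N, H, W, 3], uint8
video = video.permute(0, 3, 1, 2).float().div_(255.0).cuda()   # [N, 3, H, W], float32 in [0, 1]
```

Keeping every full-resolution frame on the GPU at once can exhaust memory on long clips, so chunking this the same way as the blending batches would be safer.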

Step 8 — Precompute Full-Video Gaussian Blur

  • Apply a heavy Gaussian blur to the entire batch of frames:
    -- Kernel: 75 × 75
    -- Sigma: 12.5
  • Store blurred version of the video on GPU.
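One way to do this with torchvision's functional API, blurring in chunks so the intermediate buffers stay bounded (`video` is the tensor from Step 7):

```python
import torch
import torchvision.transforms.functional as TF

blurred = torch.empty_like(video)
for i in range(0, video.shape[0], 64):
    blurred[i:i + 64] = TF.gaussian_blur(video[i:i + 64], kernel_size=[75, 75], sigma=[12.5, 12.5])
```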

Step 9 — Mask-Based Blending (Batch = 64)

  • For each frame batch:
    -- sharp_regions = original * mask
    -- blurred_regions = blurred * (1 - mask)
    -- final = sharp_regions + blurred_regions
  • Result:
    -- subject stays sharp, background is blurred.
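With `video`, `blurred`, and the boolean `masks` from the earlier steps, the blend itself is just broadcasting:

```python
import torch

out = torch.empty_like(video)
for i in range(0, video.shape[0], 64):                # batch = 64
    m = masks[i:i + 64].float()                       # [B, 1, H, W] broadcasts over the 3 channels
    out[i:i + 64] = video[i:i + 64] * m + blurred[i:i + 64] * (1.0 - m)
```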

Step 10 — Write Final Video

  • Move processed frames back to CPU as numpy arrays.
  • Write frames into an output video using original FPS.
  • Cleanup temporary files and free GPU memory.
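For instance with OpenCV's writer (the output file name is a placeholder):

```python
import cv2

writer = cv2.VideoWriter("blurred_no_audio.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                         fps, (orig_w, orig_h))
for frame in out:
    # [3, H, W] float in [0, 1] -> uint8 HWC, then RGB -> BGR for OpenCV
    rgb = (frame.clamp(0, 1) * 255).byte().permute(1, 2, 0).contiguous().cpu().numpy()
    writer.write(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
writer.release()
```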

Step 11 — Preserve Audio & Final Assembly

  • Extract the original audio track (ffmpeg).
  • After video processing completes, reattach the untouched audio.
  • Output the final, fully synchronized video-with-audio.
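Something like this with ffmpeg via subprocess (paths are placeholders; `-acodec copy` assumes the source audio codec fits the output container, otherwise re-encode instead of copying):

```python
import subprocess

# Extract the untouched audio track from the original video.
subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-vn", "-acodec", "copy", "audio.m4a"],
               check=True)

# Mux it back onto the processed (silent) video without re-encoding either stream.
subprocess.run(["ffmpeg", "-y", "-i", "blurred_no_audio.mp4", "-i", "audio.m4a",
                "-c:v", "copy", "-c:a", "copy",
                "-map", "0:v:0", "-map", "1:a:0", "-shortest",
                "final_with_audio.mp4"], check=True)
```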
