To achieve real-time background subtraction on a webcam feed with Meta's SAM 2 model, automatically detecting and masking the person closest to the camera without manual input, here is a step-by-step outline of one possible approach:
Before using SAM 2 for segmentation, you can employ a lightweight object detection model such as YOLO (You Only Look Once) or MediaPipe, both of which are efficient enough for real-time webcam feeds. These models automatically generate bounding boxes for detected humans, letting you identify the one closest to the camera based on box size.
- YOLOv5/v8: Efficient for real-time detection and supports bounding box output for humans.
- MediaPipe: Specifically optimized for human pose estimation and detection in real-time environments. This could be used to detect and track the closest human in the feed.
Use the bounding box output from the object detection step to locate the human. Typically, the person closest to the camera will have the largest bounding box in terms of pixel area. You can programmatically select this largest bounding box to focus on the closest person.
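That selection step can be sketched as a small helper. The `(x1, y1, x2, y2)` detection format here is an assumption for illustration; real detectors such as YOLOv8 return person boxes in a similar corner-coordinate form:

```python
def closest_person_bbox(detections):
    """Pick the bounding box with the largest pixel area,
    treating the largest detected person as the closest one.

    detections: list of (x1, y1, x2, y2) boxes for the 'person' class.
    Returns the largest box, or None if no person was detected.
    """
    if not detections:
        return None
    return max(detections, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

# Example: three detected people; the 300x400-pixel box wins
boxes = [(10, 10, 60, 110), (100, 50, 400, 450), (500, 20, 550, 90)]
print(closest_person_bbox(boxes))  # → (100, 50, 400, 450)
```

Note that box area is only a proxy for distance; it works well when people are of roughly similar height, but a depth estimate (or MediaPipe's landmark depth) would be more robust.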
Once the bounding box for the detected human is generated, you can feed this into SAM 2 as a prompt. SAM 2 supports bounding box inputs and can segment the object (in this case, the human) within the box efficiently.
SAM 2 will output a mask based on the bounding box. This mask can be converted into a black-and-white binary image, where the human is white (foreground), and the rest of the frame is black (background). This mask can then be applied over the webcam feed for background subtraction.
Using OpenCV or a similar library, apply the mask to the webcam feed in real-time. This will keep only the segmented human visible, with the background either removed or replaced as needed.
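A minimal NumPy sketch of that masking step, with background replacement as an option (the frame and mask shapes here are toy values; in the real pipeline they come from the webcam and SAM 2):

```python
import numpy as np

def apply_mask(frame, mask_binary, background=None):
    """Keep the person (mask == 255) and remove or replace the background.

    frame:       H x W x 3 uint8 image
    mask_binary: H x W uint8, 255 = person, 0 = background
    background:  optional H x W x 3 replacement image; black if None
    """
    if background is None:
        background = np.zeros_like(frame)
    keep = mask_binary.astype(bool)[..., None]  # H x W x 1, broadcasts over channels
    return np.where(keep, frame, background)

# Toy example: 2x2 frame, top row masked as person
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[255, 255], [0, 0]], dtype=np.uint8)
out = apply_mask(frame, mask)
print(out[0, 0], out[1, 0])  # person pixels kept, background pixels zeroed
```

`cv2.bitwise_and(frame, frame, mask=mask_binary)` achieves the same removal effect; the `np.where` form makes it easy to swap in a virtual background instead of black.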
Here is a basic outline of how the code might look using Python, OpenCV, and a pre-trained SAM 2 model:
```python
import cv2  # OpenCV for webcam capture and mask application

# Placeholder imports: substitute your actual SAM 2 and detector wrappers
from some_sam2_model import SAM2Model  # Load SAM 2 model
from yolo_model import YOLOModel       # Load YOLO or similar detection model

yolo_model = YOLOModel('yolov8')  # For real-time object detection
sam_model = SAM2Model()           # For real-time segmentation

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect humans using YOLO (or MediaPipe); returns the bbox of the closest human
    bbox = yolo_model.detect_human(frame)

    # Run SAM 2 with the bounding box as a prompt
    mask = sam_model.segment_with_bbox(frame, bbox)

    # Convert mask to binary image (white = human, black = background)
    _, mask_binary = cv2.threshold(mask, 128, 255, cv2.THRESH_BINARY)

    # Apply mask to the frame, keeping only the segmented person
    result = cv2.bitwise_and(frame, frame, mask=mask_binary)

    cv2.imshow('Segmented Human', result)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```
- YOLOv8 or MediaPipe: Used for fast human detection and bounding box generation.
- SAM 2 Model: For generating the segmentation mask based on the bounding box. The SAM 2 code and weights are available under the Apache 2.0 license.
- OpenCV: To handle webcam feed, mask generation, and applying the mask over the video frames.
- Efficiency: SAM 2 is optimized for real-time applications, processing up to 44 frames per second.
- Automated bounding box: YOLO or MediaPipe eliminates the need for manual input, making the process fully automated.
This setup provides a robust method to detect, segment, and isolate the human in real time while removing or replacing the background, suitable for interactive applications.