To achieve real-time background subtraction on a webcam feed with Meta's SAM 2 model, automatically detecting and masking the person closest to the camera without manual input, here is a step-by-step outline of one possible approach:
Before using SAM 2 for segmentation, you can employ a lightweight object detection model such as YOLO (You Only Look Once) or MediaPipe, both of which are efficient enough for real-time webcam feeds. These models automatically generate bounding boxes for detected humans, letting you identify the one closest to the camera based on box size.
- YOLOv5/v8: Efficient for real-time detection and supports bounding box output for humans.
- MediaPipe: Specifically optimized for human pose estimation and detection in real-time environments. This could be used to detect and track the closest human in the feed.
Use the bounding box output from the object detection step to locate the human. Typically, the person closest to the camera will have the largest bounding box in terms of pixel area. You can programmatically select this largest bounding box to focus on the closest person.
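That selection step can be sketched as a small helper. The `(x1, y1, x2, y2)` detection format here is an assumption for illustration; real detectors such as YOLOv8 return person boxes in a similar corner-coordinate form:

```python
def closest_person_bbox(detections):
    """Pick the bounding box with the largest pixel area,
    treating the largest detected person as the closest one.

    detections: list of (x1, y1, x2, y2) boxes for the 'person' class.
    Returns the largest box, or None if no person was detected.
    """
    if not detections:
        return None
    return max(detections, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

# Example: three detected people; the 300x400-pixel box wins
boxes = [(10, 10, 60, 110), (100, 50, 400, 450), (500, 20, 550, 90)]
print(closest_person_bbox(boxes))  # → (100, 50, 400, 450)
```

Note that box area is only a proxy for distance; it works well when people are of roughly similar height, but a depth estimate (or MediaPipe's landmark depth) would be more robust.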
Once the bounding box for the detected human is generated, you can feed this into SAM 2 as a prompt. SAM 2 supports bounding box inputs and can segment the object (in this case, the human) within the box efficiently.
SAM 2 will output a mask based on the bounding box. This mask can be converted into a black-and-white binary image, where the human is white (foreground), and the rest of the frame is black (background). This mask can then be applied over the webcam feed for background subtraction.
Using OpenCV or a similar library, apply the mask to the webcam feed in real-time. This will keep only the segmented human visible, with the background either removed or replaced as needed.
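A minimal NumPy sketch of that masking step, with background replacement as an option (the frame and mask shapes here are toy values; in the real pipeline they come from the webcam and SAM 2):

```python
import numpy as np

def apply_mask(frame, mask_binary, background=None):
    """Keep the person (mask == 255) and remove or replace the background.

    frame:       H x W x 3 uint8 image
    mask_binary: H x W uint8, 255 = person, 0 = background
    background:  optional H x W x 3 replacement image; black if None
    """
    if background is None:
        background = np.zeros_like(frame)
    keep = mask_binary.astype(bool)[..., None]  # H x W x 1, broadcasts over channels
    return np.where(keep, frame, background)

# Toy example: 2x2 frame, top row masked as person
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[255, 255], [0, 0]], dtype=np.uint8)
out = apply_mask(frame, mask)
print(out[0, 0], out[1, 0])  # person pixels kept, background pixels zeroed
```

`cv2.bitwise_and(frame, frame, mask=mask_binary)` achieves the same removal effect; the `np.where` form makes it easy to swap in a virtual background instead of black.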
Here is a basic outline of how the code might look using Python, OpenCV, and a pre-trained SAM 2 model:
```python
import cv2  # OpenCV for webcam capture and mask application

# Placeholder imports: substitute your actual SAM 2 and detector wrappers
from some_sam2_model import SAM2Model  # Load SAM 2 model
from yolo_model import YOLOModel       # Load YOLO or similar detection model

yolo_model = YOLOModel('yolov8')  # For real-time object detection
sam_model = SAM2Model()           # For real-time segmentation

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Detect humans using YOLO (or MediaPipe); returns the bbox of the closest human
    bbox = yolo_model.detect_human(frame)

    # Run SAM 2 with the bounding box as a prompt
    mask = sam_model.segment_with_bbox(frame, bbox)

    # Convert mask to binary image (white = human, black = background)
    _, mask_binary = cv2.threshold(mask, 128, 255, cv2.THRESH_BINARY)

    # Apply mask to the frame, keeping only the segmented person
    result = cv2.bitwise_and(frame, frame, mask=mask_binary)

    cv2.imshow('Segmented Human', result)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```
- YOLOv8 or MediaPipe: Used for fast human detection and bounding box generation.
- SAM 2 Model: For generating the segmentation mask based on the bounding box. The SAM 2 code and weights are available under the Apache 2.0 license.
- OpenCV: To handle webcam feed, mask generation, and applying the mask over the video frames.
- Efficiency: SAM 2 is optimized for real-time applications, processing up to 44 frames per second.
- Automated bounding box: YOLO or MediaPipe eliminates the need for manual input, making the process fully automated.
This setup provides a robust method to detect, segment, and isolate the human in real time while removing or replacing the background, suitable for interactive applications.