Complete Guide to SAM2 Video Segmentation in Python

Leave a Comment / Image Segmentation, Pytorch

Contents hide

1 Let’s break down SAM2 video segmentation in plain English

2 Building a Full SAM2 Video Segmentation Pipeline Step by Step

2.1 Best AI Photo Tools (Backgrounds, Objects, Headshots)

4 Complete Guide to SAM2 Video Segmentation in Python

5 Setting up SAM2 so video segmentation runs smoothly

6 Want the sample video file used in this tutorial?

7 Extracting your video into frame images you can work with

8 Loading SAM2 and previewing the frame you will prompt on

9 Segmenting the first frame with a few point prompts

10 Propagating the mask across the entire video and saving results

11 Exporting binary masks and overlay images for every segmented frame

12 FAQ – SAM2 Video Segmentation in Python

12.1 What is SAM2 video segmentation in simple terms?

12.2 Why do we extract frames before running segmentation?

12.3 How many point prompts are usually enough?

12.4 What do point labels mean in the code?

12.5 Do I need a GPU to run SAM2 video segmentation?

12.6 Why are some masks empty on certain frames?

12.7 How do I segment multiple objects in one video?

12.8 What can I do if the mask drifts over time?

12.9 Should I save binary masks, overlays, or both?

12.10 How can I speed up testing while developing the script?

Last Updated on 05/02/2026 by Eran Feit

SAM2 video segmentation is a practical way to turn a single user interaction into consistent object masks across an entire video.
Instead of segmenting every frame manually, you mark the object once using a few point clicks, and the model carries that object forward through time.
This makes video segmentation feel much closer to an interactive editing workflow than a traditional “train a model first” pipeline.
It’s especially useful when you need fast results, you don’t have labeled data, or you’re prototyping an idea and want accurate masks quickly.

At a high level, SAM2 video segmentation combines strong per-frame segmentation with a memory mechanism that helps the model stay consistent as the object moves.
When the target changes shape, rotates, or becomes partially occluded, the model uses information from previous frames to keep tracking the same object.
That means fewer corrections, less flickering, and more stable masks compared to running a static image segmenter on each frame independently.
In practice, this turns “video segmentation” into a repeatable process you can run on new videos with minimal setup.

In Python, the workflow usually starts by breaking your video into frames.
Working with frames gives you full control over indexing, saving outputs, debugging results, and visualizing what the model is doing.
From there, you initialize the SAM2 video predictor with a chosen checkpoint and configuration, pick the first frame, and provide point prompts that tell the model what to segment.
Once the first mask is created, the predictor propagates the segmentation across the remaining frames to produce a mask sequence over time.

The final step is turning those masks into useful assets.
Binary masks are perfect for downstream processing, like background removal, compositing, measurement, or training data generation.
Overlay masks are great for sanity checks, demos, and visual reporting because they show the segmentation aligned on top of the original video frames.
When you save both, you end up with a clean, production-friendly output that supports editing, analytics, and automation workflows.

Tip me and Download the code

Let’s break down SAM2 video segmentation in plain English

SAM2 video segmentation is all about taking a tiny amount of guidance and producing a full set of masks across a whole clip.
You don’t need to label a dataset, train a custom model, or write complicated tracking logic.
Instead, you provide simple prompts on one frame, and the model does the heavy lifting to keep the segmentation consistent through motion and scene changes.
That makes it a strong fit for creators, developers, and researchers who want accurate results without a long setup cycle.

A typical pipeline begins with frame extraction, because it makes the video easy to process step by step.
Once frames are stored in a folder, you can sort them, choose a starting frame, and visualize exactly what you’re prompting.
This also makes outputs straightforward, because you can save masks using matching filenames and keep everything aligned.
It’s a clean structure that scales from short clips to longer videos without changing the core logic.

Next comes the interactive part: point prompts.
Positive points tell the model “this is part of the object,” and negative points can be used to say “this area is not the object.”
Even a few positive points can be enough to lock onto the target, especially when the object is clear.
If the first mask isn’t perfect, you refine it by adding more points, which is fast and intuitive.

After the model has a confident mask on the first frame, propagation is what turns this into true video segmentation.
The predictor moves through the video frames and produces a mask for the same object ID on each frame it can track.
The goal is temporal consistency, meaning the object remains the same target even if it changes pose, scale, or partially disappears.
At the end, you store the masks in a structured form, then export binary masks and overlay images for practical use in editing or analysis.

SAM2 video segmentation

Building a Full SAM2 Video Segmentation Pipeline Step by Step

This tutorial code is designed to take you from a raw video file to complete segmentation outputs, using a workflow that feels practical and repeatable.
Instead of treating video segmentation as a complicated research project, the code breaks it into clear stages you can run, debug, and improve one step at a time.
By the end, you have both machine-friendly masks and human-friendly visual overlays saved to disk.
The goal is to make SAM2 video segmentation feel like a straightforward Python pipeline you can reuse for new videos.

The first part of the code focuses on turning the video into frames, because frame-based processing is easier to control than working on a compressed video stream directly.
Once every frame is saved as an image, you can reference frames by index, inspect them visually, and keep outputs aligned with the original timeline.
This also makes it simple to rerun only parts of the pipeline without reprocessing the entire video.
It’s the foundation that keeps the rest of the workflow organized.

Next, the code initializes the SAM2 video predictor using a configuration file and a downloaded checkpoint.
This is the point where you “load the brain” of the system and choose whether you’re running on GPU or CPU.
From there, the pipeline selects a starting frame and uses a few point prompts to define the object you care about.
Those point prompts act like a quick interactive signal that tells the model exactly what to segment.

After the initial mask is created on the first frame, the core of the tutorial happens: propagating that mask across the entire video.
The code iterates through the frames and collects predicted masks into a structured dictionary, keyed by frame index and object ID.
This makes the output easy to save, reload, and post-process later, without needing to rerun the model every time.
It also helps you handle real-world situations like empty masks, missing frames, or cases where the object disappears.

Finally, the tutorial code turns the segmentation results into usable outputs.
It visualizes overlays on top of the original frames so you can quickly validate quality and spot where the segmentation might drift.
It also generates binary masks, which are ideal for editing, compositing, measurements, dataset creation, or any downstream computer vision task.
Saving both overlay and binary formats gives you the best of both worlds: clear visuals for humans and clean masks for machines.

Link to the video tutorial here .

Download the code for the tutorial here or here

My Blog

You can follow my blog here .

Link for Medium users here .

Want to get started with Computer Vision or take your skills to the next level ?

Great Interactive Course : “Deep Learning for Images with PyTorch” here

If you’re just beginning, I recommend this step-by-step course designed to introduce you to the foundations of Computer Vision – Complete Computer Vision Bootcamp With PyTorch & TensorFlow

If you’re already experienced and looking for more advanced techniques, check out this deep-dive course – Modern Computer Vision GPT, PyTorch, Keras, OpenCV4

SAM2 video segmentation

Complete Guide to SAM2 Video Segmentation in Python

SAM2 video segmentation is one of the fastest ways to get clean object masks across an entire video without training a custom model.
Instead of labeling hundreds of frames, you guide the model with a few point clicks on a single frame, and then let it propagate that mask forward through the video.
This makes the workflow feel much more like an interactive editing tool, but still fully scriptable in Python.

In this tutorial, the goal is to build a complete end-to-end pipeline.
You will extract frames from a video, initialize a SAM2 video predictor with a pretrained checkpoint, create a mask using point prompts, and then propagate that mask across all frames.
Finally, you will export both binary masks and overlay visualizations so you can reuse the results in other projects.

By the end, you will have a repeatable SAM2 video segmentation script you can run on any similar video.
You can keep the outputs for dataset creation, video editing workflows, object removal, background replacement, analytics, or just to understand how promptable video segmentation works in practice.

Setting up SAM2 so video segmentation runs smoothly

A stable environment is the difference between a fun segmentation workflow and a day of dependency bugs.
In this section, you will create a clean Conda environment, install a compatible PyTorch build, and install the SAM2 repository in editable mode so Python can import it.

You will also download the model checkpoints that power SAM2 video segmentation.
Once those files are in place and your IDE points to the correct interpreter, you are ready to run the two scripts in this tutorial.

### Create a fresh Conda environment for SAM2 video segmentation. conda create -n sam2 python=3.12 ### Activate the environment so installs go to the right place. conda activate sam2  ### Check your CUDA version so you can match the right PyTorch build. nvcc --version   ### Install PyTorch, TorchVision, and TorchAudio with CUDA support. conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia  ### Install Matplotlib for plotting masks and frames. pip install matplotlib==3.10.0 ### Install OpenCV for video frame extraction and basic image operations. pip install opencv-python==4.10.0.84 ### Install Supervision for common CV visualization utilities. pip install supervision==0.25.1  ### Move to a working folder where you keep tutorial projects. c: ### Enter your tutorials directory. cd tutorials  ### Clone the SAM2 repository to your local machine. git clone https://github.com/facebookresearch/sam2.git  ### Enter the cloned repository folder. cd sam2  ### Install SAM2 in editable mode so Python can import it. pip install -e .  ### Download the SAM2 checkpoints into the checkpoints folder. wget -O checkpoints/sam2.1_hiera_tiny.pt https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt  wget -O checkpoints/sam2.1_hiera_small.pt https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt  wget -O checkpoints/sam2.1_hiera_base_plus.pt https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt  wget -O checkpoints/sam2.1_hiera_large.pt https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt   ### Open VSCode and set the working folder to c:/tutorials/SAM2. ### Choose the SAM2 Conda environment as your Python interpreter. ### Copy the "Code" folder into c:/tutorials/SAM2 or create a "Code" folder.

Short summary.
Your environment now contains SAM2, the required Python libraries, and the pretrained checkpoints needed for SAM2 video segmentation.

Want the sample video file used in this tutorial?

The video file used for testing in this workflow can be large, and that makes it annoying to host directly inside the post.
If you want the exact same video so your frame numbers and results match the code, you can request it by email.

Send me a short message and I will share the file with you. My email is : feitgemel@gmail.com

Extracting your video into frame images you can work with

SAM2 video segmentation becomes easier when you treat a video as a folder of numbered frames.
This approach makes debugging simpler, because you can open a single frame, verify coordinates, and confirm that masks line up the way you expect.

In the script below, you will read a video with OpenCV and save every frame as a JPEG image.
The function returns a list of saved filenames, which is useful if you want to reuse that list later for sorting, checking counts, or batch processing.

### Import OpenCV so we can read the video and write frames as images. import cv2  ### Import os so we can create folders and build file paths safely. import os   ### Define a helper function that extracts every frame from a video file. def extract_frames(video_path, output_folder):      ### Create the output folder if it does not exist yet.     os.makedirs(output_folder, exist_ok=True)      ### Open the video file using OpenCV.     cap = cv2.VideoCapture(video_path)     ### Prepare a list to store paths for all saved frames.     frame_list = []     ### Start counting frames from zero so names match the index.     frame_index = 0      ### Loop until the video ends or reading fails.     while cap.isOpened():         ### Read the next frame from the video capture object.         ret , frame = cap.read()         ### If OpenCV returns no frame, we reached the end of the video.         if not ret:             ### Stop the loop when there are no more frames.             break # Stop if no frame is returned          ### Build the filename for the current frame using the frame index.         frame_filename = os.path.join(output_folder, f"{frame_index}.jpg")         ### Save the current frame as a JPEG image.         cv2.imwrite(frame_filename, frame) # save as Jpeg image         ### Add the saved frame path to the list so we can count and reuse it.         frame_list.append(frame_filename)          ### Increment the index so the next frame gets the next filename.         frame_index += 1         ### Print progress so you can see extraction is running.         print(f"Extracted frame {frame_index} to {frame_filename}")      ### Release the video capture object so the file handle closes cleanly.     cap.release() # Release the video capture object     ### Return the list of frame image paths.     return frame_list  ### Point to the input video used in the tutorial. video_path = "code2/dog-and-person.mp4"  ### Choose an output folder where extracted frames will be saved. output_folder = "frames" ### Run the extraction process and collect the list of saved frame files. frames = extract_frames(video_path, output_folder) ### Print the final count so you know how many frames were extracted. print(f"Extracted {len(frames)} frames from the video. Output saved in '{output_folder}' folder.")

Short summary.
You now have a frames folder where each image is a video frame, ready for SAM2 video segmentation.

Loading SAM2 and previewing the frame you will prompt on

Before you segment anything, it helps to confirm that your video frames load correctly and are sorted in the right order.
This part of the code sets up the device, loads the SAM2 predictor, collects frame filenames, and displays the first frame as a visual sanity check.

You will also define small helper functions for drawing masks, points, and boxes.
These helpers make the rest of the tutorial easier to read, and they make debugging much faster when you want to verify whether a prompt is landing where you think it is.

### Import NumPy for point arrays, masks, and general numeric operations. import numpy as np ### Import torch to choose CPU or GPU for running SAM2. import torch  ### Import Matplotlib so we can display frames and masks inline. import matplotlib.pyplot as plt ### Import OpenCV for utility operations if needed later. import cv2  ### Import os for file and folder operations. import os  ### Import PIL Image to load frames easily for Matplotlib display. from PIL import Image  ### Set a random seed so any random color logic stays consistent. np.random.seed(3)  ### Select the device for computation based on CUDA availability. device = torch.device("cuda" if torch.cuda.is_available() else "cpu") ### Print the selected device so you know if you are on GPU or CPU. print(f"Using device: {device}")  ### Define a helper to draw a semi-transparent mask overlay. def show_mask(mask, ax, obj_id=None, random_color=False):     ### Pick a random color if requested for visualization variety.     if random_color:         ### Build an RGBA color with fixed alpha for transparency.         color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)     ### Otherwise, select a consistent color from a colormap.     else:         ### Use a categorical colormap so multiple objects get different colors.         cmap = plt.get_cmap("tab10")         ### Map object id to a stable color slot.         cmap_idx = 0 if obj_id is None else obj_id % 10         ### Build an RGBA color with an alpha channel.         color = np.array([*cmap(cmap_idx)[:3], 0.6])     ### Extract height and width of the mask.     h, w = mask.shape[-2:]     ### Convert the mask into an RGBA image for overlay display.     mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)     ### Render the overlay on the provided axes.     ax.imshow(mask_image)  ### Define a helper to draw positive and negative click points. def show_points(coords, labels, ax, marker_size=200):     ### Separate positive points where label equals 1.     pos_points = coords[labels == 1]     ### Separate negative points where label equals 0.     neg_points = coords[labels == 0]     ### Draw positive points as green stars.     ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)     ### Draw negative points as red stars.     ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white', linewidth=1.25)  ### Define a helper to draw a bounding box if you decide to use one later. def show_box(box, ax):     ### Unpack the top-left x and y coordinates.     x0, y0 = box[0], box[1]     ### Compute width and height from corner coordinates.     w, h = box[2] - box[0], box[3] - box[1]     ### Draw the rectangle on the axes with a visible outline.     ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))  ### Choose the SAM2 checkpoint file you downloaded earlier. sam2_checkpoint = "checkpoints/sam2.1_hiera_large.pt" # download it in the install part  ### Choose the model configuration YAML from the SAM2 repository. model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml" # part of the SAM2 repo  ### Import the video predictor builder from SAM2. from sam2.build_sam import build_sam2_video_predictor  ### Build the SAM2 video predictor with config, checkpoint, and chosen device. predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device)  ### Point to the folder that contains extracted video frames. video_dir = "frames"  ### Collect and sort frame filenames numerically by index. frame_names = sorted(     [p for p in os.listdir(video_dir) if p.lower().endswith(('.jpg', '.jpeg', ))],     key=lambda p: int(os.path.splitext(p)[0]) )  ### Choose which frame to preview before prompting. frame_idx = 0  ### Create a figure sized for comfortable viewing. plt.figure(figsize=(9,6)) ### Add a title so you know which frame index you are seeing. plt.title(f"Frame {frame_idx}") ### Display the chosen frame image. plt.imshow(Image.open(os.path.join(video_dir, frame_names[frame_idx]))) ### Render the plot window. plt.show()

Short summary.
You have loaded SAM2, prepared visualization helpers, and confirmed that your extracted frames load in the correct order.

Segmenting the first frame with a few point prompts

This section is where SAM2 video segmentation becomes interactive.
You pick one frame, provide a handful of positive points on the object you care about, and the predictor produces a mask for that object.

The output mask logits are then thresholded into a binary mask for visualization.
Once you like the mask on this first frame, you can treat it as the anchor that SAM2 will propagate across the rest of the video.

### Initialize the internal inference state using the frames folder as the video source. inference_state = predictor.init_state(video_path = video_dir)  ### Reset the predictor state so this run starts cleanly. predictor.reset_state(inference_state)  ### Choose the annotation frame index where you provide prompts. ann_frame_idx = 0 # the first frame ### Choose an object id so the model can track this object consistently. ann_obj_id = 1  ### Define positive points that land on the object you want to segment. points = np.array([[1363, 642], [1342, 688], [1410, 726]], dtype=np.float32) ### Define labels for each point where 1 is positive and 0 is negative. labels = np.array([1, 1, 1], dtype=np.int32) # 1 for positive, 0 for negative  ### Add the points into SAM2 and request a new mask for this object. _ , out_obj_ids , out_mask_logits = predictor.add_new_points_or_box(     inference_state=inference_state,     frame_idx=ann_frame_idx,     obj_id = ann_obj_id,     points=points,     labels=labels, )  ### Create a new figure to visualize the prompt and the resulting mask. plt.figure(figsize=(9, 6)) ### Add a title so you know which frame is being displayed. plt.title(f"Frame {ann_frame_idx} ") ### Show the original frame image. plt.imshow(Image.open(os.path.join(video_dir, frame_names[ann_frame_idx]))) ### Draw the prompt points on top of the image. show_points(points, labels, plt.gca()) ### Convert mask logits to a boolean mask and draw it as an overlay. show_mask((out_mask_logits[0] > 0.0).cpu().numpy(), plt.gca(), obj_id=out_obj_ids[0]) ### Render the plot window. plt.show()

Short summary.
You created an initial object mask using only point prompts, which becomes the seed for propagating SAM2 video segmentation through time.

Propagating the mask across the entire video and saving results

Once SAM2 understands the object in the first frame, the propagate step turns that single mask into a full video segmentation output.
The predictor iterates through frames and returns masks for the tracked object, which you store in a structured dictionary for easy reuse.

Saving the results to disk is important because propagation can take time on long videos.
With a saved segments file, you can reload masks later for visualization and export without rerunning the model every time.

### Print a message so you know propagation has started. print("Propagating segmentation across the video...") ### Create a dictionary that will store per-frame segmentation results. video_segments = {}  ### Iterate through predicted masks as SAM2 propagates across frames. for out_frame_idx , out_obj_ids , out_mask_logits in predictor.propagate_in_video(inference_state):      ### Create a new dictionary entry for this frame index.     video_segments[out_frame_idx] = {}       ### Loop through object ids returned for this frame.     for i , out_obj_id in enumerate(out_obj_ids):         ### Convert logits to a boolean mask.         mask_values = (out_mask_logits[i] > 0.0).cpu().numpy()         ### Skip empty masks so you do not save noise.         if mask_values.sum() == 0:             ### Print a message to help you spot frames where the object is missing.             print(f"Skipping empty mask for object {out_obj_id} in frame {out_frame_idx}")             ### Continue to the next object id.             continue         ### Store the mask for this object id in this frame.         video_segments[out_frame_idx][out_obj_id] = mask_values  ### Print how many frames ended up with segmentation masks. print(f"Total frames with segments: {len(video_segments)}")        ### Import pickle so we can save the dictionary to disk. import pickle  ### Choose an output path for the saved segmentation dictionary. output_path = "d:/temp/sam2_video_segments.pkl" ### Create the destination folder if it does not exist. os.makedirs(os.path.dirname(output_path), exist_ok=True) ### Write the segmentation dictionary to disk as a pickle file. with open(output_path, 'wb') as f:     pickle.dump(video_segments, f) ### Print confirmation so you know the file was saved. print(f"Video segments saved to {output_path}")

Short summary.
You propagated the mask across frames and saved a reusable segments file, which is the core output of SAM2 video segmentation.

Exporting binary masks and overlay images for every segmented frame

Now that you have masks stored per frame, you can visualize results and export them into formats you can reuse.
Overlay images are great for quick quality checks, because you can see whether the mask stays aligned as the subject moves.

Binary masks are ideal for downstream processing.
They give you a clean black-and-white image that can be used for compositing, background removal, measurement, or dataset generation.

### Create a larger figure for showing two views side by side. plt.figure(figsize=(12, 4))  ### Create a subplot for the original frame with overlay masks. ax1 = plt.subplot(1, 2, 1) # original / overlay frame ### Create a subplot for the binary mask view. ax2 = plt.subplot(1, 2, 2) # binary mask frame  ### Choose how often to visualize frames during playback. vis_frame_stride = 1   ### Loop through frames and show overlays and masks. for out_frame_idx in range(0, len(frame_names), vis_frame_stride):     ### Skip frames where no segmentation was produced.     if out_frame_idx not in video_segments:         ### Print a message so you know why a frame is skipped.         print(f"Skipping frame {out_frame_idx} : No segments found")         ### Continue to the next frame.         continue      ### Build the full path to the frame image.     frame_path = os.path.join(video_dir, frame_names[out_frame_idx])     ### Load the frame image for display.     frame_img = Image.open(frame_path)      ### Clear the overlay axis so the new frame draws cleanly.     ax1.clear()     ### Clear the binary axis so the new mask draws cleanly.     ax2.clear()      ### Display the original frame for overlay visualization.     ax1.imshow(frame_img)     ### Draw every object mask as an overlay on the original frame.     for out_obj_id, out_mask in video_segments[out_frame_idx].items():         show_mask(out_mask, ax1, obj_id=out_obj_id)     ### Set a helpful title for the overlay view.     ax1.set_title(f"Frame {out_frame_idx} - Overlay")     ### Remove axis ticks for a cleaner look.     ax1.axis('off')      ### Draw binary masks for the same frame.     for out_obj_id, out_mask in video_segments[out_frame_idx].items():         ### Create a blank binary mask image using the frame height and width.         binary_mask = np.zeros_like(np.array(frame_img)[:, :, 0])         ### Fill mask pixels with white where the object is present.         binary_mask[out_mask.squeeze(0)] = 255          ### Display the binary mask image in grayscale.         ax2.imshow(binary_mask, cmap='gray')      ### Set a helpful title for the binary view.     ax2.set_title(f"Frame {out_frame_idx} - Binary Mask")     ### Remove axis ticks for a cleaner look.     ax2.axis('off')      ### Keep the layout tidy while animating through frames.     plt.tight_layout     ### Pause briefly so Matplotlib updates the view.     plt.pause(0.001)  # Pause to update the plot  ### Create output folders for saving binary and overlay masks. binary_mask_folder = "frames_output/binary_masks" overlay_mask_folder = "frames_output/overlay_masks" ### Ensure both folders exist before saving. os.makedirs(binary_mask_folder, exist_ok=True) os.makedirs(overlay_mask_folder, exist_ok=True)  ### Save individual masks and overlay images for each segmented frame. for out_frame_idx in video_segments.keys():      ### Load the frame corresponding to this segmentation output.     frame_path = os.path.join(video_dir, frame_names[out_frame_idx])     ### Open the frame image for saving overlays.     frame_img = Image.open(frame_path)      ### Loop through objects in this frame.     for out_obj_id , out_mask in video_segments[out_frame_idx].items():         ### Create a blank binary mask image.         binary_mask = np.zeros_like(np.array(frame_img)[:, :, 0])         ### Fill the mask pixels with white where the object is present.         binary_mask[out_mask.squeeze(0)] = 255         ### Convert the NumPy array into an image for saving.         binary_mask_img = Image.fromarray(binary_mask)         ### Save the binary mask image as a PNG.         binary_mask_img.save(os.path.join(binary_mask_folder, f"mask_{out_frame_idx}.png"))          ### Build the overlay output path for this frame.         overlay_mask_path = os.path.join(overlay_mask_folder, f"overlay_{out_frame_idx}.png")          ### Create a figure for exporting the overlay view.         plt.figure(figsize=(6,4))         ### Show the original frame image.         plt.imshow(frame_img)         ### Draw the mask overlay on top of the original image.         show_mask(out_mask, plt.gca(), obj_id=out_obj_id)         ### Hide axes so the saved image looks clean.         plt.axis('off')         ### Save the overlay image to disk.         plt.savefig(overlay_mask_path, bbox_inches='tight', pad_inches=0)         ### Close the figure to avoid memory growth during long videos.         plt.close()

Short summary.
You exported both binary masks and overlay images for SAM2 video segmentation, which makes the results easy to reuse in other workflows.

FAQ – SAM2 Video Segmentation in Python

What is SAM2 video segmentation in simple terms?

You click a few points on an object in one frame, and SAM2 generates masks for that object across the full video. It reduces manual labeling while staying consistent over time.

Why do we extract frames before running segmentation?

Frames give you stable indexing and easy debugging. They also make saving masks and overlays straightforward and reproducible.

How many point prompts are usually enough?

Often 1–5 positive points are enough for a clear object. Add more points when edges look wrong or the object blends with the background.

What do point labels mean in the code?

Label 1 means a positive point on the object. Label 0 means a negative point telling SAM2 what should not be included.

Do I need a GPU to run SAM2 video segmentation?

No, CPU works, but a GPU can speed up propagation a lot. For long videos, GPU is strongly recommended for practical runtimes.

Why are some masks empty on certain frames?

The object may be occluded, off-screen, or too small to segment confidently. Skipping empty masks keeps your outputs clean and easier to use.

How do I segment multiple objects in one video?

Use a different obj_id per object and add prompts for each one. Store masks per frame and object id so you can export them separately.

What can I do if the mask drifts over time?

Add a few more prompts on a later frame where drift starts and propagate again. Extra guidance usually improves stability in hard motion scenes.

Should I save binary masks, overlays, or both?

Save both when possible. Overlays are best for quick visual checks, while binary masks are best for automation, editing, and dataset creation.

How can I speed up testing while developing the script?

Use a short video and visualize every N frames instead of every frame. Once it looks correct, run full export for the complete output set.

Conclusion

You now have a complete, practical SAM2 video segmentation workflow that you can run end-to-end in Python.
You started by extracting frames, which gives you total control over indexing, debugging, and saving outputs.
Then you initialized a SAM2 video predictor, created a high-quality mask using point prompts, and propagated that mask through time.

The biggest win of this approach is that it scales naturally.
If you need a quick mask for a short clip, you can do it in minutes with only a few clicks.
If you need a larger batch workflow, you can run the same pipeline on multiple videos and keep consistent outputs across projects.

From here, the next upgrades are all about refinement and reuse.
You can add negative points for cleaner edges, segment multiple objects with multiple obj_id values, or build utilities that combine masks into a single labeled output.
And since you export both binary masks and overlays, you can plug these results into editing pipelines, dataset generation, or other computer vision tasks without changing the core logic.

Connect :

☕ Buy me a coffee — https://ko-fi.com/eranfeit

🖥️ Email : feitgemel@gmail.com

🌐 https://eranfeit.net

🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb

Enjoy,

Eran

Leave a Comment Cancel Reply