
Segment Anything Tutorial: Fast Auto Masks in Python

Automated Mask Generation using Segment Anything

Last Updated on 30/10/2025 by Eran Feit

Getting comfortable with the plan

This guide focuses on automatic mask generation using Segment Anything with the ViT-H checkpoint.
You’ll start by preparing a reliable Python environment that supports CUDA (if available) for GPU acceleration.
Then you’ll load the SAM model, configure the automatic mask generator, and select an image for inference.
Finally, you’ll visualize the annotated results, sort masks by area, and display every mask in a tidy grid for further analysis.

Getting to know Segment Anything (and why this tutorial matters)

In this Segment Anything tutorial, we’ll demystify Meta’s foundation model for image segmentation and show you how to get production-ready masks with just a few lines of Python.
Segment Anything (SAM) is designed to generalize: it can segment any object in an image—even ones it has never seen during training—using simple prompts or fully automatic mask generation.

At its core, SAM uses a powerful Vision Transformer (ViT) backbone to encode images into rich feature maps.
On top of that, a lightweight prompt encoder and mask decoder let you guide the model with points, boxes, or coarse regions—or skip prompting entirely and let the automatic mask generator propose high-quality masks across the scene.

Why is this exciting for practical work?
Instead of building a new segmentation model for every dataset or object category, you can bootstrap strong masks with SAM and then refine only where needed.
That’s huge for rapid prototyping, data labeling, and downstream tasks like medical pre-segmentation, retail catalog cleanup, robotics perception, geospatial analysis, and creative tools.

This Segment Anything tutorial keeps things hands-on and efficient.
You’ll set up a clean environment, load the high-accuracy ViT-H checkpoint, and run the SamAutomaticMaskGenerator end-to-end on a real image.
We’ll visualize the results with colored overlays, sort masks by area to see dominant regions first, and display every single mask in a tidy grid for quick inspection.

You’ll also learn about practical trade-offs.
GPU acceleration (CUDA) speeds up inference dramatically, but the same code runs on CPU if you’re just exploring or working with smaller images.
We’ll keep the pipeline minimal and readable so you can adapt it to your own datasets, plug in filtering by predicted IoU or stability score, and export masks for labeling or training other models.

If you’re new to segmentation, you might also like K-Means Image Segmentation with OpenCV (Beginner-Friendly).

You can find more tutorials on my blog: https://eranfeit.net/blog/

🚀 Want to get started with Computer Vision or take your skills to the next level?

If you’re just beginning, I recommend this step-by-step course designed to introduce you to the foundations of Computer Vision – Complete Computer Vision Bootcamp With PyTorch & TensorFlow

If you’re already experienced and looking for more advanced techniques, check out this deep-dive course – Modern Computer Vision GPT, PyTorch, Keras, OpenCV4


Ready, set, environment: a clean Python setup for SAM

Before we touch the model, we’ll ensure your environment is reproducible.
Conda keeps dependencies isolated, so your experiments won’t conflict with other projects.
CUDA availability will unlock GPU acceleration; otherwise, SAM still runs on CPU—just slower.
We’ll add the essential packages: PyTorch matching your CUDA version, OpenCV for image I/O, and Matplotlib and Supervision for visualization.

A robust environment is the difference between smooth progress and dependency headaches. By starting with a fresh conda environment (Python 3.9+), we lock in predictable versions for PyTorch, TorchVision, and Torchaudio that align with your CUDA toolkit. If you’re on CPU only, PyTorch still works—just install the CPU build and proceed; your results will be identical, execution just takes longer.

CUDA acceleration dramatically reduces inference time for SAM’s ViT-H backbone. Confirm your CUDA version with nvcc --version on systems with NVIDIA GPUs and drivers installed. This determines which prebuilt PyTorch wheel you should install. Matching versions avoids cryptic runtime errors and quirky performance drops.

OpenCV (opencv-python) handles image reading and color conversion. Matplotlib offers quick plotting, while Supervision adds a high-level annotator and convenient plotting grid so you can see segments and masks without writing a ton of boilerplate code. Keeping these tools minimal makes the tutorial approachable and the pipeline easy to extend.

For SAM itself, install from the official repository. The code blocks below intentionally omit direct URLs to keep them clean and copy-paste safe; the download link for the ViT-H checkpoint and the repository page are mentioned in the text sections around the code. Store your sam_vit_h_4b8939.pth checkpoint on a fast local drive and point the script to that path.

Create a fresh environment, confirm CUDA, install PyTorch, OpenCV, Matplotlib, and Supervision. Install Segment Anything from its official repo.

### Create an isolated environment so dependencies don’t clash.
conda create --name SAM-Tutorial python=3.9 -y

### Activate the environment to install and run everything inside it.
conda activate SAM-Tutorial

### (Optional) Check CUDA compiler to pick the right PyTorch build.
nvcc --version

### Install PyTorch/TorchVision/Torchaudio matching your CUDA or CPU setup.
# Example (CUDA 11.8 users would pick the matching build on the official PyTorch site).
# CPU-only users should install the CPU builds accordingly.
# Replace with the exact command recommended by PyTorch for your system.
# (Keep code blocks link-free; follow the official site instructions mentioned above.)
# conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

### Install core Python packages for IO and plotting.
pip install opencv-python matplotlib

### Install Supervision for easy annotations and plotting grids.
pip install supervision

### Install Segment Anything from the official repository (omit URLs in code blocks).
# Follow the repo instructions outside this block, or clone + pip install -e .
# pip install -e .

Your environment is ready. If you have an NVIDIA GPU, you’ll enjoy faster inference. Next, you’ll load the ViT-H checkpoint and build the automatic mask generator.

Want a neural-network baseline for medical tasks? Check out U-Net Medical Segmentation with TensorFlow & Keras.


Load ViT-H and prepare the automatic mask generator

This step wires the SAM model into a simple Python script.
You’ll select the device (GPU if available), point to the ViT-H checkpoint, and build SamAutomaticMaskGenerator.
Once the generator is instantiated, you’re a single call away from high-quality, fully automatic masks.
We’ll keep the code compact and readable so you can reuse it across projects.

Choosing the right device is key for performance. We detect a CUDA GPU with torch.cuda.is_available() and default to cuda:0 if present; otherwise we fall back to CPU. This keeps the script portable between laptops, workstations, and servers.

The ViT-H checkpoint (sam_vit_h_4b8939.pth) is the highest-capacity backbone offered by the original SAM release, known for strong accuracy. Save the file locally and set the path string accordingly. If disk is slow, consider placing the file on an SSD to reduce load times.

sam_model_registry maps a short string like "vit_h" to the correct model constructor. We pass the checkpoint path, then move the model to the selected device. With a single line, SamAutomaticMaskGenerator(sam), we obtain a ready-to-use automatic segmentation engine.

The generator’s .generate(image) method takes a NumPy image and returns a list of dictionaries containing the binary mask and helpful metadata (area, bbox, IoU estimates, stability). This structure is perfect for quick visualization and downstream analysis such as filtering by area or IoU thresholds.

Pick the device, load the ViT-H checkpoint, and instantiate the automatic mask generator.

### Import SAM model registry and the automatic mask generator helper.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

### Import torch to detect and use GPU if available.
import torch

### Choose GPU if present, else fall back to CPU for portability.
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

### Select the ViT-H model type for best accuracy in the original SAM release.
MODEL_TYPE = "vit_h"

### Set the local path to the downloaded sam_vit_h_4b8939.pth checkpoint.
pathForSamModel = "e:/temp/sam_vit_h_4b8939.pth"

### Build the SAM model from the registry and move it to the selected device.
sam = sam_model_registry[MODEL_TYPE](checkpoint=pathForSamModel).to(device=DEVICE)

### Create an automatic mask generator that produces masks with one call.
mask_generator = SamAutomaticMaskGenerator(sam)

Your SAM model is live and ready. In the next part you’ll feed an image through the generator, visualize the annotated result, and inspect every mask.


Visualize segments, sort by area, and display every mask

Now we’ll run inference on a real image, visualize the annotated segmentation, and export individual masks.
We’ll convert the BGR image from OpenCV to RGB where needed and use Supervision to draw masks.
Finally, we’ll sort masks by area and display them in a clean grid for instant insight.
This lets you validate segmentation quality and choose the masks that matter.

OpenCV loads images as BGR by default, which is fine for inference here. For some visualizations or libraries, you may convert to RGB for correct colors. Using Supervision’s MaskAnnotator and Detections.from_sam offers an elegant way to draw semantic regions without custom plotting code.

mask_generator.generate(image) returns a Python list of mask dictionaries. Each dictionary includes a boolean segmentation array, area, bbox, predicted IoU, and other useful metadata. Sorting by area is a simple but powerful way to explore masks—large regions often correspond to dominant objects, while small masks capture details.

We’ll plot a side-by-side view: the original image and an annotated image with all masks overlaid. Then we’ll collect the binary segmentation arrays and present them in a multi-row, multi-column grid. This grid is a great QA tool and also useful for thumbnails or downstream selection logic.

Keep in mind that SAM is resolution-aware: higher-resolution inputs can yield more precise masks but cost more compute. For large images, consider resizing to balance speed and quality, or run the generator on crops if you need high-detail segmentation across very large scenes.
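
To make that concrete, here is a minimal resizing sketch, assuming OpenCV is already installed; the 1024-pixel cap and the helper name resize_for_sam are illustrative choices, not part of the main pipeline below.

### A minimal sketch: cap the longest image side before running SAM (max_side is illustrative).
import cv2

def resize_for_sam(image_bgr, max_side=1024):
    ### Compute the scale needed to bring the longest side down to max_side.
    h, w = image_bgr.shape[:2]
    scale = max_side / max(h, w)
    ### Only shrink; never upscale small images.
    if scale >= 1.0:
        return image_bgr
    ### INTER_AREA is a solid default interpolation for downscaling.
    return cv2.resize(image_bgr, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)

### Usage: sam_result = mask_generator.generate(resize_for_sam(image_bgr))

Remember that masks produced on a resized image live in the resized coordinate space; scale them back up if you need full-resolution regions.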

Read an image, generate masks, draw annotations, sort by area, and plot each mask in a grid.

Brain MRI
### Import OpenCV for image IO and Supervision for annotation and plotting.
import cv2
import supervision as sv

### Read a test image from disk in BGR format.
image_bgr = cv2.imread("Best-Semantic-Segmentation-models/Segment-Anything/1-Automated Mask Generation/brain-MRI.jpg")

### Convert to RGB if a library expects RGB for correct colors (optional for plotting).
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

### Run the automatic mask generator to produce a list of mask dictionaries.
sam_result = mask_generator.generate(image_bgr)

### Inspect the keys available in each mask dictionary for analysis.
print(sam_result[0].keys())

### Create a MaskAnnotator to draw all predicted masks on the original image.
mask_annotator = sv.MaskAnnotator(color_lookup=sv.ColorLookup.INDEX)

### Convert SAM results to a Detections object compatible with Supervision utilities.
detections = sv.Detections.from_sam(sam_result=sam_result)

### Draw colored masks over a copy of the source image for a clear side-by-side view.
annotated_image = mask_annotator.annotate(scene=image_bgr.copy(), detections=detections)

### Visualize source vs annotated segmentation in a 1x2 grid for quick QA.
sv.plot_images_grid(
    images=[image_bgr, annotated_image],
    grid_size=(1, 2),
    titles=['source image', 'segmented image']
)

### Sort masks from largest to smallest by their pixel area for prioritization.
sorted_sam_result = sorted(sam_result, key=lambda x: x['area'], reverse=True)

### Collect the boolean segmentation arrays for individual display.
masks = []

### Iterate through each mask dictionary and extract the binary mask array.
for mask in sorted_sam_result:
    segmentation_value = mask['segmentation']
    masks.append(segmentation_value)

### Compute rows and columns for a compact, readable grid of masks.
import math
num_masks = len(masks)
num_cols = 8
num_rows = math.ceil(num_masks / num_cols)

### Display all binary masks in a uniform grid to explore details.
sv.plot_images_grid(
    images=masks,
    grid_size=(num_rows, num_cols),
    size=(16, 16)
)

You generated and visualized masks, inspected metadata, and plotted every mask with clean utilities. You’re now set to filter masks, export polygons, or feed them into downstream tasks.

If Vision Transformers interest you, see my guide: Build an Image Classifier with Vision Transformer (ViT).

Brain MRI Segmented

FAQ

What is Segment Anything (SAM)?

SAM is a foundation model for image segmentation that generates masks with minimal prompting. We use its automatic generator for quick results.

Do I need a GPU to run SAM?

A GPU isn’t required but speeds up inference. CPU works for small images and learning, just expect slower execution.

Which checkpoint should I use?

ViT-H offers the strongest accuracy in the original SAM release. Use it when quality is more important than speed.
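
If speed matters more, here is a hedged sketch of swapping in a smaller backbone; the checkpoint path below is illustrative, so point it at the file matching the model type you download.

### A sketch of trading accuracy for speed with a smaller backbone.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

### "vit_h", "vit_l", and "vit_b" are the model types in the original SAM release.
### The path below is illustrative; use your downloaded ViT-B checkpoint.
sam_fast = sam_model_registry["vit_b"](checkpoint="e:/temp/sam_vit_b_01ec64.pth").to(device=DEVICE)
mask_generator_fast = SamAutomaticMaskGenerator(sam_fast)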

What image format should I load?

OpenCV handles most formats. Read as BGR and pass the NumPy array directly to SAM; convert to RGB when needed for visualization.

Why sort masks by area?

Area sorting highlights dominant regions first and helps you quickly inspect details or outliers.

How do I visualize masks?

Use Supervision’s MaskAnnotator for overlays and plot_images_grid for side-by-side and multi-mask displays.

Can I filter masks by confidence?

Yes. Filter using predicted IoU or stability scores to keep only the most reliable masks.
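
A minimal sketch of that filter, assuming sam_result comes from mask_generator.generate() as in the code above; the 0.90 thresholds are illustrative starting points to tune on your own data.

### Keep only masks above illustrative confidence thresholds.
IOU_THRESHOLD = 0.90        # illustrative value; tune for your data
STABILITY_THRESHOLD = 0.90  # illustrative value; tune for your data

### Each SAM mask dictionary carries 'predicted_iou' and 'stability_score' metadata.
reliable_masks = [m for m in sam_result
                  if m['predicted_iou'] >= IOU_THRESHOLD
                  and m['stability_score'] >= STABILITY_THRESHOLD]

print(f"Kept {len(reliable_masks)} of {len(sam_result)} masks")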

How big can images be?

Very large images can be slow. Resize or process in tiles to balance accuracy and speed.
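
If you want to try tiling, here is a rough sketch, assuming mask_generator is already built; the tile size and overlap are illustrative, and overlapping tiles will produce duplicate masks that you may want to deduplicate by IoU afterwards.

### A rough tiling sketch: segment fixed-size tiles, then paste masks back at full resolution.
import numpy as np

def generate_tiled(image_bgr, tile=1024, overlap=128):
    h, w = image_bgr.shape[:2]
    step = tile - overlap
    full_masks = []
    for y in range(0, h, step):
        for x in range(0, w, step):
            ### Crop one tile (automatically clipped at the image border).
            crop = image_bgr[y:y + tile, x:x + tile]
            for m in mask_generator.generate(crop):
                ### Paste the tile-sized boolean mask into a full-size canvas.
                canvas = np.zeros((h, w), dtype=bool)
                ch, cw = m['segmentation'].shape
                canvas[y:y + ch, x:x + cw] = m['segmentation']
                full_masks.append(canvas)
    return full_masks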

Is RGB conversion required?

Not for the inference shown here. Use RGB when a plotting library expects it for correct colors.

Can I export masks as polygons?

Yes. Trace contours from binary masks with OpenCV and export polygons to JSON/COCO for downstream tasks.
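
Here is a minimal contour-tracing sketch, assuming masks is the list of boolean arrays collected earlier; the JSON layout is illustrative rather than a formal COCO export.

### Trace each boolean mask into polygons with OpenCV and dump them to JSON.
import json
import cv2
import numpy as np

polygons = []
for mask in masks:
    ### findContours expects an 8-bit image, so cast the boolean array first.
    mask_u8 = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ### Flatten each contour to a simple [[x, y], ...] list (illustrative layout, not formal COCO).
    polygons.append([c.reshape(-1, 2).tolist() for c in contours])

with open("masks_polygons.json", "w") as f:
    json.dump(polygons, f)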


Conclusion

You now have a complete, working pipeline for the original Segment Anything: a clean environment, a properly loaded ViT-H checkpoint, and a straightforward route to automatic mask generation. The code is intentionally minimal yet expressive, so you can adapt it to datasets, cameras, or pre-/post-processing steps with almost no friction.

From here, tailor the generator parameters and add filters on predicted IoU or stability to curate the best regions for your task. Sorting by area is a quick win, but you can also rank by confidence, overlap, or shape complexity, depending on the downstream use case. If you work with large images, explore tiling or multi-scale strategies to sustain quality without sacrificing speed.
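
As one hedged example of that tuning, SamAutomaticMaskGenerator accepts sampling and filtering parameters at construction time; the values below are illustrative starting points rather than recommendations.

### A tuning sketch with denser sampling and stricter filtering (illustrative values).
from segment_anything import SamAutomaticMaskGenerator

mask_generator_tuned = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # density of the prompt grid sampled over the image
    pred_iou_thresh=0.90,         # drop masks the model scores as low quality
    stability_score_thresh=0.95,  # drop masks that change under threshold jitter
    min_mask_region_area=100,     # remove tiny disconnected regions (uses OpenCV)
)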

Visualization matters for trust. Supervision’s annotator and plotting grid give you fast feedback, helping you debug mis-segmentations and confirm that masks align with domain expectations (medical, industrial, or consumer imagery). Once confident, export masks, convert to polygons, or integrate with datasets for training/upgrading other models.

Finally, keep experimenting: swap images, try different checkpoints, and measure runtime. This is a robust, extensible starting point—perfect for tutorials, demos, and production prototypes that demand fast, accurate, and repeatable segmentation.



Connect

☕ Buy me a coffee — https://ko-fi.com/eranfeit

🖥️ Email : feitgemel@gmail.com

🌐 https://eranfeit.net

🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb

Enjoy,

Eran
