
Mask R-CNN Python Tutorial: A Complete Guide to Instance Segmentation


Contents

1 What is Instance Segmentation vs. Semantic Segmentation?

1.1 Semantic Segmentation: The “Categorical” Approach

1.2 Instance Segmentation: The “Individual” Approach

1.3 Why Mask R-CNN is the Standard

2 A friendly mask rcnn tutorial for instance segmentation in Python

3 Running a Complete Mask R-CNN Pipeline in Python

3.1 Master Computer Vision

4 Setting up Your Python Environment for Mask R-CNN

5 Loading the Pre-Trained Mask R-CNN Model and ResNet Backbone

6 Image Preprocessing: Tensors and COCO Dataset Normalization

7 Turning masks into a clean OpenCV overlay

8 Running Inference: Extracting Bounding Boxes and Masks

8.1 Quick wrap-up

9 FAQ – R‑CNN Tutorial

9.1 What is instance segmentation in Mask R-CNN?

9.2 Do I need training data for this mask rcnn tutorial?

9.3 Why does the code use COCO class names?

9.4 What does the threshold parameter actually do?

9.5 Why is the mask threshold set to 0.5?

9.6 Why does the code move the image tensor to CUDA?

9.7 Why does OpenCV color look wrong without conversion?

9.8 How can I improve readability of boxes and labels?

9.9 Can I run this mask rcnn tutorial on CPU only?

9.10 What is the easiest way to test URL images?

Last Updated on 05/05/2026 by Eran Feit

Object detection can tell you where an object is, but it falls short when you need the exact pixel boundaries. If you are struggling to move beyond basic bounding boxes, this Mask R-CNN Python tutorial for instance segmentation is exactly what you need. In this guide, we will bridge the gap between theoretical computer vision and practical implementation. You will learn how to configure a pre-trained model, properly process image tensors, and extract highly accurate, pixel-perfect masks for distinct objects in your images. Let’s dive into the code and mechanics behind state-of-the-art image segmentation.

A good mask rcnn tutorial should also explain why instance segmentation is different from the other common computer vision tasks.
Image classification tells you what exists in a whole image.
Object detection tells you where objects are using bounding boxes.
Instance segmentation goes one step further and produces a separate mask for every object instance, even when multiple objects share the same class.

From a Python workflow perspective, Mask R-CNN is especially approachable because you can run it directly from widely used libraries.
With a pretrained model, you skip training entirely and jump straight into inference.
That makes it perfect for prototyping, automation, and building real projects where you need segmentation output quickly.
Once you have masks, you can visualize them, filter them by confidence, label them with COCO classes, and export the results for downstream steps.

In real-world use, instance segmentation is often about presentation and usability as much as accuracy.
You want clean overlays, readable labels, and predictable output that you can save and reuse.
That’s why pairing the model output with practical tools like OpenCV matters.
It lets you draw boxes, blend colored masks, display the final result, and save an image that’s ready to share or debug.

What is Instance Segmentation vs. Semantic Segmentation?

Before diving into the code, it is crucial to understand where Mask R-CNN sits in the computer vision hierarchy. While both techniques involve assigning labels to pixels, the level of granularity and the way they handle individual objects are fundamentally different.

Semantic Segmentation: The “Categorical” Approach

Semantic segmentation treats all pixels of the same class as a single entity. If an image contains five different cars, a semantic segmentation model will highlight every pixel belonging to “car” in the same color. It does not distinguish between individual vehicles; it simply maps out the “territory” of that class within the frame. This is commonly used in medical imaging (identifying a tumorous region) or land-cover mapping.

Instance Segmentation: The “Individual” Approach

Instance segmentation, the core of this Mask R-CNN Python tutorial, goes a step further. It combines the strengths of Object Detection and Semantic Segmentation. It doesn’t just identify that “cars” are present; it identifies “Car 1,” “Car 2,” and “Car 3” as distinct individuals. Each instance is given its own unique mask and bounding box.

The Key Differences at a Glance:


Feature	Semantic Segmentation	Instance Segmentation (Mask R-CNN)
Object Distinction	Groups all objects of a class together.	Detects and separates individual objects.
Primary Goal	Understand the scene layout.	Understand the individual components.
Complexity	High (Pixel-level classification).	Very High (Detection + Pixel-level Masking).
Common Use Case	Autonomous driving (Road vs. Sidewalk).	Robotics (Picking up a specific tool).
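
To make the table concrete, here is a minimal NumPy sketch. The toy 4x6 grid and the variable names are illustrative assumptions, not part of the tutorial code. It shows how two separate instance masks of the same class collapse into a single region once they are merged into a semantic map.

```python
import numpy as np

# Two hypothetical "car" instances on a tiny 4x6 image (True = object pixel).
car_1 = np.zeros((4, 6), dtype=bool)
car_1[1:3, 0:2] = True
car_2 = np.zeros((4, 6), dtype=bool)
car_2[1:3, 4:6] = True

# Instance segmentation keeps one mask per object: shape (num_instances, H, W).
instance_masks = np.stack([car_1, car_2])

# Semantic segmentation only records the class of each pixel, so merging the
# instances with a logical OR loses the boundary between the two cars.
semantic_map = instance_masks.any(axis=0)

print(instance_masks.shape)      # (2, 4, 6)
print(int(semantic_map.sum()))   # 8
```

The instance array can always be reduced to the semantic map, but not the other way around, which is why instance segmentation is the harder task.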

Why Mask R-CNN is the Standard

Mask R-CNN (Region-based Convolutional Neural Network) is widely considered the gold standard for instance segmentation because it effectively adds a “mask branch” to the Faster R-CNN architecture. While the model detects the bounding box (localization) and the class (classification), it simultaneously predicts a binary mask for each detected object. This multi-task learning approach ensures that the model isn’t just guessing where an object is, but is understanding its precise geometric shape down to the pixel level.

A friendly mask rcnn tutorial for instance segmentation in Python

A mask rcnn tutorial is most useful when it keeps the goal simple and concrete.
Take an input image, run a pretrained Mask R-CNN model, and return three things you can work with immediately: masks, bounding boxes, and class labels.
This setup makes the output easy to reason about and easy to visualize.
Instead of thinking in abstract tensors, you can treat the result like a set of detected objects, each with a shape and a name.

The main target of this kind of tutorial is building a clean inference pipeline.
That means putting the model into evaluation mode, preparing the input image in the format the model expects, and running a forward pass safely and consistently.
It also means setting a confidence threshold so you can keep only the detections you actually trust.
Once the thresholding is in place, you’re no longer overwhelmed by low-confidence guesses, and the results look much more professional.

At a high level, the pipeline has three layers.
First, load the pretrained model and move it to the best device available, usually a GPU when possible.
Second, convert the image into a tensor and run inference to get predictions, including scores, labels, boxes, and masks.
Third, turn those predictions into something visual and human-readable, like colored overlays and labeled rectangles.

The visualization step is where instance segmentation becomes immediately useful.
A mask is more informative than a bounding box because it follows the object boundary instead of drawing a rough rectangle around it.
When you blend a semi-transparent colored mask onto the original image, you can see both the object and its exact shape at the same time.
Adding a label from the COCO class list helps you understand what the model believes the object is, and saving the final result makes the workflow repeatable for more images later.
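
The confidence-threshold step described above can be sketched without loading the model. The dictionary below is a hypothetical stand-in for one image's prediction after it has been moved to NumPy; the keys mirror the scores, labels, boxes, and masks mentioned in the text, while the values and the helper name filter_by_score are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for one image's prediction after .cpu().numpy():
# three detections with scores, label IDs, boxes (x1, y1, x2, y2) and masks.
prediction = {
    "scores": np.array([0.98, 0.71, 0.22], dtype=np.float32),
    "labels": np.array([1, 3, 18]),
    "boxes": np.array([[10, 20, 110, 220],
                       [150, 40, 300, 160],
                       [5, 5, 30, 30]], dtype=np.float32),
    "masks": np.zeros((3, 240, 320), dtype=bool),
}

def filter_by_score(pred, threshold=0.5):
    """Keep only the detections whose score is above the threshold."""
    keep = pred["scores"] > threshold
    return {name: values[keep] for name, values in pred.items()}

confident = filter_by_score(prediction, threshold=0.5)
print(confident["labels"])        # [1 3]
print(confident["masks"].shape)   # (2, 240, 320)
```

Using one boolean filter for every field keeps masks, boxes, and labels aligned, and it returns empty arrays instead of failing when nothing passes the threshold.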

More ways to understand segmentation fast

Segment Anything Python — No-Training Image Masks
Great for comparing Mask R-CNN to a modern, no-training mask workflow. Helps you understand what “good masks” look like in practice.
One-Click Segment Anything in Python (SAM ViT-H)
Shows an interactive way to generate masks quickly. Useful if you want a different perspective on prompts versus full-model inference.
Python Image Segmentation Made Easy with OpenCV and K-means
A classic OpenCV baseline that builds intuition for segmentation before deep learning. Helpful for readers who want a simpler mental model first.


Running a Complete Mask R-CNN Pipeline in Python


This tutorial code is designed to help you go from a raw image to a fully visualized instance segmentation result using a pretrained Mask R-CNN model.
The main target of the code is not training or dataset preparation, but building a clean and reusable inference pipeline that works out of the box.
By focusing only on inference, the code stays practical, readable, and easy to adapt to different images and use cases.

At a high level, the code loads a Mask R-CNN model that was trained on the COCO dataset and prepares it for evaluation.
Once the model is ready, images can be provided either from a local file or directly from a URL.
The image is converted into the format expected by the model, sent through the network, and processed to extract masks, bounding boxes, confidence scores, and class labels.
This approach makes it easy to experiment with different inputs without changing the core logic.

A key goal of the code is turning raw model output into something visually meaningful.
Predicted masks are thresholded, converted into binary form, and overlaid onto the original image using randomly generated colors.
Bounding boxes and class labels are drawn on top of the image so each detected instance is clearly identified.
This step bridges the gap between deep learning output and human-friendly visualization.

The final stage of the pipeline focuses on usability and repeatability.
The segmented image is displayed using OpenCV and can also be saved directly to disk for later use.
Because all steps are wrapped into clear helper functions, the code can easily be reused in scripts, notebooks, or larger computer vision projects.
The result is a complete end-to-end Mask R-CNN inference workflow that demonstrates how instance segmentation can be applied in real Python applications.



Mask R-CNN is one of the most practical ways to get high-quality instance segmentation without training a model from scratch.
This mask rcnn tutorial focuses on inference.
That means you load a pretrained model, run it on any image, and visualize the results in a way that’s easy to understand.

The goal is simple.
Take an input image and return three things that matter in real projects: pixel masks, bounding boxes, and readable class labels.
Once you have those, you can overlay colored masks, draw rectangles, and save a final result you can reuse anywhere.

This approach is perfect for prototyping.
You can test ideas quickly, validate what the model can detect, and build a reusable pipeline that works on local images or URL images.
You also learn the core structure of an instance segmentation workflow: preprocessing, inference, post-processing, and visualization.

By the end of the code walkthrough, you will have a clean end-to-end script.
It installs the right libraries, loads the TorchVision Mask R-CNN model, runs predictions, overlays masks with OpenCV, and writes the output image to disk.
Everything is designed to be copy-paste friendly and easy to adapt.

Setting up Your Python Environment for Mask R-CNN

This part is all about creating a stable Python environment that matches the exact versions used in the tutorial.
When you work with PyTorch, TorchVision, and CUDA, small version mismatches can cause confusing errors.
A clean Conda environment keeps things predictable.

You are also installing OpenCV for visualization and saving results.
That matters because the code does more than inference.
It creates an output image with masks, boxes, and labels, then displays and saves it.

### Create a fresh Conda environment named Pytorch251 with Python 3.12 so the tutorial dependencies stay isolated.
conda create --name Pytorch251 python=3.12
### Activate the new environment so all installations apply to it.
conda activate Pytorch251

### Check your CUDA compiler version to confirm CUDA support on your machine.
nvcc --version

### Install PyTorch, TorchVision, and TorchAudio with CUDA 12.4 support so inference can run on GPU when available.
# Cuda 12.4
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

### Install OpenCV for image loading, drawing masks, showing windows, and saving results.
# install
pip install opencv-python==4.10.0.84
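
Since version mismatches are the main risk at this stage, a quick check from Python can help. The following is a minimal sketch using only the standard library; the helper names installed_version and check_environment are hypothetical, and it assumes the packages register their metadata under the names torch, torchvision, and opencv-python, which is usually but not always the case for Conda builds.

```python
from importlib import metadata

# Versions pinned by the install commands in this section.
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "opencv-python": "4.10.0.84",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_environment(expected=EXPECTED):
    """Map each package to (installed_version, matches_expected)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        # Some builds append a local tag such as "+cu124"; compare without it.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (found, matches)
    return report

for package, (found, ok) in check_environment().items():
    print(f"{package}: {found or 'not installed'} ({'OK' if ok else 'check'})")
```

Run it inside the activated environment; any line marked "check" points to a package worth reinstalling before moving on.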


Loading the Pre-Trained Mask R-CNN Model and ResNet Backbone

This code block is the setup step that makes the whole Mask R-CNN pipeline possible.
It loads everything you need to run instance segmentation inference in Python, without training a model or preparing a dataset.
Once this part is done, the model is ready to accept an image tensor and output masks, boxes, labels, and confidence scores.

The first goal here is importing the right libraries.
PyTorch handles tensors and device placement, while TorchVision provides the pretrained Mask R-CNN architecture.
By using the official MaskRCNN_ResNet50_FPN_Weights enum, you make sure the correct pretrained weights and configuration are loaded safely and consistently.

Next, the code selects the default pretrained weights and creates the Mask R-CNN model with those weights.
This specific model uses a ResNet-50 backbone with a Feature Pyramid Network (FPN), which is a strong and widely used baseline for instance segmentation.
The key benefit is that you get a model that already “knows” many common objects from COCO, so you can focus on inference and visualization.

Finally, the code puts the model into evaluation mode and moves it to the GPU if possible.
model.eval() is important because it switches off training-only behavior, and in TorchVision's detection models it also makes the forward pass return predictions instead of training losses.
Then, checking torch.cuda.is_available() lets the same script work on both GPU and CPU machines while automatically taking advantage of faster CUDA inference when it exists.

### Import PyTorch for tensors, device handling, and running inference.
import torch 
### Import TorchVision for pretrained detection and segmentation models.
import torchvision
### Import the official Mask R-CNN weights enum so we can load pretrained weights safely.
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_Weights


### Choose the default pretrained weights for Mask R-CNN ResNet50-FPN.
weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
### Load the pretrained Mask R-CNN model from TorchVision using the selected weights.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=weights)

### Switch the model to evaluation mode so it behaves correctly for inference.
model.eval()  # Set the model to evaluation mode
### If CUDA is available, move the model to GPU for faster predictions.
if torch.cuda.is_available():
    model = model.to('cuda')  # Move the model to GPU if available


The environment is now ready and reproducible.
The model is loaded in inference mode and will use GPU if possible.

Why we do this: Before feeding our image into the Mask R-CNN architecture, we must convert it into a float tensor with pixel values scaled to the 0-1 range, which is what T.ToTensor() does. TorchVision's Mask R-CNN then resizes and normalizes the image internally (by default with the ImageNet mean and standard deviation), so you should not add a separate normalization step yourself. Passing raw 0-255 values, or normalizing a second time, feeds the backbone inputs it was not trained on, which typically leads to missed detections or inaccurate segmentation masks.

Image Preprocessing: Tensors and COCO Dataset Normalization

This part prepares the label map so your results have readable class names.
The pretrained model predicts numeric label IDs, and the COCO label list converts those IDs into names like person, car, and dog.
That makes your visual output immediately understandable.

You also build the core inference function that takes an image and returns masks, bounding boxes, and class labels.
The function supports both local paths and URLs.
It applies the correct tensor transform, runs the model, filters by confidence threshold, and returns only the detections you want.
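
The tensor conversion in this section can be illustrated with plain NumPy. This is a sketch of the arithmetic only, not the tutorial's code: to_tensor_like is a hypothetical helper that mimics what T.ToTensor() does, and the mean and standard deviation are the ImageNet values that TorchVision's detection models are understood to apply internally by default, shown here purely to make the scaling visible.

```python
import numpy as np

# Hypothetical 2x2 RGB image in the uint8 HWC layout that PIL and OpenCV use.
image_hwc = np.array([[[255, 0, 0], [0, 255, 0]],
                      [[0, 0, 255], [128, 128, 128]]], dtype=np.uint8)

def to_tensor_like(image):
    """Mimic T.ToTensor(): HWC uint8 in [0, 255] -> CHW float32 in [0, 1]."""
    return image.astype(np.float32).transpose(2, 0, 1) / 255.0

# ImageNet statistics, reshaped so they broadcast over the H and W axes.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

chw = to_tensor_like(image_hwc)
normalized = (chw - MEAN) / STD

print(chw.shape)                            # (3, 2, 2)
print(float(chw.min()), float(chw.max()))   # 0.0 1.0
```

In the tutorial's pipeline only the first step is your responsibility: hand the model a 0-1 float tensor and let it handle resizing and normalization.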

### Define the full COCO instance category list so predicted label IDs can be converted into readable class names.
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush' ]

### Print the number of classes to confirm the label list is loaded correctly.
print(len(COCO_INSTANCE_CATEGORY_NAMES))  # 91 entries, including the 'N/A' placeholders

### Import PIL Image for loading images in a TorchVision-friendly format.
from PIL import Image
### Import TorchVision transforms for converting PIL images into tensors.
from torchvision import transforms as T
### Import NumPy for array operations when converting masks and handling image data.
import numpy as np
### Import requests to download images from URLs when url=True.
import requests
### Import BytesIO to convert downloaded bytes into a file-like object for PIL.
from io import BytesIO

### Define a helper function that runs Mask R-CNN and returns masks, boxes, and class labels above a threshold.
def get_prediction(img_path, threshold=0.5, url=False):
    ### If url=True, download the image and open it using PIL.
    if url:
        response = requests.get(img_path, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        img = Image.open(BytesIO(response.content))
    ### Otherwise, load the image from the local filesystem.
    else:
        img = Image.open(img_path)

    ### Force 3-channel RGB so grayscale or RGBA files do not break the model.
    img = img.convert('RGB')
    ### Convert the image into a float tensor scaled to the 0-1 range.
    img = T.ToTensor()(img)
    ### Move the tensor to the same device as the model, so the code also works on CPU-only machines.
    device = next(model.parameters()).device
    img = img.to(device)
    ### Run inference without tracking gradients and take the result for our single image.
    with torch.no_grad():
        pred = model([img])[0]

    ### Build a boolean filter that keeps detections scoring above the threshold.
    ### Unlike indexing the last passing score, this also works when nothing passes.
    keep = pred['scores'] > threshold
    ### Threshold the soft masks at 0.5 and drop only the channel axis, so the shape stays (N, H, W).
    masks = (pred['masks'][keep] > 0.5).squeeze(1).cpu().numpy()
    ### Convert predicted label IDs into readable COCO class names.
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in pred['labels'][keep].cpu().tolist()]
    ### Convert predicted bounding boxes into coordinate pairs for drawing rectangles later.
    pred_boxes = [[(b[0], b[1]), (b[2], b[3])] for b in pred['boxes'][keep].cpu().numpy()]

    ### Return the filtered masks, boxes, and class labels.
    return masks, pred_boxes, pred_class


You now have a reusable prediction function.
It supports local images and URL images and returns clean, filtered outputs for visualization.
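
One detail of the mask post-processing is worth seeing in isolation. The model's masks come back as per-pixel probabilities in a four-dimensional layout (N, 1, H, W), and they only become binary after thresholding. The sketch below uses a made-up NumPy array for a single detection to show the thresholding and why squeezing every size-1 axis can silently drop the instance axis.

```python
import numpy as np

# Hypothetical soft masks for ONE detection, shaped (N, 1, H, W):
# each value is a per-pixel probability of belonging to the object.
soft_masks = np.array([[[[0.9, 0.6, 0.1],
                         [0.7, 0.4, 0.0]]]], dtype=np.float32)

# Thresholding at 0.5 turns probabilities into a boolean mask.
binary = soft_masks > 0.5

# squeeze() with no axis drops every size-1 dimension, including N when
# there is a single detection, so the instance axis disappears.
print(binary.squeeze().shape)    # (2, 3)

# Squeezing only the channel axis keeps the (N, H, W) layout.
masks = binary.squeeze(axis=1)
print(masks.shape)               # (1, 2, 3)
print(int(masks[0].sum()))       # 3
```

Keeping the (N, H, W) layout means a loop over masks always iterates over instances, whether the image contains one object or twenty.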

If you want to go beyond pretrained inference

Detectron2 Custom Dataset Training Made Easy
Perfect next step if you want to train your own Mask R-CNN on custom objects. It connects naturally to the idea of masks and instance segmentation.
Make Instance Segmentation Easy with Detectron2
Shows another popular framework approach to instance segmentation. Useful for comparing TorchVision inference to Detectron2 workflows.
How to Use Grounding DINO with Segment Anything Tutorial
Introduces a detect-then-segment pipeline. Great for readers who want text-guided detection plus precise masks.

Turning masks into a clean OpenCV overlay

This part is where the results become visually meaningful.
Raw masks are just arrays, but with color overlays you can immediately see object shapes and boundaries.
This is the moment where instance segmentation feels “real” because you can compare the mask to the pixels underneath it.

The code also adds a simple color strategy and a helper for URL-to-OpenCV conversion.
That keeps the pipeline flexible.
Whether your image comes from disk or the internet, you end up with the same OpenCV-ready image that can be drawn on and saved.

### Import Matplotlib in case you want additional plotting or debugging visuals.
import matplotlib.pyplot as plt
### Import OpenCV for reading images, drawing boxes, blending masks, and saving output.
import cv2 
### Import random so each instance mask can receive a different color.
import random 


### Import urlopen to fetch raw bytes from an image URL when using OpenCV decoding.
from urllib.request import urlopen

### Define a helper that downloads an image from a URL and decodes it into an OpenCV array.
def url_to_image(url , readFlag=cv2.IMREAD_COLOR):
    ### Open the URL stream and read the bytes.
    resp = urlopen(url)
    ### Convert bytes into a NumPy array.
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    ### Decode the image bytes into an OpenCV image matrix.
    image = cv2.imdecode(image, readFlag)  # Decode the image from the URL
    ### Return the OpenCV image.
    return image  # Return the decoded image

### Create a colored mask image by assigning a randomly chosen palette color to mask pixels.
def random_color_masks(image):

    ### Define a palette of preset colors so masks look consistent and readable.
    colors = [[0, 255, 0], [0, 0, 255], [255, 0, 0], [0, 255, 255], [255, 255, 0], [255, 0, 255],
              [80, 70, 180], [250, 80, 190], [245, 145, 50], [70, 150, 250], [50, 190, 190]]

    ### Create empty channels for building a colored overlay.
    r = np.zeros_like(image, dtype=np.uint8)  # Red channel for the mask
    g = np.zeros_like(image, dtype=np.uint8)  # Green channel for the mask
    b = np.zeros_like(image, dtype=np.uint8)  # Blue channel for the mask

    ### Paint mask pixels with a color picked from the whole palette.
    ### random.choice avoids the off-by-one of randrange(0, 10), which never selected the last color.
    r[image == 1], g[image == 1], b[image == 1] = random.choice(colors)
    ### Stack channels into a 3-channel color mask.
    colored_mask = np.stack([b, g, r], axis=2)
    ### Return the colored mask overlay.
    return colored_mask


### Build the full instance segmentation pipeline that runs inference and draws overlays.
def instance_segmentation(img_path , threshold=0.6, rect_th=1, text_size=1, text_th=1, url=False):
    ### Run the model and get masks, boxes, and class labels.
    masks , boxes , pred_cls = get_prediction(img_path, threshold=threshold, url=url)  # Get the predictions

    ### Load the image using a URL loader when url=True.
    if url:
        img = url_to_image(img_path)  # Load the image from the URL
    ### Otherwise read from disk using OpenCV.
    else:
        img = cv2.imread(img_path)  # Read the image from the local path

    ### Convert BGR to RGB so the visualization looks correct.
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert the image to RGB format

    ### Loop through each detected instance and draw its mask, box, and label.
    for i in range(len(masks)):
        ### Convert the i-th mask into a random colored overlay.
        rgb_mask = random_color_masks(masks[i])  # Get a random color mask for the instance
        ### Blend the colored mask with the original image.
        img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)  # Overlay the mask on the image
        ### Convert predicted box coordinates into integer tuples for OpenCV.
        pt1 = tuple(int(x) for x in boxes[i][0])  # Get the top-left corner of the bounding box
        ### Convert predicted box coordinates into integer tuples for OpenCV.
        pt2 = tuple(int(x) for x in boxes[i][1])  # Get the bottom-right corner of the bounding box
        ### Draw the rectangle around the detected instance.
        cv2.rectangle(img, pt1, pt2, (0, 255, 0), thickness=rect_th)  # Draw the bounding box on the image
        ### Put the class name text near the top-left of the box.
        cv2.putText(img, pred_cls[i], pt1, cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)  # Add the class label to the image

    ### Return the final image plus the class list and the last mask (None when nothing was detected).
    last_mask = masks[-1] if len(masks) > 0 else None
    return img, pred_cls, last_mask


You now have a full visualization pipeline.
It overlays masks, draws bounding boxes, and prints readable labels directly on the image.

More segmentation workflows to compare with Mask R-CNN

How to segment multiple objects with YOLO Python
Shows a modern segmentation workflow using YOLO segmentation models. Useful for comparing speed and output style to Mask R-CNN.
Instance Segmentation Python Tutorial Using YOLO Models In Videos
Extends segmentation to video workflows. Helpful if readers want to apply masks frame-by-frame beyond single images.
Boost Your Dataset with YOLOv8 Auto-Label Segmentation
Shows how segmentation masks can be used to create labeled datasets automatically. A practical bridge from inference to real training projects.

Running Inference: Extracting Bounding Boxes and Masks

This final part shows the “end of the pipeline” behavior.
You point the script at an image path, run instance segmentation, and display the result in an OpenCV window.
This is the fastest way to confirm everything is working.

It also saves the final output image to disk.
Saving matters because it turns the tutorial into a reusable tool.
Once you are happy with results, you can run the same pipeline on many images and automatically generate segmented outputs.

Here is the test image:

Mask R-CNN Tutorial Test Image

### Set the input image path you want to segment.
# Run the segmentation
img_path = "Best-Semantic-Segmentation-models/Mask RCNN/the-last-of-us.jpg"
### Run the instance segmentation pipeline and control rectangle and text thickness for visibility.
img , pred_classes , last_mask = instance_segmentation(img_path, rect_th=5 , text_th=4)
### Convert RGB back to BGR because OpenCV expects BGR when showing and saving images.
img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)  # Convert the image back to BGR format
### Display the final instance segmentation result.
cv2.imshow("Instance Segmentation", img)  # Display the segmented image
### Wait for a key press so the window stays open.
cv2.waitKey(0)  # Wait for a key press
### Close all OpenCV windows cleanly.
cv2.destroyAllWindows()  # Close all OpenCV windows
### Save the final segmented output image to disk.
# Save the segmented image
cv2.imwrite("d:/temp/instance_segmentation_result.jpg", img)  # Save the segmented image to disk
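The earlier remark about running the same pipeline on many images can be sketched as a small folder loop. This is a minimal sketch, not part of the original script: `segment_folder`, `segment_fn`, and `save_fn` are hypothetical names, and the `_segmented.jpg` suffix is an arbitrary choice. The segmentation and saving steps are passed in as callables so the loop itself stays independent of the model.

```python
from pathlib import Path


def segment_folder(input_dir, output_dir, segment_fn, save_fn,
                   extensions=(".jpg", ".jpeg", ".png")):
    """Run segment_fn on every image in input_dir and save each result.

    segment_fn(path_str) -> result and save_fn(path_str, result) are
    hypothetical hooks supplied by the caller.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # Create the output folder up front so saving does not fail silently.
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(input_dir.iterdir()):
        # Skip anything that is not an image file we recognize.
        if path.suffix.lower() not in extensions:
            continue
        result = segment_fn(str(path))
        out_path = output_dir / f"{path.stem}_segmented.jpg"
        save_fn(str(out_path), result)
        written.append(out_path)
    return written


# Possible wiring with this tutorial's functions (assumes they are defined):
# segment_folder(
#     "input_images", "output_images",
#     segment_fn=lambda p: cv2.cvtColor(instance_segmentation(p)[0], cv2.COLOR_RGB2BGR),
#     save_fn=cv2.imwrite,
# )
```

The commented call converts the returned image back to BGR before saving, because `instance_segmentation` hands back an RGB image and `cv2.imwrite` expects BGR.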


At this point, the script is fully end-to-end.
You can load an image, run Mask R-CNN inference, visualize instance masks, and save a clean result.

Pro-Tip on Mask Tensors: The output from the model isn’t just a simple image. TorchVision returns one dictionary per input image, containing bounding box coordinates, class labels, confidence scores, and raw mask tensors. Keep in mind that these raw mask outputs are soft probabilities (values between 0.0 and 1.0). To draw a crisp, clear outline using OpenCV, you must apply a binary threshold (typically 0.5) to convert these probabilities into a hard boolean mask before overlaying them on your original image.

Here is the result:

Mask R-CNN Tutorial Result
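The Pro-Tip on mask tensors can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: `binarize_masks` is a hypothetical helper, the probabilities are made up, and the input is assumed to have the `(N, 1, H, W)` layout that TorchVision's Mask R-CNN uses for its `masks` output, already converted to a NumPy array.

```python
import numpy as np


def binarize_masks(soft_masks, mask_threshold=0.5):
    """Turn soft masks of shape (N, 1, H, W) into boolean masks of shape (N, H, W)."""
    soft_masks = np.asarray(soft_masks, dtype=np.float32)
    # Index the channel axis instead of calling squeeze(), so a single
    # detection (N == 1) keeps its leading instance axis.
    return soft_masks[:, 0] > mask_threshold


# Made-up stand-in for a model output: 2 instances on a 2x3 image.
soft = np.array([
    [[[0.9, 0.2, 0.6], [0.1, 0.5, 0.8]]],
    [[[0.4, 0.7, 0.3], [0.95, 0.05, 0.51]]],
], dtype=np.float32)

hard = binarize_masks(soft)
print(hard.shape)  # (2, 2, 3)
print(hard[0])     # boolean mask for the first instance
```

A pixel with probability exactly 0.5 stays off here because the comparison is strict; whether to use `>` or `>=` is a small design choice that rarely matters in practice.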

Quick wrap-up

This mask rcnn tutorial code demonstrates a practical instance segmentation workflow that you can reuse immediately.
You load a pretrained TorchVision Mask R-CNN model, run inference on any image, overlay colored masks with OpenCV, and save the final output.

Expert Insight for Optimization: While this script works well for single-image inference, computational overhead becomes a bottleneck when applying Mask R-CNN to real-time video streams. If you plan to process video, wrap your inference loop in a torch.no_grad() block. This stops PyTorch from recording the computational graph needed for backpropagation, which lowers memory use and can improve your Frames Per Second (FPS).

FAQ – R‑CNN Tutorial

What is instance segmentation in Mask R-CNN?

Instance segmentation predicts a separate pixel mask for every object instance. Mask R-CNN outputs both the mask and the bounding box for each detection.

Do I need training data for this mask rcnn tutorial?

No. This tutorial uses a pretrained TorchVision model and runs inference only.

Why does the code use COCO class names?

The model outputs label IDs. The COCO list converts those IDs into readable names for drawing text on the image.

What does the threshold parameter actually do?

It filters detections by confidence score. Raising the threshold reduces false positives but can also drop real objects that score low, such as small or partially hidden ones.
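To make the effect of the threshold concrete, here is a small NumPy sketch. It illustrates the idea only and is not the tutorial's `get_prediction` code: `filter_by_score` is a hypothetical helper, and the boxes, labels, and scores are invented.

```python
import numpy as np


def filter_by_score(boxes, labels, scores, threshold=0.6):
    """Keep only detections whose confidence score is above the threshold."""
    scores = np.asarray(scores, dtype=np.float32)
    keep = scores > threshold  # boolean mask, one entry per detection
    kept_boxes = np.asarray(boxes)[keep]
    kept_labels = [label for label, k in zip(labels, keep) if k]
    return kept_boxes, kept_labels, scores[keep]


# Invented detections in (x1, y1, x2, y2) format with their scores.
boxes = [[10, 20, 110, 220], [50, 60, 90, 120], [5, 5, 30, 40], [200, 80, 260, 150]]
labels = ["person", "dog", "person", "backpack"]
scores = [0.98, 0.75, 0.55, 0.30]

for t in (0.6, 0.9):
    _, kept, _ = filter_by_score(boxes, labels, scores, threshold=t)
    print(t, kept)
# 0.6 ['person', 'dog']
# 0.9 ['person']
```

Moving from 0.6 to 0.9 removes the dog even though it may be a correct detection, which is the trade-off described above.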

Why is the mask threshold set to 0.5?

Mask R-CNN predicts probabilities per pixel. A 0.5 cutoff is a common default to convert probabilities into a binary mask.

Why does the code move the image tensor to CUDA?

Inference is much faster on GPU. The model and the input tensor must be on the same device to avoid runtime errors.

Why does OpenCV color look wrong without conversion?

OpenCV loads images as BGR by default. Converting to RGB keeps colors correct when blending masks and displaying output.

How can I improve readability of boxes and labels?

Increase rect_th and text_th to make lines bolder. You can also increase text_size for higher-resolution images.

Can I run this mask rcnn tutorial on CPU only?

Yes, but you must avoid calling .cuda() on the image tensor. For CPU-only runs, keep both the model and tensors on CPU.
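One way to avoid hard-coded `.cuda()` calls is to pick the device once and move both the model and the inputs to it. The sketch below also wraps the loop in `torch.no_grad()`, as suggested in the wrap-up. It rests on loud assumptions: the `Conv2d` layer is only a stand-in for the Mask R-CNN model so the example runs without downloading weights, and the random tensors stand in for video frames.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the Mask R-CNN model (not the real network).
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).to(device)
model.eval()

# Random tensors standing in for three video frames of size 64x64.
frames = [torch.rand(3, 64, 64) for _ in range(3)]

outputs = []
with torch.no_grad():  # no autograd graph is recorded inside this block
    for frame in frames:
        batch = frame.unsqueeze(0).to(device)  # same device as the model
        out = model(batch)
        outputs.append(out.cpu())  # bring results back for NumPy/OpenCV use

print(outputs[0].requires_grad)  # False
print(outputs[0].shape)          # torch.Size([1, 1, 64, 64])
```

With the real model, the same pattern would apply: call `.to(device)` on the model once, and on each image tensor before inference.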

What is the easiest way to test URL images?

Set url=True and pass a direct image URL into the function. The helper downloads and decodes the image before visualization.

Conclusion

This mask rcnn tutorial gives you a complete instance segmentation pipeline that is ready to reuse.
Instead of spending time on training and datasets, you focus on what most people need first: reliable inference and clean visualization.
You load a pretrained TorchVision Mask R-CNN model, run it on any image, and turn predictions into masks, boxes, and readable labels.

The real value is the structure.
The code separates responsibilities into clear helper functions so you can swap inputs, tune thresholds, and customize overlays without rewriting everything.
Once you understand this baseline, you can extend it to batch processing, video frames, custom labeling, or more advanced post-processing.
This is the kind of foundation that makes instance segmentation feel practical, not abstract.

Connect:
☕ Buy me a coffee: https://ko-fi.com/eranfeit
🖥️ Email: feitgemel@gmail.com
🌐 https://eranfeit.net
🤝 Fiverr: https://www.fiverr.com/s/mB3Pbb

Enjoy,
Eran


Copyright © 2026 Eran Feit
