Hair segmentation using Transformers | UNETR Image Segmentation

Last Updated on 28/04/2026 by Eran Feit

Precise hair segmentation remains one of the most challenging tasks in computer vision due to the fine, irregular boundaries and varying textures of human hair. While traditional CNNs like U-Net excel at local feature extraction, they often struggle to capture the global context needed to handle complex occlusions and long-range dependencies. In this guide, you will master hair segmentation using UNETR transformers in Python. By leveraging the power of Vision Transformers (ViT) within an encoder-decoder framework, we will tackle the problem of boundary blurring, allowing you to generate high-fidelity semantic masks for augmented reality or portrait editing applications.

At its core, UNETR (short for U-Net Transformers) bridges two influential ideas: the hierarchical representation learning of CNNs and the self-attention mechanism of transformers. In semantic segmentation, the goal is to assign a class label to every pixel in an image. This requires both precise local feature extraction, to resolve edges and textures, and a broad understanding of the image layout, to distinguish between objects. UNETR tackles this by encoding patches of the input image into a sequence of embeddings that the transformer layers can process, preserving spatial information while learning intricate patterns.
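To make the patch-embedding step concrete, here is a minimal NumPy sketch of how an image becomes a sequence of embeddings for the transformer encoder. This is an illustration, not the actual UNETR code: the patch size, embedding dimension, and random projection matrices are hypothetical stand-ins for the learned weights in a real model.

```python
import numpy as np

patch_size = 16   # side length of each square patch (hypothetical choice)
embed_dim = 64    # embedding dimensionality (hypothetical choice)

# A dummy 256x256 RGB image with shape (H, W, C)
image = np.random.rand(256, 256, 3)
h, w, c = image.shape
num_patches = (h // patch_size) * (w // patch_size)  # 16 * 16 = 256

# 1) Cut the image into non-overlapping patches and flatten each one.
patches = image.reshape(h // patch_size, patch_size,
                        w // patch_size, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)
# patches.shape == (256, 16*16*3) == (256, 768)

# 2) Linearly project each flattened patch to the embedding dimension.
#    In the real model this projection is a learned weight matrix.
projection = np.random.rand(patch_size * patch_size * c, embed_dim)
embeddings = patches @ projection            # shape (256, 64)

# 3) Add positional embeddings so the spatial layout is preserved;
#    these are also learned parameters in the real model.
positions = np.random.rand(num_patches, embed_dim)
tokens = embeddings + positions              # transformer input sequence

print(tokens.shape)  # (256, 64)
```

The resulting `tokens` array is the sequence the transformer layers attend over; because each token corresponds to a fixed image patch and carries a positional embedding, the decoder can later reshape the sequence back into a spatial feature map for pixel-wise prediction.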