Last Updated on 02/01/2026 by Eran Feit
🧠 Introduction — FasterViT Image Classification Using Custom Dataset
FasterViT image classification using custom dataset represents a modern, efficient approach to training deep learning models that can recognize and categorize images from your own tailored collection of visual data. In a world where off-the-shelf datasets often don’t match specific application needs, applying models like FasterViT to a custom dataset — such as a curated set of Star Wars character images — allows developers to build computer vision systems that reflect real-world use cases and unique classification requirements. This customization is particularly powerful for niche tasks that require specialized image recognition beyond generic object categories, making FasterViT an exciting choice for advanced computer vision practitioners.
At its core, FasterViT combines the strengths of traditional convolutional neural networks (CNNs), which excel at capturing detailed local features, with the global context modeling of vision transformers. This hybrid architecture enables the model to learn both fine-grained texture patterns and broad relational information across an image, leading to robust feature representations. When applied to a custom dataset, better feature extraction and classification accuracy can be achieved compared to using either a pure CNN or a pure transformer model alone.
Training FasterViT on a custom dataset introduces challenges and opportunities. Developers must carefully prepare the data — organizing training, validation, and testing splits, and ensuring proper image preprocessing — so the model can effectively generalize from limited or imbalanced samples. With proper dataset preparation and training strategies, custom FasterViT models can outperform many conventional deep learning classifiers, especially in tasks that require distinguishing fine differences between similar image classes.
Finally, integrating FasterViT into real-world applications demonstrates its practical value. Whether building a character recognizer from a Star Wars image set, designing a wildlife species classifier, or developing a custom industrial defect detector, FasterViT’s ability to leverage custom datasets makes it a highly adaptable tool in the computer vision ecosystem. As deep learning continues to evolve, models like FasterViT that blend efficiency with performance are key for developers who need both speed and accuracy in specialized image classification tasks.
🌟 What Is FasterViT Image Classification and Why Custom Data Matters?
FasterViT image classification is a deep learning approach that blends convolutional neural networks with vision transformer elements to effectively classify images. Unlike traditional models that focus purely on either local patterns (CNNs) or global relationships (transformers), FasterViT uses a hybrid architecture that takes advantage of the best of both worlds. This makes it especially suitable for image classification tasks where details matter and large-scale context influences predictions, such as distinguishing between visually similar characters or objects.
When working with custom datasets — like a unique Star Wars image collection — conventional models often fall short due to limited examples or insufficient context. Custom datasets reflect real-world problems these models need to solve, such as identifying specific character traits or unusual visual features unseen in benchmark datasets. Training FasterViT on these tailored images gives the model exposure to exactly the visual domains it will encounter during inference, improving accuracy and robustness.
Moreover, FasterViT’s hierarchical attention mechanism enables efficient processing of visual features at different scales. Early layers might focus on small, local details in images — edges, textures, or character accessories — while later transformer-like layers capture broader patterns or global relationships across the entire image. This layered attention system supports the model’s ability to generalize from a custom dataset to unseen test images, making it well-suited for specialized vision tasks that vary significantly from popular benchmark collections.
In practical application, this means that when you feed FasterViT a custom dataset of labeled images, the model can learn both subtle and global cues needed to make reliable classifications. Whether your goal is to recognize specific individuals, object categories, or nuanced visual differences dictated by your dataset’s uniqueness, FasterViT’s flexible architecture enables innovative solutions. By training this model on your curated images, you customize the learning process — and the resulting classifier becomes tailored, higher performing, and more relevant to your specific needs.
The FasterViT architecture

The FasterViT architecture is built as a hybrid model that combines the strengths of convolutional neural networks and vision transformers in a single unified pipeline. The network begins with traditional convolutional layers, which downsample the input image and extract low-level visual features such as edges, textures, and simple spatial patterns. These convolution stages are efficient and computationally light, making them ideal for early processing where fine-grained image structure matters most. By progressively reducing the spatial resolution while increasing channel depth, the model prepares compact yet informative feature maps for the next stages.
After the initial convolution blocks, FasterViT transitions into deeper stages that incorporate hierarchical attention. This is where the transformer-based components come into play. Hierarchical attention allows the model to capture long-range dependencies in the image — understanding how different regions relate to one another, even when they are far apart spatially. This global reasoning is what makes transformer-based architectures particularly powerful for image understanding, as the model is no longer limited to only local receptive fields like a CNN. FasterViT carefully balances attention computation so it remains efficient while still modeling complex contextual relationships.
In the later stages, the model continues to alternate between downsampling and attention-driven processing, building increasingly abstract feature representations. By the time the data reaches the classification head, the network has learned both detailed local information and high-level contextual structure. This combination enables FasterViT to achieve strong accuracy on image classification tasks while remaining computationally efficient compared to pure transformer models. The architecture is therefore especially useful for real-world applications where both performance and speed matter, making it a compelling evolution in the vision transformer family.

Building a Practical FasterViT Image Classification Tutorial with a Custom Dataset
This tutorial walks through the full process of using FasterViT image classification on a custom dataset, showing how to install the required libraries, prepare the dataset, train the model, and finally test it on new images. The goal is to give you a complete, working pipeline so you can adapt the same approach to any dataset you choose — whether that’s animals, products, vehicles, or any type of labeled image collection. Everything is designed to be hands-on and code-driven so you can follow along step-by-step.
The core idea behind the tutorial is to demonstrate how FasterViT — a hybrid model that combines convolutional layers with transformer-based attention — can be trained on real-world images rather than relying only on standard benchmark datasets. You’ll see how the dataset is split into training, validation, and test sets, how transformations are applied, and how the model learns to identify each class. By the end, you’ll understand not only how the code works, but also why each stage of the workflow is important.
Another key part of the tutorial is modifying the model’s final classification head to match the number of classes in your dataset. This ensures FasterViT can correctly predict the labels you define. The training loop also tracks accuracy and loss over time, so you can monitor how well the model is learning and automatically keep the best-performing weights. This makes the process robust and beginner-friendly, while still being powerful enough for advanced experimentation.
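The head swap itself is a one-liner. Here is a minimal sketch using a hypothetical five-class dataset; the full training script later in this post does the same thing with len(class_names):

import torch
from fastervit import create_model

model = create_model('faster_vit_0_224', pretrained=False)

# Replace the default classifier with a new Linear layer sized for your dataset
num_classes = 5  # hypothetical: use len(class_names) from your own dataset
model.head = torch.nn.Linear(model.head.in_features, num_classes)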
Finally, the tutorial shows how to load your trained model and run predictions on new images. You’ll see how to preprocess test images, send them through the model, and display the predicted class — even overlaying the result on the image itself. This allows you to move from raw data → trained model → working classifier, creating a complete solution that you can reuse and scale for future projects.
Link to the video tutorial : https://youtu.be/n-SpVoHrzDQ
You can download the code for the tutorial here : https://eranfeit.lemonsqueezy.com/checkout/buy/a6159108-c66c-4e21-80e0-7a6589f0b8b0 or here : https://ko-fi.com/s/28ca45253c
Link to the post for Medium users : https://medium.com/vision-transformers-tutorials/fastervit-image-classification-using-custom-dataset-star-wars-dataset-8e6ce470d566
You can follow my blog here : https://eranfeit.net/blog/
Want to get started with Computer Vision or take your skills to the next level?
Great Interactive Course : “Deep Learning for Images with PyTorch” here : https://datacamp.pxf.io/zxWxnm
If you’re just beginning, I recommend this step-by-step course designed to introduce you to the foundations of Computer Vision – Complete Computer Vision Bootcamp With PyTorch & TensorFlow
If you’re already experienced and looking for more advanced techniques, check out this deep-dive course – Modern Computer Vision GPT, PyTorch, Keras, OpenCV4
FasterViT Image Classification Using Custom Dataset
FasterViT is a powerful hybrid model that combines the feature extraction strengths of convolutional neural networks with the global contextual understanding of vision transformers.
In this post, you’ll build a complete image classification pipeline using FasterViT on your own custom dataset — the Star Wars characters dataset.
Each part breaks down the code into digestible steps with clear explanations so you can follow along and adapt it to any dataset.
The goal is practical mastery: from environment setup and dataset preparation through training and testing, you will walk through a real-world workflow from top to bottom.
Think of this as a roadmap you can reuse to train FasterViT models on any custom image classification task, using PyTorch and the fastervit library.
Setting Up the Environment
Before training FasterViT on your custom dataset, your system must be ready.
This section creates a dedicated Conda environment and installs PyTorch with GPU support, along with FasterVit and necessary Python packages.
Isolating dependencies in a new environment prevents version conflicts and ensures compatibility for deep learning workflows.
You’ll also check the CUDA version to ensure GPU acceleration is available, which significantly speeds up training.
The specific package versions used here are selected for stability and reproducibility so you can train efficiently without unexpected errors.
# Create a new Conda environment named "fasterVit" with Python 3.11
conda create -n fasterVit python=3.11

# Activate the newly created "fasterVit" environment
conda activate fasterVit

# Check the installed CUDA version on your system
nvcc --version

# Install PyTorch 2.5.0, Torchvision, Torchaudio, and CUDA 12.4 support from the PyTorch and NVIDIA channels
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install the FasterVit library version 0.9.8 from PyPI
pip install fastervit==0.9.8

# Install timm (PyTorch Image Models) version 0.9.12 for model utilities and backbones
pip install timm==0.9.12

# Install matplotlib for plotting and visualization
pip install matplotlib

# Install OpenCV Python bindings
pip install opencv-python==4.10.0.84

This environment gives you all the tools needed to implement FasterViT training and testing without friction.
Now, you can focus on the dataset and model logic rather than debugging library conflicts.
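Before moving on, a quick sanity check from Python confirms that PyTorch was built with CUDA support and can see your GPU (a minimal sketch; the device name will differ on your machine):

import torch

# Verify the installed PyTorch version and CUDA availability
print(torch.__version__)             # should report 2.5.0
print(torch.cuda.is_available())     # True if the CUDA build found a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU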
Downloading and Understanding the Custom Dataset
Before we start training the FasterViT image classification model, we first need a labeled image dataset. In this tutorial, we are working with a collection of character images that are already grouped into folders by class. Each folder represents one category, and every image inside belongs to that specific label. This structure is very important because PyTorch relies on the folder layout to automatically map images to class names during training.
The dataset contains multiple characters, and each character has a number of image samples from different angles, lighting conditions, and backgrounds. This variation helps the model generalize and learn what truly defines each class, rather than memorizing a single image pattern. If your dataset includes clear, centered subjects and consistent labeling, you will normally achieve stronger and more reliable classification results.
When you download your dataset, extract it into a directory on your machine where you plan to work. In our case, the raw dataset is stored in a folder called:
D:/Data-Sets-Image-Classification/Star-Wars-Characters

Inside this folder, each sub-folder represents a different class. For example:

Star-Wars-Characters/
├── Class_1/
├── Class_2/
├── Class_3/
├── ...

This means you do not need a CSV file or manual labels — the folder names themselves act as the class labels. Later, our script automatically splits these images into Train, Validation, and Test folders while preserving the class-based structure. If you ever decide to swap this dataset with your own images, just keep the same folder-per-class layout and the rest of the code will continue to work smoothly.
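Before splitting, it can also help to count how many images each class contains so you can spot imbalanced categories early. A minimal sketch, assuming the folder-per-class layout described above:

import os

# Root folder of the raw dataset (one sub-folder per class)
dataset_root = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters'

# Print the number of image files in each class folder
for class_name in sorted(os.listdir(dataset_root)):
    class_dir = os.path.join(dataset_root, class_name)
    if os.path.isdir(class_dir):
        num_images = len([f for f in os.listdir(class_dir)
                          if os.path.isfile(os.path.join(class_dir, f))])
        print(f'{class_name}: {num_images} images')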
Preparing Your Dataset for Image Classification
The next step is organizing your custom dataset into training, validation, and testing splits.
This allows the model to learn from training examples, tune performance on validation data, and evaluate generalization on the test set.
This code creates appropriate folder structures, randomly shuffles images, and distributes them into the right folders.
A balanced dataset structure ensures that training and validation samples represent all classes evenly.
# Dataset : https://www.kaggle.com/datasets/adamridene/star-wars-characters

# Import the os module for filesystem path and directory operations
import os
# Import shutil to copy files and manage file operations
import shutil
# Import random to shuffle image lists before splitting into sets
import random

# Define a helper function to create Train, Val, and Test folders for each category
def create_folders(base_path, categories):
    # Loop over every detected category name
    for category in categories:
        # Create the Train subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Train', category), exist_ok=True)
        # Create the Val subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Val', category), exist_ok=True)
        # Create the Test subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Test', category), exist_ok=True)

# Define a function to split data into train, validation, and test subsets
def split_data(source_folder, dest_folder, train_ratio=0.7, validate_ratio=0.2):
    # Get the list of subfolders (categories) inside the source folder
    categories = [d for d in os.listdir(source_folder) if os.path.isdir(os.path.join(source_folder, d))]
    # Ensure all required destination folders (Train/Val/Test per category) exist
    create_folders(dest_folder, categories)
    # Iterate over each category to process its images
    for category in categories:
        # Build the full path to the current category directory
        category_path = os.path.join(source_folder, category)
        # List all image files within this category directory
        images = [f for f in os.listdir(category_path) if os.path.isfile(os.path.join(category_path, f))]
        # Randomly shuffle the images to avoid ordering bias
        random.shuffle(images)
        # Calculate the index at which the training subset ends
        train_split = int(len(images) * train_ratio)
        # Calculate the index at which the validation subset ends
        validate_split = int(len(images) * (train_ratio + validate_ratio))
        # Select the training images based on the first split
        train_images = images[:train_split]
        # Select the validation images between train_split and validate_split
        validate_images = images[train_split:validate_split]
        # The remaining images belong to the test set
        test_images = images[validate_split:]
        # Copy each training image into the corresponding Train/category folder
        for image in train_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Train', category, image))
        # Copy each validation image into the corresponding Val/category folder
        for image in validate_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Val', category, image))
        # Copy each test image into the corresponding Test/category folder
        for image in test_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Test', category, image))

# Define the original dataset folder containing the class subfolders
source_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters'
# Define the destination folder where the Train/Val/Test structure will be created
dest_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification'
# Call the split_data function to perform the splitting operation
split_data(source_folder, dest_folder)

With your images split into folders, the model can now iterate over them in training and validation loops.
This structure is compatible with PyTorch’s dataset utilities, making the next part seamless.
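As a quick sanity check, you can point torchvision's ImageFolder at each split and print its size (a minimal sketch, assuming the destination folder from the script above). One caveat: the split script creates capitalized Train/Val/Test folders while the training script below uses lowercase keys; Windows paths are case-insensitive so this works as written, but on a case-sensitive filesystem such as Linux you should keep the casing consistent:

import os
from torchvision import datasets

dest_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification'

# Report how many images and classes ended up in each split
for split in ['Train', 'Val', 'Test']:
    ds = datasets.ImageFolder(os.path.join(dest_folder, split))
    print(f'{split}: {len(ds)} images across {len(ds.classes)} classes')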
Training the FasterViT Model
Now comes the heart of the pipeline: training the FasterViT model on your custom dataset.
The training function handles multiple epochs, computes loss and accuracy, and saves the best model weights based on validation performance.
This code uses standard PyTorch structures like data loaders, optimizers, schedulers, and loss functions.
It ensures your model trains efficiently and tracks progress over time.
# Import os to handle file paths and directory operations
import os
# Import the core PyTorch library
import torch
# Import torchvision datasets and transforms for image loading and preprocessing
from torchvision import datasets, transforms
# Import DataLoader to batch and iterate over datasets
from torch.utils.data import DataLoader
# Import create_model from fastervit to construct the FasterViT architecture
from fastervit import create_model
# Import PyTorch's optimization module
import torch.optim as optim
# Import learning rate scheduler utilities
from torch.optim import lr_scheduler
# Import time to measure training duration
import time
# Import copy to deep copy model weights when tracking the best model
import copy

# Define a training loop function for the model
def train_model(model, criterion, optimizer, scheduler, num_epochs):
    # Record the starting time of the training process
    since = time.time()
    # Make a deep copy of the model's initial weights to store the best version
    best_model_wts = copy.deepcopy(model.state_dict())
    # Initialize the best accuracy with zero
    best_acc = 0.0

    # Loop over each epoch in the training process
    for epoch in range(num_epochs):
        # Print the current epoch index and total epochs
        print(f'Epoch {epoch}/{num_epochs - 1}')
        # Print a visual separator line
        print('-' * 10)

        # Each epoch has both a training phase and a validation phase
        for phase in ['train', 'val']:
            # Set the model to training mode during the train phase
            if phase == 'train':
                model.train()
            # Set the model to evaluation mode during the validation phase
            else:
                model.eval()

            # Initialize running loss for the epoch
            running_loss = 0.0
            # Initialize running correct predictions for the epoch
            running_corrects = 0

            # Iterate over batches from the dataloader of the current phase
            for inputs, labels in dataloaders[phase]:
                # Move input images to the selected device (CPU or GPU)
                inputs = inputs.to(device)
                # Move labels to the selected device
                labels = labels.to(device)

                # Reset gradients of the optimizer at the start of each batch
                optimizer.zero_grad()

                # Enable gradient computation only when in training phase
                with torch.set_grad_enabled(phase == 'train'):
                    # Perform a forward pass through the model to get outputs
                    outputs = model(inputs)
                    # Get the predicted class indices by taking the max logit
                    _, preds = torch.max(outputs, 1)
                    # Compute the loss between model outputs and true labels
                    loss = criterion(outputs, labels)

                    # If in training phase, perform backpropagation and optimizer step
                    if phase == 'train':
                        # Backpropagate the loss
                        loss.backward()
                        # Update model parameters
                        optimizer.step()

                # Accumulate the batch loss scaled by the batch size
                running_loss += loss.item() * inputs.size(0)
                # Accumulate the number of correct predictions
                running_corrects += torch.sum(preds == labels.data)

            # Step the learning rate scheduler after finishing the training phase
            if phase == 'train':
                scheduler.step()

            # Compute the epoch loss by dividing total loss by dataset size
            epoch_loss = running_loss / dataset_sizes[phase]
            # Compute the epoch accuracy as corrects divided by dataset size
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            # Print the loss and accuracy for this phase
            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # If we are in validation phase and this accuracy is the best so far
            if phase == 'val' and epoch_acc > best_acc:
                # Update the best accuracy value
                best_acc = epoch_acc
                # Save the current model weights as the best model
                best_model_wts = copy.deepcopy(model.state_dict())

        # Print a blank line for better readability between epochs
        print()

    # Compute total training time in seconds
    time_elapsed = time.time() - since
    # Print the total training time in minutes and seconds
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    # Print the best validation accuracy achieved during training
    print(f'Best val Acc: {best_acc:4f}')

    # Load the best model weights back into the model
    model.load_state_dict(best_model_wts)
    # Return the best model after training
    return model

# Use the main guard to ensure code only runs when this script is executed directly
if __name__ == "__main__":
    # Set the path to the prepared Train/Val dataset directory
    data_dir = "D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification"

    # Define image transformations for training and validation datasets
    data_transforms = {
        # Training data augmentation and normalization pipeline
        'train': transforms.Compose([
            # Resize the shortest side of the image to 256 pixels
            transforms.Resize(256),
            # Randomly crop a 224x224 patch from the image
            transforms.RandomResizedCrop(224),
            # Randomly flip the image horizontally for augmentation
            transforms.RandomHorizontalFlip(),
            # Convert the image to a PyTorch tensor
            transforms.ToTensor(),
            # Normalize the image with ImageNet mean and std
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        # Validation data preprocessing and normalization pipeline
        'val': transforms.Compose([
            # Resize the shortest side of the image to 256 pixels
            transforms.Resize(256),
            # Take a centered 224x224 crop from the image
            transforms.CenterCrop(224),
            # Convert the image to a PyTorch tensor
            transforms.ToTensor(),
            # Normalize the image with ImageNet mean and std
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    # Create ImageFolder datasets for train and val from the directory structure
    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}
    # Wrap datasets with DataLoaders for batching and shuffling
    dataloaders = {x: DataLoader(image_datasets[x], batch_size=32, shuffle=True, num_workers=4) for x in ['train', 'val']}
    # Get dataset sizes for calculating loss and accuracy
    dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
    # Extract class names from the training dataset folder structure
    class_names = image_datasets['train'].classes

    # Choose GPU if available; otherwise, fall back to CPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Create a FasterViT model instance with pretrained weights loaded from the given path
    model = create_model('faster_vit_0_224', pretrained=True, model_path="d:/temp/models/faster_vit_0.pth.tar")
    # Get the number of input features of the model head
    num_ftrs = model.head.in_features
    # Replace the final classification layer with a new Linear layer for our number of classes
    model.head = torch.nn.Linear(num_ftrs, len(class_names))
    # Move the model to the selected device (GPU or CPU)
    model = model.to(device)

    # Define the cross-entropy loss function for multi-class classification
    criterion = torch.nn.CrossEntropyLoss()
    # Create an SGD optimizer with a learning rate and momentum
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Setup a StepLR scheduler to reduce LR every 7 epochs by a factor of 0.1
    scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

    # Train the model using the train_model function for 100 epochs
    model = train_model(model, criterion, optimizer, scheduler, num_epochs=100)

    # Save the trained model weights to disk
    torch.save(model.state_dict(), 'd:/temp/models/star_wars_faster_vit_model.pth')

Training on your custom dataset ensures the model learns distinct visual differences between your classes.
After training, the saved model weights can be reused for inference or further fine-tuning.
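For example, here is a minimal sketch of reloading the saved weights and fine-tuning only the classification head, which is a cheap way to adapt the model further; num_classes here is hypothetical and must match the head the weights were saved with:

import torch
from fastervit import create_model

# Rebuild the architecture and resize the head before loading the saved weights
num_classes = 5  # hypothetical: must equal the class count used during training
model = create_model('faster_vit_0_224', pretrained=False)
model.head = torch.nn.Linear(model.head.in_features, num_classes)
model.load_state_dict(torch.load('d:/temp/models/star_wars_faster_vit_model.pth',
                                 map_location='cpu'))

# Freeze the backbone and leave only the head trainable
for param in model.parameters():
    param.requires_grad = False
for param in model.head.parameters():
    param.requires_grad = True

# Optimize only the head parameters during the fine-tuning run
optimizer = torch.optim.SGD(model.head.parameters(), lr=0.001, momentum=0.9)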
Testing Your Trained FasterViT Model
Once training completes, you want to verify that the model works on unseen data.
This part loads the saved model weights, prepares an input image, runs prediction, and displays the result.
Putting the predicted label onto the image makes it easy to visually confirm the model’s performance.
# Import PyTorch for model operations and tensors
import torch
# Import transforms for image preprocessing steps
from torchvision import transforms
# Import the FasterViT model creation utility
from fastervit import create_model
# Import os to work with filesystem paths
import os
# Import OpenCV for image reading and display
import cv2
# Import NumPy for array operations
import numpy as np

# Set the initial number of classes (will be updated later)
num_classes = 50

# Create a FasterViT model instance with the specified configuration
model = create_model('faster_vit_0_224', pretrained=False)
# Adjust the classification head to match the desired number of classes
model.head = torch.nn.Linear(model.head.in_features, num_classes)

# Select GPU if available; otherwise, use CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Move the model to the chosen device
model = model.to(device)

# Define the path to the file containing the saved model weights
model_path = 'd:/temp/models/star_wars_faster_vit_model.pth'
# Load the saved model weights from disk into the model
model.load_state_dict(torch.load(model_path, map_location=device))
# Put the model into evaluation mode to disable dropout and other training-only layers
model.eval()

# Define preprocessing steps for input images before feeding them into the model
preprocess = transforms.Compose([
    # Convert the input NumPy array to a PIL Image
    transforms.ToPILImage(),
    # Resize the image to 256 pixels on the shortest side
    transforms.Resize(256),
    # Center crop the image to 224x224 pixels
    transforms.CenterCrop(224),
    # Convert the image to a PyTorch tensor
    transforms.ToTensor(),
    # Normalize the image using ImageNet mean and standard deviation
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define a helper function to load and preprocess a single image
def load_image(image_path):
    # Read the image from disk using OpenCV (BGR format)
    image = cv2.imread(image_path)
    # Convert the image from BGR color space to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Apply the preprocessing pipeline defined above
    image = preprocess(image)
    # Add a batch dimension to create a 4D tensor
    image = image.unsqueeze(0)
    # Move the image tensor to the selected device (GPU or CPU)
    image = image.to(device)
    # Return the preprocessed image tensor
    return image

# Define a function that loads an image, runs it through the model, and returns the predicted class name
def predict(image_path, model, class_names):
    # Load and preprocess the input image
    image = load_image(image_path)
    # Disable gradient computation for inference
    with torch.no_grad():
        # Forward pass through the model to get class scores
        outputs = model(image)
        # Get the index of the class with the highest score
        _, preds = torch.max(outputs, 1)
    # Map the predicted index to the corresponding class name
    predicted_class = class_names[preds.item()]
    # Return the predicted class label
    return predicted_class

# Import glob to list files using wildcard patterns
from glob import glob

# Define the path to the Test folder, used to retrieve class names from the folder names
testPath = "D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification/Test"
# Get the list of class names by reading subfolder names inside the Test directory
class_names = [f for f in os.listdir(testPath) if os.path.isdir(os.path.join(testPath, f))]
# Print the detected class names to verify them
print(class_names)
# Update the number of classes based on detected folder names
num_classes = len(class_names)

# Define the path to a sample image used for testing the model
imagePath = "Visual-Language-Models-Tutorials/FasterViT - StarWars - Image classification on your Custom Dataset using Fast Vision Transformers/Yoda-Test-Image.jpg"
# Call the predict function to obtain a predicted class label
predicted_class = predict(imagePath, model, class_names)
# Print the predicted class for inspection
print(f"Predicted class : {predicted_class}")

# Define a function that predicts the class and draws the predicted label on the image
def predict_and_draw(image_path, model, class_names):
    # Load the original image using OpenCV (BGR format)
    image = cv2.imread(image_path)
    # Convert the loaded image from BGR to RGB for preprocessing
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Apply the same preprocessing used for training/validation
    input_tensor = preprocess(image_rgb)
    # Add a batch dimension to the input tensor
    input_tensor = input_tensor.unsqueeze(0)
    # Move the input tensor to the selected device
    input_tensor = input_tensor.to(device)

    # Disable gradient computation during inference
    with torch.no_grad():
        # Run the model to get output logits
        outputs = model(input_tensor)
        # Find the index of the highest scoring class
        _, preds = torch.max(outputs, 1)
    # Convert the index to the corresponding class name
    predicted_class = class_names[preds.item()]

    # Prepare the text label to overlay on the image
    text = f"Predicted: {predicted_class}"
    # Choose the font face for the text
    font = cv2.FONT_HERSHEY_SIMPLEX
    # Set font scale (size of the text)
    font_scale = 1
    # Set the thickness of the text stroke
    font_thickness = 3
    # Choose the initial position (x, y) for the text on the image
    text_x, text_y = 10, 50
    # Draw the predicted label on the image using the specified font and color
    cv2.putText(image, text, (text_x, text_y), font, font_scale, (0, 100, 100), font_thickness)

    # Display the result image in a window titled "Predicted Image"
    cv2.imshow("Predicted Image", image)
    # Wait for a key press before closing the window
    cv2.waitKey(0)
    # Close all OpenCV windows
    cv2.destroyAllWindows()

    # Set the path where the labeled output image will be saved
    output_image_path = "D:/temp/predicted_image.jpg"
    # Write the modified image with the prediction to disk
    cv2.imwrite(output_image_path, image)
    # Print the path to the saved predicted image
    print(f"Predicted image saved at: {output_image_path}")

# Run the predict-and-draw helper on the test image
predict_and_draw(imagePath, model, class_names)

Testing confirms that the model can generalize to new images and gives you visual feedback on its performance.
With this working pipeline, you can classify any image into your defined classes.
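Beyond single images, you can estimate generalization by scoring the entire Test split. A minimal sketch, reusing the trained model and device from the script above together with the validation-style preprocessing:

import os
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Same deterministic preprocessing used for validation images
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

test_dir = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification/Test'
test_ds = datasets.ImageFolder(test_dir, test_transform)
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)

# Assumes `model` and `device` are already set up as in the testing script
model.eval()
correct = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        preds = model(inputs.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()

print(f'Test accuracy: {correct / len(test_ds):.4f}')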
FAQ
What is FasterViT image classification?
FasterViT image classification uses a hybrid model combining convolution and transformer layers to learn image features and assign class labels efficiently.
Why split the dataset into train, val, and test?
Splitting allows the model to learn patterns (train), tune performance (val), and evaluate generalization (test) for reliable results.
What does the scheduler do in training?
The scheduler reduces the learning rate over time, helping stabilize training and improve final accuracy.
Why normalize images before training?
Normalization ensures images have consistent pixel statistics, which helps the model converge faster.
Do I need GPU for this tutorial?
No, but GPU speeds up training significantly, especially with large datasets and transformer blocks.
Can I reuse the trained model for other datasets?
Yes, you can fine-tune it or retrain the head for different classes.
What library provides FasterViT?
The fastervit library offers a PyTorch-compatible implementation of the FasterViT architecture.
How do I visualize predictions?
Predictions are written onto images using OpenCV’s text overlay and display functions.
What is the main loss function used?
CrossEntropyLoss is used, which is common for multi-class classification tasks.
Summary
In this complete FasterViT image classification tutorial, you learned how to:
✔ Set up a stable Python + PyTorch environment
✔ Prepare a custom dataset for training and evaluation
✔ Fine-tune a pretrained FasterViT model on your own images
✔ Test and visualize predictions on new data
FasterViT’s hybrid architecture gives you the speed of CNNs and the global context power of transformers, perfect for modern image classification tasks.
Conclusion
You now have a complete, end-to-end FasterViT image classification workflow using a custom dataset.
This setup lets you take real images, split them into structured data, train a powerful hybrid model, and test predictions visually and programmatically.
Transformer-based architectures like FasterViT bring the capability to understand global image context while keeping the efficiency of convolutional representations.
By mastering this pipeline, you unlock a flexible pattern you can reuse across different domains, from character recognition to industrial image categorization.
Connect :
☕ Buy me a coffee — https://ko-fi.com/eranfeit
🖥️ Email : feitgemel@gmail.com
🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb
Enjoy,
Eran
