Last Updated on 02/01/2026 by Eran Feit
🧠 Introduction — FasterViT Image Classification Using Custom Dataset
FasterViT image classification using custom dataset represents a modern, efficient approach to training deep learning models that can recognize and categorize images from your own tailored collection of visual data. In a world where off-the-shelf datasets often don’t match specific application needs, applying models like FasterViT to a custom dataset — such as a curated set of Star Wars character images — allows developers to build computer vision systems that reflect real-world use cases and unique classification requirements. This customization is particularly powerful for niche tasks that require specialized image recognition beyond generic object categories, making FasterViT an exciting choice for advanced computer vision practitioners.
At its core, FasterViT combines the strengths of traditional convolutional neural networks (CNNs), which excel at capturing detailed local features, with the global context modeling of vision transformers. This hybrid architecture enables the model to learn both fine-grained texture patterns and broad relational information across an image, leading to robust feature representations. When applied to a custom dataset, better feature extraction and classification accuracy can be achieved compared to using either a pure CNN or a pure transformer model alone.
Training FasterViT on a custom dataset introduces challenges and opportunities. Developers must carefully prepare the data — organizing training, validation, and testing splits, and ensuring proper image preprocessing — so the model can effectively generalize from limited or imbalanced samples. With proper dataset preparation and training strategies, custom FasterViT models can outperform many conventional deep learning classifiers, especially in tasks that require distinguishing fine differences between similar image classes.
Finally, integrating FasterViT into real-world applications demonstrates its practical value. Whether building a character recognizer from a Star Wars image set, designing a wildlife species classifier, or developing a custom industrial defect detector, FasterViT’s ability to leverage custom datasets makes it a highly adaptable tool in the computer vision ecosystem. As deep learning continues to evolve, models like FasterViT that blend efficiency with performance are key for developers who need both speed and accuracy in specialized image classification tasks.
🌟 What Is FasterViT Image Classification and Why Custom Data Matters?
FasterViT image classification is a deep learning approach that blends convolutional neural networks with vision transformer elements to effectively classify images. Unlike traditional models that focus purely on either local patterns (CNNs) or global relationships (transformers), FasterViT uses a hybrid architecture that takes advantage of the best of both worlds. This makes it especially suitable for image classification tasks where details matter and large-scale context influences predictions, such as distinguishing between visually similar characters or objects.
When working with custom datasets — like a unique Star Wars image collection — conventional models often fall short due to limited examples or insufficient context. Custom datasets reflect real-world problems these models need to solve, such as identifying specific character traits or unusual visual features unseen in benchmark datasets. Training FasterViT on these tailored images gives the model exposure to exactly the visual domains it will encounter during inference, improving accuracy and robustness.
Moreover, FasterViT’s hierarchical attention mechanism enables efficient processing of visual features at different scales. Early layers might focus on small, local details in images — edges, textures, or character accessories — while later transformer-like layers capture broader patterns or global relationships across the entire image. This layered attention system supports the model’s ability to generalize from a custom dataset to unseen test images, making it well-suited for specialized vision tasks that vary significantly from popular benchmark collections.
In practical application, this means that when you feed FasterViT a custom dataset of labeled images, the model can learn both subtle and global cues needed to make reliable classifications. Whether your goal is to recognize specific individuals, object categories, or nuanced visual differences dictated by your dataset’s uniqueness, FasterViT’s flexible architecture enables innovative solutions. By training this model on your curated images, you customize the learning process — and the resulting classifier becomes tailored, higher performing, and more relevant to your specific needs.
The FasterViT architecture

The FasterViT architecture is built as a hybrid model that combines the strengths of convolutional neural networks and vision transformers in a single unified pipeline. The network begins with traditional convolutional layers, which downsample the input image and extract low-level visual features such as edges, textures, and simple spatial patterns. These convolution stages are efficient and computationally light, making them ideal for early processing where fine-grained image structure matters most. By progressively reducing the spatial resolution while increasing channel depth, the model prepares compact yet informative feature maps for the next stages.
After the initial convolution blocks, FasterViT transitions into deeper stages that incorporate hierarchical attention. This is where the transformer-based components come into play. Hierarchical attention allows the model to capture long-range dependencies in the image — understanding how different regions relate to one another, even when they are far apart spatially. This global reasoning is what makes transformer-based architectures particularly powerful for image understanding, as the model is no longer limited to only local receptive fields like a CNN. FasterViT carefully balances attention computation so it remains efficient while still modeling complex contextual relationships.
In the later stages, the model continues to alternate between downsampling and attention-driven processing, building increasingly abstract feature representations. By the time the data reaches the classification head, the network has learned both detailed local information and high-level contextual structure. This combination enables FasterViT to achieve strong accuracy on image classification tasks while remaining computationally efficient compared to pure transformer models. The architecture is therefore especially useful for real-world applications where both performance and speed matter, making it a compelling evolution in the vision transformer family.

Building a Practical FasterViT Image Classification Tutorial with a Custom Dataset
This tutorial walks through the full process of using FasterViT image classification on a custom dataset, showing how to install the required libraries, prepare the dataset, train the model, and finally test it on new images. The goal is to give you a complete, working pipeline so you can adapt the same approach to any dataset you choose — whether that’s animals, products, vehicles, or any type of labeled image collection. Everything is designed to be hands-on and code-driven so you can follow along step-by-step.
The core idea behind the tutorial is to demonstrate how FasterViT — a hybrid model that combines convolutional layers with transformer-based attention — can be trained on real-world images rather than relying only on standard benchmark datasets. You’ll see how the dataset is split into training, validation, and test sets, how transformations are applied, and how the model learns to identify each class. By the end, you’ll understand not only how the code works, but also why each stage of the workflow is important.
Another key part of the tutorial is modifying the model’s final classification head to match the number of classes in your dataset. This ensures FasterViT can correctly predict the labels you define. The training loop also tracks accuracy and loss over time, so you can monitor how well the model is learning and automatically keep the best-performing weights. This makes the process robust and beginner-friendly, while still being powerful enough for advanced experimentation.
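The head swap itself is a one-liner. Here is a minimal sketch using a hypothetical five-class dataset; the full training script later in this post does the same thing with len(class_names):

import torch
from fastervit import create_model

model = create_model('faster_vit_0_224', pretrained=False)

# Replace the default classifier with a new Linear layer sized for your dataset
num_classes = 5  # hypothetical: use len(class_names) from your own dataset
model.head = torch.nn.Linear(model.head.in_features, num_classes)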
Finally, the tutorial shows how to load your trained model and run predictions on new images. You’ll see how to preprocess test images, send them through the model, and display the predicted class — even overlaying the result on the image itself. This allows you to move from raw data → trained model → working classifier, creating a complete solution that you can reuse and scale for future projects.
Link to the video tutorial : https://youtu.be/n-SpVoHrzDQ
You can download the code for the tutorial here : https://eranfeit.lemonsqueezy.com/checkout/buy/a6159108-c66c-4e21-80e0-7a6589f0b8b0 or here : https://ko-fi.com/s/28ca45253c
Link to the post for Medium users : https://medium.com/vision-transformers-tutorials/fastervit-image-classification-using-custom-dataset-star-wars-dataset-8e6ce470d566
You can follow my blog here : https://eranfeit.net/blog/
Want to get started with Computer Vision or take your skills to the next level?
Great Interactive Course : “Deep Learning for Images with PyTorch” here : https://datacamp.pxf.io/zxWxnm
If you’re just beginning, I recommend this step-by-step course designed to introduce you to the foundations of Computer Vision – Complete Computer Vision Bootcamp With PyTorch & TensorFlow
If you’re already experienced and looking for more advanced techniques, check out this deep-dive course – Modern Computer Vision GPT, PyTorch, Keras, OpenCV4
FasterViT Image Classification Using Custom Dataset
FasterViT is a powerful hybrid model that combines the feature extraction strengths of convolutional neural networks with the global contextual understanding of vision transformers.
In this post, you’ll build a complete image classification pipeline using FasterViT on your own custom dataset — the Star Wars characters dataset.
Each part breaks down the code into digestible steps with clear explanations so you can follow along and adapt it to any dataset.
The goal is practical mastery: from environment setup and dataset preparation through training and testing, you will walk through a real-world workflow from top to bottom.
Think of this as a roadmap you can reuse to train FasterViT models on any custom image classification task, using PyTorch and the fastervit library.
Setting Up the Environment
Before training FasterViT on your custom dataset, your system must be ready.
This section creates a dedicated Conda environment and installs PyTorch with GPU support, along with FasterVit and necessary Python packages.
Isolating dependencies in a new environment prevents version conflicts and ensures compatibility for deep learning workflows.
You’ll also check the CUDA version to ensure GPU acceleration is available, which significantly speeds up training.
The specific package versions used here are selected for stability and reproducibility so you can train efficiently without unexpected errors.
# Create a new Conda environment named "fasterVit" with Python 3.11
conda create -n fasterVit python=3.11

# Activate the newly created "fasterVit" environment
conda activate fasterVit

# Check the installed CUDA version on your system
nvcc --version

# Install PyTorch 2.5.0, Torchvision, Torchaudio, and CUDA 12.4 support from the PyTorch and NVIDIA channels
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install the FasterVit library version 0.9.8 from PyPI
pip install fastervit==0.9.8

# Install timm (PyTorch Image Models) version 0.9.12 for model utilities and backbones
pip install timm==0.9.12

# Install matplotlib for plotting and visualization
pip install matplotlib

# Install OpenCV Python bindings
pip install opencv-python==4.10.0.84

This environment gives you all the tools needed to implement FasterViT training and testing without friction.
Now, you can focus on the dataset and model logic rather than debugging library conflicts.
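Before moving on, a quick sanity check from Python confirms that PyTorch was built with CUDA support and can see your GPU (a minimal sketch; the device name will differ on your machine):

import torch

# Verify the installed PyTorch version and CUDA availability
print(torch.__version__)             # should report 2.5.0
print(torch.cuda.is_available())     # True if the CUDA build found a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU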
Downloading and Understanding the Custom Dataset
Before we start training the FasterViT image classification model, we first need a labeled image dataset. In this tutorial, we are working with a collection of character images that are already grouped into folders by class. Each folder represents one category, and every image inside belongs to that specific label. This structure is very important because PyTorch relies on the folder layout to automatically map images to class names during training.
The dataset contains multiple characters, and each character has a number of image samples from different angles, lighting conditions, and backgrounds. This variation helps the model generalize and learn what truly defines each class, rather than memorizing a single image pattern. If your dataset includes clear, centered subjects and consistent labeling, you will normally achieve stronger and more reliable classification results.
When you download your dataset, extract it into a directory on your machine where you plan to work. In our case, the raw dataset is stored in a folder called:
D:/Data-Sets-Image-Classification/Star-Wars-Characters

Inside this folder, each sub-folder represents a different class. For example:

Star-Wars-Characters/
├── Class_1/
├── Class_2/
├── Class_3/
├── ...

This means you do not need a CSV file or manual labels — the folder names themselves act as the class labels. Later, our script automatically splits these images into Train, Validation, and Test folders while preserving the class-based structure. If you ever decide to swap this dataset with your own images, just keep the same folder-per-class layout and the rest of the code will continue to work smoothly.
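Before splitting, it can also help to count how many images each class contains so you can spot imbalanced categories early. A minimal sketch, assuming the folder-per-class layout described above:

import os

# Root folder of the raw dataset (one sub-folder per class)
dataset_root = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters'

# Print the number of image files in each class folder
for class_name in sorted(os.listdir(dataset_root)):
    class_dir = os.path.join(dataset_root, class_name)
    if os.path.isdir(class_dir):
        num_images = len([f for f in os.listdir(class_dir)
                          if os.path.isfile(os.path.join(class_dir, f))])
        print(f'{class_name}: {num_images} images')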
Preparing Your Dataset for Image Classification
The next step is organizing your custom dataset into training, validation, and testing splits.
This allows the model to learn from training examples, tune performance on validation data, and evaluate generalization on the test set.
This code creates appropriate folder structures, randomly shuffles images, and distributes them into the right folders.
A balanced dataset structure ensures that training and validation samples represent all classes evenly.
# Dataset : https://www.kaggle.com/datasets/adamridene/star-wars-characters

# Import the os module for filesystem path and directory operations
import os
# Import shutil to copy files and manage file operations
import shutil
# Import random to shuffle image lists before splitting into sets
import random

# Define a helper function to create Train, Val, and Test folders for each category
def create_folders(base_path, categories):
    # Loop over every detected category name
    for category in categories:
        # Create the Train subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Train', category), exist_ok=True)
        # Create the Val subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Val', category), exist_ok=True)
        # Create the Test subfolder for the current category (if it doesn't already exist)
        os.makedirs(os.path.join(base_path, 'Test', category), exist_ok=True)

# Define a function to split data into train, validation, and test subsets
def split_data(source_folder, dest_folder, train_ratio=0.7, validate_ratio=0.2):
    # Get the list of subfolders (categories) inside the source folder
    categories = [d for d in os.listdir(source_folder) if os.path.isdir(os.path.join(source_folder, d))]
    # Ensure all required destination folders (Train/Val/Test per category) exist
    create_folders(dest_folder, categories)
    # Iterate over each category to process its images
    for category in categories:
        # Build the full path to the current category directory
        category_path = os.path.join(source_folder, category)
        # List all image files within this category directory
        images = [f for f in os.listdir(category_path) if os.path.isfile(os.path.join(category_path, f))]
        # Randomly shuffle the images to avoid ordering bias
        random.shuffle(images)
        # Calculate the index at which the training subset ends
        train_split = int(len(images) * train_ratio)
        # Calculate the index at which the validation subset ends
        validate_split = int(len(images) * (train_ratio + validate_ratio))
        # Select the training images based on the first split
        train_images = images[:train_split]
        # Select the validation images between train_split and validate_split
        validate_images = images[train_split:validate_split]
        # The remaining images belong to the test set
        test_images = images[validate_split:]
        # Copy each training image into the corresponding Train/category folder
        for image in train_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Train', category, image))
        # Copy each validation image into the corresponding Val/category folder
        for image in validate_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Val', category, image))
        # Copy each test image into the corresponding Test/category folder
        for image in test_images:
            shutil.copy(os.path.join(category_path, image), os.path.join(dest_folder, 'Test', category, image))

# Define the original dataset folder containing the class subfolders
source_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters'
# Define the destination folder where the Train/Val/Test structure will be created
dest_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification'
# Call the split_data function to perform the splitting operation
split_data(source_folder, dest_folder)

With your images split into folders, the model can now iterate over them in training and validation loops.
This structure is compatible with PyTorch’s dataset utilities, making the next part seamless.
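As a quick sanity check, you can point torchvision's ImageFolder at each split and print its size (a minimal sketch, assuming the destination folder from the script above). One caveat: the split script creates capitalized Train/Val/Test folders while the training script below uses lowercase keys; Windows paths are case-insensitive so this works as written, but on a case-sensitive filesystem such as Linux you should keep the casing consistent:

import os
from torchvision import datasets

dest_folder = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification'

# Report how many images and classes ended up in each split
for split in ['Train', 'Val', 'Test']:
    ds = datasets.ImageFolder(os.path.join(dest_folder, split))
    print(f'{split}: {len(ds)} images across {len(ds.classes)} classes')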
Training the FasterViT Model
Now comes the heart of the pipeline: training the FasterViT model on your custom dataset.
The training function handles multiple epochs, computes loss and accuracy, and saves the best model weights based on validation performance.
This code uses standard PyTorch structures like data loaders, optimizers, schedulers, and loss functions.
It ensures your model trains efficiently and tracks progress over time.
# Import os to handle file paths and directory operations
import os
# Import the core PyTorch library
import torch
# Import torchvision datasets and transforms for image loading and preprocessing
from torchvision import datasets, transforms
# Import DataLoader to batch and iterate over datasets
from torch.utils.data import DataLoader
# Import create_model from fastervit to construct the FasterViT architecture
from fastervit import create_model
# Import PyTorch's optimization module
import torch.optim as optim
# Import learning rate scheduler utilities
from torch.optim import lr_scheduler
# Import time to measure training duration
import time
# Import copy to deep copy model weights when tracking the best model
import copy

# Define a training loop function for the model
def train_model(model, criterion, optimizer, scheduler, num_epochs):
    # Record the starting time of the training process
    since = time.time()
    # Make a deep copy of the model's initial weights to store the best version
    best_model_wts = copy.deepcopy(model.state_dict())
    # Initialize the best accuracy with zero
    best_acc = 0.0

    # Loop over each epoch in the training process
    for epoch in range(num_epochs):
        # Print the current epoch index and total epochs
        print(f'Epoch {epoch}/{num_epochs - 1}')
        # Print a visual separator line
        print('-' * 10)

        # Each epoch has both a training phase and a validation phase
        for phase in ['train', 'val']:
            # Set the model to training mode during the train phase
            if phase == 'train':
                model.train()
            # Set the model to evaluation mode during the validation phase
            else:
                model.eval()

            # Initialize running loss for the epoch
            running_loss = 0.0
            # Initialize running correct predictions for the epoch
            running_corrects = 0

            # Iterate over batches from the dataloader of the current phase
            for inputs, labels in dataloaders[phase]:
                # Move input images to the selected device (CPU or GPU)
                inputs = inputs.to(device)
                # Move labels to the selected device
                labels = labels.to(device)

                # Reset gradients of the optimizer at the start of each batch
                optimizer.zero_grad()

                # Enable gradient computation only when in training phase
                with torch.set_grad_enabled(phase == 'train'):
                    # Perform a forward pass through the model to get outputs
                    outputs = model(inputs)
                    # Get the predicted class indices by taking the max logit
                    _, preds = torch.max(outputs, 1)
                    # Compute the loss between model outputs and true labels
                    loss = criterion(outputs, labels)

                    # If in training phase, perform backpropagation and optimizer step
                    if phase == 'train':
                        # Backpropagate the loss
                        loss.backward()
                        # Update model parameters
                        optimizer.step()

                # Accumulate the batch loss scaled by the batch size
                running_loss += loss.item() * inputs.size(0)
                # Accumulate the number of correct predictions
                running_corrects += torch.sum(preds == labels.data)

            # Step the learning rate scheduler after finishing the training phase
            if phase == 'train':
                scheduler.step()

            # Compute the epoch loss by dividing total loss by dataset size
            epoch_loss = running_loss / dataset_sizes[phase]
            # Compute the epoch accuracy as corrects divided by dataset size
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            # Print the loss and accuracy for this phase
            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # If we are in validation phase and this accuracy is the best so far
            if phase == 'val' and epoch_acc > best_acc:
                # Update the best accuracy value
                best_acc = epoch_acc
                # Save the current model weights as the best model
                best_model_wts = copy.deepcopy(model.state_dict())

        # Print a blank line for better readability between epochs
        print()

    # Compute total training time in seconds
    time_elapsed = time.time() - since
    # Print the total training time in minutes and seconds
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    # Print the best validation accuracy achieved during training
    print(f'Best val Acc: {best_acc:4f}')

    # Load the best model weights back into the model
    model.load_state_dict(best_model_wts)
    # Return the best model after training
    return model

# Use the main guard to ensure code only runs when this script is executed directly
if __name__ == "__main__":
    # Set the path to the prepared Train/Val dataset directory
    data_dir = "D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification"

    # Define image transformations for training and validation datasets
    data_transforms = {
        # Training data augmentation and normalization pipeline
        'train': transforms.Compose([
            # Resize the shortest side of the image to 256 pixels
            transforms.Resize(256),
            # Randomly crop a 224x224 patch from the image
            transforms.RandomResizedCrop(224),
            # Randomly flip the image horizontally for augmentation
            transforms.RandomHorizontalFlip(),
            # Convert the image to a PyTorch tensor
            transforms.ToTensor(),
            # Normalize the image with ImageNet mean and std
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
        # Validation data preprocessing and normalization pipeline
        'val': transforms.Compose([
            # Resize the shortest side of the image to 256 pixels
            transforms.Resize(256),
            # Take a centered 224x224 crop from the image
            transforms.CenterCrop(224),
            # Convert the image to a PyTorch tensor
            transforms.ToTensor(),
            # Normalize the image with ImageNet mean and std
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ]),
    }

    # Create ImageFolder datasets for train and val from the directory structure
    image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}
    # Wrap datasets with DataLoaders for batching and shuffling
    dataloaders = {x: DataLoader(image_datasets[x], batch_size=32, shuffle=True, num_workers=4) for x in ['train', 'val']}
    # Get dataset sizes for calculating loss and accuracy
    dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
    # Extract class names from the training dataset folder structure
    class_names = image_datasets['train'].classes

    # Choose GPU if available; otherwise, fall back to CPU
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Create a FasterViT model instance with pretrained weights loaded from the given path
    model = create_model('faster_vit_0_224', pretrained=True, model_path="d:/temp/models/faster_vit_0.pth.tar")
    # Get the number of input features of the model head
    num_ftrs = model.head.in_features
    # Replace the final classification layer with a new Linear layer for our number of classes
    model.head = torch.nn.Linear(num_ftrs, len(class_names))
    # Move the model to the selected device (GPU or CPU)
    model = model.to(device)

    # Define the cross-entropy loss function for multi-class classification
    criterion = torch.nn.CrossEntropyLoss()
    # Create an SGD optimizer with a learning rate and momentum
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Setup a StepLR scheduler to reduce LR every 7 epochs by a factor of 0.1
    scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

    # Train the model using the train_model function for 100 epochs
    model = train_model(model, criterion, optimizer, scheduler, num_epochs=100)

    # Save the trained model weights to disk
    torch.save(model.state_dict(), 'd:/temp/models/star_wars_faster_vit_model.pth')

Training on your custom dataset ensures the model learns distinct visual differences between your classes.
After training, the saved model weights can be reused for inference or further fine-tuning.
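For example, here is a minimal sketch of reloading the saved weights and fine-tuning only the classification head, which is a cheap way to adapt the model further; num_classes here is hypothetical and must match the head the weights were saved with:

import torch
from fastervit import create_model

# Rebuild the architecture and resize the head before loading the saved weights
num_classes = 5  # hypothetical: must equal the class count used during training
model = create_model('faster_vit_0_224', pretrained=False)
model.head = torch.nn.Linear(model.head.in_features, num_classes)
model.load_state_dict(torch.load('d:/temp/models/star_wars_faster_vit_model.pth',
                                 map_location='cpu'))

# Freeze the backbone and leave only the head trainable
for param in model.parameters():
    param.requires_grad = False
for param in model.head.parameters():
    param.requires_grad = True

# Optimize only the head parameters during the fine-tuning run
optimizer = torch.optim.SGD(model.head.parameters(), lr=0.001, momentum=0.9)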
Testing Your Trained FasterViT Model
Once training completes, you want to verify that the model works on unseen data.
This part loads the saved model weights, prepares an input image, runs prediction, and displays the result.
Putting the predicted label onto the image makes it easy to visually confirm the model’s performance.
# Import PyTorch for model operations and tensors
import torch
# Import transforms for image preprocessing steps
from torchvision import transforms
# Import the FasterViT model creation utility
from fastervit import create_model
# Import os to work with filesystem paths
import os
# Import OpenCV for image reading and display
import cv2
# Import NumPy for array operations
import numpy as np

# Set the initial number of classes (will be updated later)
num_classes = 50

# Create a FasterViT model instance with the specified configuration
model = create_model('faster_vit_0_224', pretrained=False)
# Adjust the classification head to match the desired number of classes
model.head = torch.nn.Linear(model.head.in_features, num_classes)

# Select GPU if available; otherwise, use CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Move the model to the chosen device
model = model.to(device)

# Define the path to the file containing the saved model weights
model_path = 'd:/temp/models/star_wars_faster_vit_model.pth'
# Load the saved model weights from disk into the model
model.load_state_dict(torch.load(model_path, map_location=device))
# Put the model into evaluation mode to disable dropout and other training-only layers
model.eval()

# Define preprocessing steps for input images before feeding them into the model
preprocess = transforms.Compose([
    # Convert the input NumPy array to a PIL Image
    transforms.ToPILImage(),
    # Resize the image to 256 pixels on the shortest side
    transforms.Resize(256),
    # Center crop the image to 224x224 pixels
    transforms.CenterCrop(224),
    # Convert the image to a PyTorch tensor
    transforms.ToTensor(),
    # Normalize the image using ImageNet mean and standard deviation
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define a helper function to load and preprocess a single image
def load_image(image_path):
    # Read the image from disk using OpenCV (BGR format)
    image = cv2.imread(image_path)
    # Convert the image from BGR color space to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Apply the preprocessing pipeline defined above
    image = preprocess(image)
    # Add a batch dimension to create a 4D tensor
    image = image.unsqueeze(0)
    # Move the image tensor to the selected device (GPU or CPU)
    image = image.to(device)
    # Return the preprocessed image tensor
    return image

# Define a function that loads an image, runs it through the model, and returns the predicted class name
def predict(image_path, model, class_names):
    # Load and preprocess the input image
    image = load_image(image_path)
    # Disable gradient computation for inference
    with torch.no_grad():
        # Forward pass through the model to get class scores
        outputs = model(image)
        # Get the index of the class with the highest score
        _, preds = torch.max(outputs, 1)
    # Map the predicted index to the corresponding class name
    predicted_class = class_names[preds.item()]
    # Return the predicted class label
    return predicted_class

# Import glob to list files using wildcard patterns
from glob import glob

# Define the path to the Test folder, used to retrieve class names from the folder names
testPath = "D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification/Test"
# Get the list of class names by reading subfolder names inside the Test directory
class_names = [f for f in os.listdir(testPath) if os.path.isdir(os.path.join(testPath, f))]
# Print the detected class names to verify them
print(class_names)
# Update the number of classes based on detected folder names
num_classes = len(class_names)

# Define the path to a sample image used for testing the model
imagePath = "Visual-Language-Models-Tutorials/FasterViT - StarWars - Image classification on your Custom Dataset using Fast Vision Transformers/Yoda-Test-Image.jpg"
# Call the predict function to obtain a predicted class label
predicted_class = predict(imagePath, model, class_names)
# Print the predicted class for inspection
print(f"Predicted class : {predicted_class}")

# Define a function that predicts the class and draws the predicted label on the image
def predict_and_draw(image_path, model, class_names):
    # Load the original image using OpenCV (BGR format)
    image = cv2.imread(image_path)
    # Convert the loaded image from BGR to RGB for preprocessing
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Apply the same preprocessing used for training/validation
    input_tensor = preprocess(image_rgb)
    # Add a batch dimension to the input tensor
    input_tensor = input_tensor.unsqueeze(0)
    # Move the input tensor to the selected device
    input_tensor = input_tensor.to(device)

    # Disable gradient computation during inference
    with torch.no_grad():
        # Run the model to get output logits
        outputs = model(input_tensor)
        # Find the index of the highest scoring class
        _, preds = torch.max(outputs, 1)
    # Convert the index to the corresponding class name
    predicted_class = class_names[preds.item()]

    # Prepare the text label to overlay on the image
    text = f"Predicted: {predicted_class}"
    # Choose the font face for the text
    font = cv2.FONT_HERSHEY_SIMPLEX
    # Set font scale (size of the text)
    font_scale = 1
    # Set the thickness of the text stroke
    font_thickness = 3
    # Choose the initial position (x, y) for the text on the image
    text_x, text_y = 10, 50
    # Draw the predicted label on the image using the specified font and color
    cv2.putText(image, text, (text_x, text_y), font, font_scale, (0, 100, 100), font_thickness)

    # Display the result image in a window titled "Predicted Image"
    cv2.imshow("Predicted Image", image)
    # Wait for a key press before closing the window
    cv2.waitKey(0)
    # Close all OpenCV windows
    cv2.destroyAllWindows()

    # Set the path where the labeled output image will be saved
    output_image_path = "D:/temp/predicted_image.jpg"
    # Write the modified image with the prediction to disk
    cv2.imwrite(output_image_path, image)
    # Print the path to the saved predicted image
    print(f"Predicted image saved at: {output_image_path}")

# Run the predict-and-draw helper on the test image
predict_and_draw(imagePath, model, class_names)

Testing confirms that the model can generalize to new images and gives you visual feedback on its performance.
With this working pipeline, you can classify any image into your defined classes.
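Beyond single images, you can estimate generalization by scoring the entire Test split. A minimal sketch, reusing the trained model and device from the script above together with the validation-style preprocessing:

import os
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Same deterministic preprocessing used for validation images
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

test_dir = 'D:/Data-Sets-Image-Classification/Star-Wars-Characters-For-Classification/Test'
test_ds = datasets.ImageFolder(test_dir, test_transform)
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)

# Assumes `model` and `device` are already set up as in the testing script
model.eval()
correct = 0
with torch.no_grad():
    for inputs, labels in test_loader:
        preds = model(inputs.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()

print(f'Test accuracy: {correct / len(test_ds):.4f}')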
FAQ
What is FasterViT image classification?
FasterViT image classification uses a hybrid model combining convolution and transformer layers to learn image features and assign class labels efficiently.
Why split the dataset into train, val, and test?
Splitting allows the model to learn patterns (train), tune performance (val), and evaluate generalization (test) for reliable results.
What does the scheduler do in training?
The scheduler reduces the learning rate over time, helping stabilize training and improve final accuracy.
Why normalize images before training?
Normalization ensures images have consistent pixel statistics, which helps the model converge faster.
Do I need GPU for this tutorial?
No, but GPU speeds up training significantly, especially with large datasets and transformer blocks.
Can I reuse the trained model for other datasets?
Yes, you can fine-tune it or retrain the head for different classes.
What library provides FasterViT?
The fastervit library offers a PyTorch-compatible implementation of the FasterViT architecture.
How do I visualize predictions?
Predictions are written onto images using OpenCV’s text overlay and display functions.
What is the main loss function used?
CrossEntropyLoss is used, which is common for multi-class classification tasks.
Summary
In this complete FasterViT image classification tutorial, you learned how to:
✔ Set up a stable Python + PyTorch environment
✔ Prepare a custom dataset for training and evaluation
✔ Fine-tune a pretrained FasterViT model on your own images
✔ Test and visualize predictions on new data
FasterViT’s hybrid architecture gives you the speed of CNNs and the global context power of transformers, perfect for modern image classification tasks.
Conclusion
You now have a complete, end-to-end FasterViT image classification workflow using a custom dataset.
This setup lets you take real images, split them into structured data, train a powerful hybrid model, and test predictions visually and programmatically.
Transformer-based architectures like FasterViT bring the capability to understand global image context while keeping the efficiency of convolutional representations.
By mastering this pipeline, you unlock a flexible pattern you can reuse across different domains, from character recognition to industrial image categorization.
Connect :
☕ Buy me a coffee — https://ko-fi.com/eranfeit
🖥️ Email : feitgemel@gmail.com
🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb
Enjoy,
Eran
