Automatic Image Annotation with Autodistill and YOLOv8

Last Updated on 29/11/2025 by Eran Feit

Automatic image annotation is all about teaching machines to tag images for us.
Instead of a human drawing every bounding box and typing every label, models learn to recognize patterns and automatically assign classes like horse, car, or person to each object in a picture or video frame.
This drastically reduces the manual work needed to build high-quality datasets, which is often the biggest bottleneck in computer vision projects.

In modern workflows, automatic image annotation usually sits between raw data collection and model training.
You gather images or video, run an automatic annotator over them, and get labeled data in formats like YOLO, COCO, or Pascal VOC.
From there, you can train custom detectors and segmenters without spending weeks clicking boxes in an annotation tool.
The idea is not to remove humans completely, but to move them into a lighter review-and-correction role instead of having them draw every label from scratch.

As models get better, automatic image annotation becomes more accurate and flexible.
Vision-language models, foundation models, and specialized detectors can all be used to propose labels based on prompts or predefined ontologies.
You can, for example, ask a model to find “horses in a race,” “players on a basketball court,” or “cars in a highway scene,” and it will output bounding boxes and classes that are good enough to train smaller, task-specific models.

This approach scales incredibly well.
Once the pipeline is in place, you can run it on thousands or millions of images, generating a labeled dataset in hours instead of months.
That’s why automatic image annotation is becoming a core building block for anyone who wants to move quickly from a raw video or image collection to a production-ready deep learning model.

Why automatic image annotation is a game changer

Automatic image annotation gives you a practical way to turn unstructured visual data into training-ready datasets.
The main target is simple: start from raw images or video frames and finish with labeled data that your models can immediately learn from.
Instead of spending your time on repetitive labeling tasks, you focus on designing the ontology, choosing the right base models, and validating the results.

At a high level, a typical workflow begins with defining what you want to detect.
You choose your classes and labels, such as different types of vehicles, animals, or actions, and decide on the format you need (for example YOLO labels with normalized coordinates).
Then, a model is used to automatically scan each image, detect objects, and output bounding boxes and class IDs in the right structure.
This step can be driven by object detectors, vision-language models, or specialized auto-label frameworks.
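
To make the "YOLO labels with normalized coordinates" format concrete, here is a minimal sketch of how one pixel-space box becomes one line of a YOLO label file. The helper name `to_yolo_line` and the example numbers are illustrative assumptions, not part of the tutorial code.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label line.

    YOLO stores: class_id x_center y_center width height,
    with all four box values divided by the image size (range 0..1).
    """
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: a box covering the central half of a 640x480 frame.
print(to_yolo_line(1, 160, 120, 480, 360, 640, 480))
# → 1 0.500000 0.500000 0.500000 0.500000
```

The visualization code later in this post performs the reverse conversion, from normalized values back to pixel corners.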

The next goal is quality.
Automatic image annotation is powerful, but it’s never perfect, so you usually add thresholds, confidence filters, and sometimes a light human review step.
Low-confidence detections can be dropped; obvious mistakes can be fixed; and edge cases can be flagged for manual labeling.
Over time, as you refine your ontology and parameters, the auto-labeling pipeline becomes more reliable and produces cleaner datasets.
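
The keep / review / drop idea described above can be sketched in a few lines of plain Python. This is only an illustration: the `triage` function, the dictionary layout of a detection, and both cut-off values are assumptions chosen for the example, not settings taken from Autodistill or Grounding DINO.

```python
def triage(detections, keep_at=0.6, review_at=0.35):
    """Split detections into kept, needs-review, and dropped lists.

    Each detection is assumed to be a dict with a "score" key in 0..1.
    The two thresholds are illustrative and should be tuned per dataset.
    """
    kept, review, dropped = [], [], []
    for det in detections:
        score = det["score"]
        if score >= keep_at:
            kept.append(det)        # confident enough to use as a label
        elif score >= review_at:
            review.append(det)      # borderline: send to a human reviewer
        else:
            dropped.append(det)     # too weak: discard
    return kept, review, dropped


# Hypothetical detections for a single frame.
sample = [
    {"label": "horse", "score": 0.82},
    {"label": "horse", "score": 0.41},
    {"label": "horse", "score": 0.12},
]
kept, review, dropped = triage(sample)
print(len(kept), len(review), len(dropped))
```

Keeping a separate review bucket, rather than a single threshold, is one way to give the human reviewer a short, focused list instead of the whole dataset.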

Finally, everything feeds directly into training.
Once your images are annotated automatically, you plug the resulting dataset into a training script for models like YOLOv8.
Because the labels follow a standard format, you can experiment with different architectures, hyperparameters, and augmentation strategies without having to redo the annotation work.
This tight loop between automatic image annotation and model training is what makes the approach so attractive for real-world projects, especially when you are working with large video collections or rapidly changing data.

Walking through the code for automatic image annotation

This tutorial is built as a complete, end-to-end pipeline that shows how to go from raw horse-race videos to a trained YOLOv8 model, using automatic image annotation instead of manual labeling.
The code is organized in clear steps, so you can follow along even if this is your first larger computer vision project.
You start by preparing a dedicated Conda environment, installing PyTorch with CUDA support, and adding the main libraries you need: Ultralytics YOLOv8, Supervision, Autodistill, Grounding DINO, and a few helper packages.
The goal of this first part is simple: make sure your machine is ready to handle video processing, automatic labeling, and model training without conflicts.

Once the environment is ready, the code focuses on creating a dataset from real videos.
You point the script at a folder of horse-race clips, and it automatically extracts frames every few steps using Supervision’s video tools.
These frames are saved as images on disk, and a small sampling script displays a grid of thumbnails so you can visually confirm that the extracted images look good and cover different moments in the race.
At this stage, you still have only raw images, but they are neatly organized and ready to be annotated.

The core of the tutorial is the automatic annotation step.
Here you define an ontology with the classes you care about, such as “horse race”, “horse”, and related variants, and pass it into Autodistill with Grounding DINO as the base model.
The code runs through all of the extracted images, detects the relevant objects, and saves YOLO-style label files next to the images in a train/valid folder structure.
This is where the big time savings come from: you get a labeled dataset without drawing boxes by hand, while still controlling thresholds and label names to keep the annotations meaningful.

After the dataset is created, the tutorial moves on to training and evaluation.
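A data.yaml file describes the dataset paths and class names, and YOLOv8 is initialized from a YAML model definition (yolov8l.yaml), which means training starts from freshly initialized weights rather than from a pretrained checkpoint.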
The training script handles epochs, batch size, checkpoints, and validation, writing model weights into a dedicated checkpoints folder.
Finally, a testing script loads the best model and runs it on a held-out horse-race video, reading frames with OpenCV, running inference, and drawing bounding boxes and class names on each frame so you can watch the detector working in real time.

Taken together, the code in this tutorial teaches you how to build a practical automatic image annotation workflow around Autodistill and YOLOv8.
You learn how to prepare data from videos, auto-label it with a modern vision-language model, visualize and verify the annotations, and then train and deploy a custom object detector.
The target is not just to understand each individual command, but to see how all the pieces connect into a reusable pipeline that you can adapt to your own projects beyond horse-race detection.

Video tutorial: https://youtu.be/ujEDpRmaOaU

Code for this tutorial: https://eranfeit.lemonsqueezy.com/buy/e9168c31-06c1-4a8c-b544-ee2ad07173dc or https://ko-fi.com/s/553a125fa6

Medium version of this post: https://medium.com/@feitgemel/automatic-image-annotation-with-autodistill-and-yolov8-86822349c735

You can follow my blog here: https://eranfeit.net/blog/

Want to get started with Computer Vision or take your skills to the next level?

If you’re just beginning, I recommend this step-by-step course designed to introduce you to the foundations of Computer Vision – Complete Computer Vision Bootcamp With PyTorch & TensorFlow

If you’re already experienced and looking for more advanced techniques, check out this deep-dive course – Modern Computer Vision GPT, PyTorch, Keras, OpenCV4

Automatic image annotation is one of the easiest ways to turn raw videos into training data without spending weeks drawing bounding boxes.
Instead of clicking every horse and jockey by hand, you let a strong vision model propose labels and bounding boxes and then train YOLOv8 on top of those automatic annotations.
In this tutorial, we will walk through a complete pipeline that starts from horse-race videos and ends with a custom YOLOv8 detector running on new footage.

The code below is organized into six logical parts so you can follow the flow step by step.
You will set up a Conda environment, extract frames from video, run automatic image annotation with Autodistill and Grounding DINO, visualize the YOLO labels, train a YOLOv8 model, and finally test your detector on a separate race video.
Along the way, we will keep the focus on practical automatic image annotation so you can easily adapt the same pattern to your own projects.

Getting the environment ready for automatic image annotation

Before you touch any Python code, it is worth isolating the project in a dedicated Conda environment.
This makes your automatic image annotation workflow reproducible and keeps all the YOLOv8 and Autodistill dependencies in one clean place.
In this first block you create the environment, confirm CUDA, install PyTorch with GPU support, and add the key libraries you will use throughout the tutorial.

### Create a new Conda environment dedicated to automatic image annotation.
conda create --name Autodistill python=3.8

### Activate the environment so that all new packages are installed into this isolated space.
conda activate Autodistill

### Check the NVIDIA CUDA compiler version to confirm that CUDA is available on your system.
nvcc --version

### Install PyTorch, Torchvision, and Torchaudio with CUDA 11.8 support from the official PyTorch and NVIDIA channels.
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia

### Install the Ultralytics package that provides the YOLOv8 implementation.
pip install ultralytics==8.1.0

### Install the Supervision library to handle listing files, reading video frames, and plotting image grids.
pip install supervision==0.9.0

### Install the Autodistill framework that powers automatic image annotation.
pip install autodistill

### Install the Grounding DINO integration that lets Autodistill use a vision-language model for detection.
pip install autodistill_grounding_dino

### Install scikit-learn for any utility functions or potential metrics you might want later.
pip install scikit-learn

### Install the Roboflow client library, which can help with dataset management if you choose to upload data there.
pip install roboflow

After running these commands you have a self-contained environment that is ready for automatic image annotation with GPU-accelerated PyTorch, YOLOv8, Autodistill, and Grounding DINO.
If anything breaks later, you can always recreate the environment from this exact command list.

Turning horse-race videos into a clean image dataset

The next goal is to convert your horse-race videos into a set of still images that you can annotate automatically.
Automatic image annotation works on images, so this step bridges the gap between video and dataset.
Here you use Supervision to list video files, walk through each frame with a stride, save frames to disk, and then preview a sample grid of images to verify that the extraction looks good.

### Import the supervision library, which provides utilities for working with videos and images.
import supervision as sv

### Import tqdm's notebook-friendly progress bar in case you want to monitor long operations.
from tqdm.notebook import tqdm

### Define the directory where your source horse-race videos are stored.
VIDEO_DIR_PATH = "C:/Data-sets/Horse-race/Source-Data/videos"

### Define the directory where all extracted frames will be saved as images.
IMAGE_DIR_PATH = "C:/Data-sets/Horse-race/Source-Data/All-images"

### Set the frame stride so that every tenth frame is extracted from each video.
FRAME_STRIDE = 10

### Collect all video paths with .mov or .mp4 extensions from the source directory.
video_paths = sv.list_files_with_extensions(directory=VIDEO_DIR_PATH, extensions=["mov", "mp4"])

### Print the list of video paths to confirm that the videos were found correctly.
print(video_paths)

### Initialize a counter to keep track of how many images have been extracted.
image_numerator = 0

### Loop over every video path found in the directory.
for video_path in video_paths:
    ### Print the name of the current video being processed for easier tracking.
    print("Extracted video name : " + str(video_path))

    ### Extract the stem of the video path to use as a prefix for image file names.
    video_name = video_path.stem

    ### Build a pattern for naming images that includes the video name and a zero-padded index.
    image_name_pattern = video_name + "-{:05d}.png"

    ### Open an ImageSink context manager that will save frames into the target image directory.
    with sv.ImageSink(target_dir_path=IMAGE_DIR_PATH, image_name_pattern=image_name_pattern) as sink:
        ### Iterate over frames from the video using the chosen frame stride.
        for image in sv.get_video_frames_generator(source_path=str(video_path), stride=FRAME_STRIDE):

            ### Increment the image counter for each extracted frame.
            image_numerator = image_numerator + 1

            ### Print the index of the extracted frame so you can follow the progress in the console.
            print("Extract image no. " + str(image_numerator))

            ### Save the current frame as an image file using the sink.
            sink.save_image(image=image)

### Define the directory containing all extracted images so we can visualize them.
IMAGE_DIR_PATH = "C:/Data-sets/Horse-race/Source-Data/All-images"

### List all image paths with .png or .jpg extensions from the extracted frames directory.
image_paths = sv.list_files_with_extensions(directory=IMAGE_DIR_PATH, extensions=["png", "jpg"])

### Print the total number of images found so you understand the dataset size.
print("image count : ", len(image_paths))

### Set how many sample images you want to display in the grid.
SAMPLE_SIZE = 16

### Define the grid layout as four rows by four columns.
SAMPLE_GRID_SIZE = (4, 4)

### Define the overall plot size for the image grid in inches.
SAMPLE_PLOT_SIZE = (16, 16)

### Import OpenCV, which will be used to read images from disk.
import cv2

### Create a list of titles based on the stem of each image path for the first SAMPLE_SIZE images.
titles = [
    image_path.stem
    for image_path in image_paths[:SAMPLE_SIZE]
]

### Read the first SAMPLE_SIZE images from disk into memory using OpenCV.
images = [
    cv2.imread(str(image_path))
    for image_path in image_paths[:SAMPLE_SIZE]
]

### Plot the sample images in a grid so you can visually inspect the extracted frames.
sv.plot_images_grid(images=images, titles=titles, grid_size=SAMPLE_GRID_SIZE, size=SAMPLE_PLOT_SIZE)

After this block you should have a folder of PNG images, each named after its source video plus a zero-padded index, and a visual confirmation that the frames cover different moments of the race.
This gives your automatic image annotation step a solid and diverse set of inputs to work with.
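
Before extracting, it can help to estimate how many images a given stride will produce. The sketch below assumes the frame generator yields frame 0 and then every `stride`-th frame after it; the function name and the example clip length are hypothetical.

```python
import math


def expected_frame_count(total_frames, stride):
    """Estimate how many frames are kept when sampling every `stride`-th frame.

    Assumes sampling starts at frame 0, so frames 0, stride, 2*stride, ...
    are kept until the video ends.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return math.ceil(total_frames / stride)


# Hypothetical clip: 60 seconds at 30 frames per second, stride of 10.
print(expected_frame_count(60 * 30, 10))
# → 180
```

If the estimate is much larger than you need, raising the stride also reduces near-duplicate frames, which add labeling time without adding much variety.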

Using Autodistill and Grounding DINO for automatic image annotation

Now that you have raw images, it is time to apply automatic image annotation.
Autodistill uses a powerful model like Grounding DINO behind the scenes to detect objects that match your text prompts and then saves the detections as YOLO-style labels.
In this section you define your ontology, configure thresholds, initialize the Grounding DINO base model, and run the annotation process over all extracted frames.

### Import CaptionOntology from Autodistill's detection module to define your label names.
from autodistill.detection import CaptionOntology

### Create a caption ontology mapping natural language prompts to class names for the horse race domain.
ontology = CaptionOntology({
    "horse race": "horse race",
    "horse": "horse",
    "horse in a race": "horse in a race",
    "horse racing": "horse racing",
})

### Define the path to the folder containing all extracted image frames.
IMAGE_DIR_PATH = "C:/Data-sets/Horse-race/Source-Data/All-images"

### Define the root path where the auto-labeled dataset will be created.
DATASET_DIR_PATH = "C:/Data-sets/Horse-race/dataset"

### Set the detection box confidence threshold used by Grounding DINO.
BOX_THRESHOLD = 0.6

### Set the text confidence threshold for matching prompts to detections.
TEXT_THRESHOLD = 0.50

### Import the GroundingDINO base model wrapper used by Autodistill.
from autodistill_grounding_dino import GroundingDINO

### Initialize the GroundingDINO model with your ontology and chosen thresholds.
base_model = GroundingDINO(
    ontology=ontology,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD,
)

### Run automatic image annotation on all PNG images in the input folder and save YOLO labels into the dataset directory.
dataset = base_model.label(
    input_folder=IMAGE_DIR_PATH,
    extension=".png",
    output_folder=DATASET_DIR_PATH,
)

When this block finishes, your automatic image annotation pipeline has generated YOLO-formatted label files under the dataset directory.
Autodistill and Grounding DINO have effectively turned raw horse-race frames into a structured dataset ready for YOLOv8 training.
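
A quick way to get a feel for what the annotator produced is to count how many boxes each class received. The following is a minimal sketch using only the standard library; the `count_classes` helper is an assumption for illustration, and it relies on the class list being in the same order as the ontology.

```python
from collections import Counter
from pathlib import Path


def count_classes(label_dir, class_names):
    """Count annotated boxes per class across all YOLO .txt files in a folder."""
    counts = Counter()
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # The first value on each YOLO line is the class index.
            counts[class_names[int(parts[0])]] += 1
    return counts


# Example usage (path and class order taken from this tutorial):
# names = ["horse race", "horse", "horse in a race", "horse racing"]
# print(count_classes("C:/Data-sets/Horse-race/dataset/train/labels", names))
```

If one prompt dominates and the others are nearly empty, that may be a hint to simplify the ontology before training.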

Visualizing YOLO annotations to validate automatic labels

Automatic image annotation is powerful, but you still want to spot-check the results.
In this section you parse YOLO label files, convert normalized coordinates into pixel positions, draw bounding boxes and class names on top of the images, and display a handful of random examples.
This helps you quickly see whether the automatic annotations from Grounding DINO are accurate enough for training a strong YOLOv8 model.

### Import the os module for working with file paths and directory listings.
import os

### Import the random module to randomly sample images for visualization.
import random

### Import matplotlib for potential plotting needs, although here we mainly use OpenCV windows.
import matplotlib.pyplot as plt

### Import OpenCV to read images and draw rectangles and text on them.
import cv2

### Define the human-readable label names that correspond to YOLO class indices.
label_names = ["horse race", "horse", "horse in a race", "horse racing"]

### Define a function that reads a YOLO-format label file and returns a list of annotations.
def get_annotations(original_img, label_file):
    ### Open the label file and read all lines containing annotation data.
    with open(label_file, 'r') as file:
        lines = file.readlines()

    ### Initialize an empty list that will hold tuples of (label_index, x_center, y_center, width, height).
    annotations = []

    ### Loop over each line in the label file.
    for line in lines:
        ### Split the line into separate values, starting with the label index.
        values = line.split()
        label = values[0]

        ### Convert the remaining values from strings into floating-point numbers.
        x, y, w, h = map(float, values[1:])
        ### Append the parsed annotation as a tuple into the list.
        annotations.append((label, x, y, w, h))

    ### Return the full list of parsed annotations.
    return annotations

### Define a function that draws bounding boxes and labels on an image based on YOLO-format annotations.
def put_annotations_in_image(image, annotations):
    ### Extract the image height and width from the shape.
    H, W, _ = image.shape

    ### Loop over each annotation tuple in the list.
    for annotation in annotations:
        ### Unpack the label index and normalized YOLO coordinates.
        label, x, y, w, h = annotation

        ### Print the raw values for debugging purposes.
        print(label, x, y, w, h)

        ### Convert the label index to a human-readable label name.
        label_name = label_names[int(label)]

        ### Convert YOLO normalized coordinates into pixel coordinates for the top-left corner.
        x1 = int((x - w / 2) * W)
        y1 = int((y - h / 2) * H)

        ### Convert YOLO normalized coordinates into pixel coordinates for the bottom-right corner.
        x2 = int((x + w / 2) * W)
        y2 = int((y + h / 2) * H)

        ### Draw a rectangle around the detected object on the image.
        cv2.rectangle(image, (x1, y1), (x2, y2), (200, 200, 0), 1)

        ### Draw the label text just above the top-left corner of the bounding box.
        cv2.putText(image, label_name, (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (200, 200, 0), 2)

    ### Return the annotated image so it can be displayed or saved.
    return image

### Define a function that displays several random images from a folder together with their annotations.
def display_random_images(folder_path, num_images, label_folder):
    ### Collect only file names from the folder path, skipping any subdirectories.
    image_files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]

    ### Randomly select the desired number of image file names, capped at the number of available files.
    selected_images = random.sample(image_files, min(num_images, len(image_files)))

    ### Loop over each selected image file for visualization.
    for i, image_file in enumerate(selected_images):
        ### Read the image from disk using OpenCV.
        img = cv2.imread(os.path.join(folder_path, image_file))

        ### Build the expected label file name by replacing the image extension with .txt.
        label_file = os.path.splitext(image_file)[0] + ".txt"

        ### Join the label folder path with the label file name to get a full path.
        label_file_path = os.path.join(label_folder, label_file)

        ### Read and parse the YOLO-format annotations for this image.
        annotations_Yolo_format = get_annotations(img, label_file_path)

        ### Draw bounding boxes and labels directly on the loaded image based on the annotations.
        image_with_annotations = put_annotations_in_image(img, annotations_Yolo_format)

        ### Print the shape of the annotated image for quick verification.
        print(image_with_annotations.shape)

        ### Display the annotated image in an OpenCV window.
        cv2.imshow("img no. " + str(i), image_with_annotations)

        ### Wait for a key press before moving on to the next image.
        cv2.waitKey(0)

    ### Close all OpenCV windows once every selected image has been shown.
    cv2.destroyAllWindows()

### Define the path to the training images generated by the automatic image annotation step.
images_path = "C:/Data-sets/Horse-race/dataset/train/images"

### Define the path to the YOLO label files associated with the training images.
label_folder = "C:/Data-sets/Horse-race/dataset/train/labels"

### Set how many random images you want to display with annotations.
num_images = 8

### Run the visualization function to inspect random annotated training images.
display_random_images(images_path, num_images, label_folder)

After checking a few randomly selected annotated images, you will quickly see whether the automatic image annotation step is doing a good job.
If the boxes are mostly tight and labels make sense, you can move on to training YOLOv8 with confidence.
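
Visual spot-checks only cover a handful of images, so a complementary check is to list images that have no matching label file at all. This is a small standard-library sketch; the `find_unlabeled_images` helper and its default extensions are assumptions made for the example.

```python
from pathlib import Path


def find_unlabeled_images(image_dir, label_dir, extensions=(".png", ".jpg")):
    """Return the names of images that have no same-named .txt label file."""
    # Collect the stems (file names without extension) of all label files.
    label_stems = {p.stem for p in Path(label_dir).glob("*.txt")}
    return sorted(
        p.name
        for p in Path(image_dir).iterdir()
        if p.is_file()
        and p.suffix.lower() in extensions
        and p.stem not in label_stems
    )


# Example usage (paths taken from this tutorial):
# missing = find_unlabeled_images(
#     "C:/Data-sets/Horse-race/dataset/train/images",
#     "C:/Data-sets/Horse-race/dataset/train/labels",
# )
# print(len(missing), "images without a label file")
```

An image without a label file would also make a script that opens the label file directly fail, so running a check like this first can save a confusing error.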

Training a YOLOv8 model on auto-labeled horse-race data

With a validated dataset, the next step is to train YOLOv8 on the auto-labeled images.
In this section you configure the YOLOv8 model, point it at a data.yaml file that describes the dataset, and run training with sensible defaults for epochs, batch size, and patience.
The result is a set of checkpoints, including a best.pt file that you will later use for inference.

### Import the YOLO class from the Ultralytics library to handle training and inference.
from ultralytics import YOLO

### Define a main function to encapsulate the training logic.
def main():
    ### Load a YOLOv8 large model configuration from the provided YAML file.
    model = YOLO("yolov8l.yaml")

    ### Define the path to the data configuration file that describes your dataset.
    config_file_path = "Best-Object-Detection-models/Yolo-V8/Auto-Annotation-YoloV8-Detecting-horses/data.yaml"

    ### Define the directory where YOLOv8 will store training runs and checkpoints.
    project = "C:/Data-sets/Horse-race/dataset/checkpoints"

    ### Set the experiment name to distinguish this run from others in the project folder.
    experiment = "My-Large-Model"

    ### Choose a batch size that fits your GPU memory and training speed.
    batch_size = 16

    ### Start training the YOLOv8 model with your custom data and chosen hyperparameters.
    results = model.train(
        data=config_file_path,
        epochs=100,
        project=project,
        name=experiment,
        batch=batch_size,
        device=0,
        patience=40,
        imgsz=640,
        verbose=True,
        val=True,
    )

### Ensure that the main function only runs when this script is executed directly.
if __name__ == "__main__":
    main()

Here is the data.yaml file that connects YOLOv8 to the images and labels generated by your automatic image annotation step.

train: C:/Data-sets/Horse-race/dataset/train/images
val: C:/Data-sets/Horse-race/dataset/valid/images

# class names
nc: 4
names: ["horse race" , "horse" , "horse in a race", "horse racing"]

When training finishes, YOLOv8 saves the best model weights as best.pt in the run's weights folder inside the checkpoints directory (here, checkpoints/My-Large-Model/weights/best.pt).
These weights encode what the model has learned from your automatically annotated horse-race dataset and are ready to be used for video inference.
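
Because the class count and class names in data.yaml must stay in step with the ontology, one option is to generate the file contents from a single class list instead of editing them by hand. The sketch below builds the text with plain string formatting; the `build_data_yaml` helper is a hypothetical convenience, not something provided by Ultralytics or Autodistill.

```python
def build_data_yaml(train_dir, val_dir, class_names):
    """Build the text of a YOLO data.yaml from dataset paths and class names."""
    # Quote every name so that names containing spaces stay intact.
    names = ", ".join(f'"{name}"' for name in class_names)
    return (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        "\n"
        "# class names\n"
        f"nc: {len(class_names)}\n"
        f"names: [{names}]\n"
    )


# Class order copied from the ontology used in this tutorial.
classes = ["horse race", "horse", "horse in a race", "horse racing"]
yaml_text = build_data_yaml(
    "C:/Data-sets/Horse-race/dataset/train/images",
    "C:/Data-sets/Horse-race/dataset/valid/images",
    classes,
)
print(yaml_text)
```

Writing `yaml_text` to data.yaml after every ontology change keeps `nc` and `names` from drifting apart.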

Running real-time horse-race detection on new videos

The final piece of the pipeline is to take your trained YOLOv8 model and run it on a brand-new race video.
You use OpenCV to read frames from the video, feed each frame through YOLOv8, draw bounding boxes and class names for any detected horses or race scenes, and show the results in a live window.
This is where your automatic image annotation work turns into something visual and satisfying.

### Import OpenCV so you can read video frames and draw visual annotations.
import cv2

### Import the YOLO class from Ultralytics to load the trained model.
from ultralytics import YOLO

### Import the os module to help build a path to the best model weights.
import os

### Define the path to the test video you want to run the trained model on.
video_path = "C:/Data-sets/Horse-race/Source-Data/Test-videos/video12.mp4"

### Build the full path to the best model checkpoint saved during training.
model_path = os.path.join(
    "C:/Data-sets/Horse-race/dataset/checkpoints",
    "My-Large-Model",
    "weights",
    "best.pt",
)

### Load the trained YOLOv8 model from the checkpoint file.
model = YOLO(model_path)

### Set a confidence threshold so that only sufficiently strong detections are drawn.
threshold = 0.25

### Create an OpenCV VideoCapture object to read frames from the test video.
cap = cv2.VideoCapture(video_path)

### Loop over frames from the video until there are no more frames to read.
while True:
    ### Read the next frame from the VideoCapture object.
    ret, frame = cap.read()

    ### If no frame was read, break out of the loop because the video has finished.
    if not ret:
        break

    ### Run the YOLOv8 model on the current frame and take the first result object.
    results = model(frame)[0]

    ### Loop over each detected box in the YOLOv8 results.
    for result in results.boxes.data.tolist():
        ### Unpack the bounding box coordinates, score, and class ID from the result.
        x1, y1, x2, y2, score, class_id = result

        ### Convert floating-point coordinates into integer pixel positions.
        x1 = int(x1)
        x2 = int(x2)
        y1 = int(y1)
        y2 = int(y2)

        ### Only draw boxes for detections whose confidence score is above the chosen threshold.
        if score > threshold:
            ### Draw a green rectangle around the detected object.
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 1)

            ### Draw the class name text just above the bounding box.
            cv2.putText(
                frame,
                results.names[int(class_id)].upper(),
                (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                (0, 255, 0),
                1,
            )

    ### Display the annotated frame in a window titled "Video".
    cv2.imshow("Video", frame)

    ### Check for a key press and break if the 'q' key is pressed.
    if cv2.waitKey(25) & 0xFF == ord("q"):
        break

### Release the VideoCapture resource after the loop ends.
cap.release()

### Close any OpenCV windows that were opened during visualization.
cv2.destroyAllWindows()

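When you run this script, you will see bounding boxes drawn around horses and race scenes in every frame of the video. The model detects objects independently in each frame, so the boxes follow the horses without any explicit object tracking.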
This closes the loop from raw video to automatic image annotation to a trained YOLOv8 model performing live detection.

FAQ: Automatic image annotation with Autodistill and YOLOv8

What does automatic image annotation mean in this project?

In this project, automatic image annotation means using Autodistill with Grounding DINO to detect objects in images and automatically write YOLO-style label files, so you can skip manual box drawing.

Why are horse-race videos converted into frames first?

Converting videos into frames turns them into a standard image dataset, which makes it easier to run automatic annotation, manage training splits, and debug individual examples.

How does Grounding DINO help with labeling?

Grounding DINO links text prompts to visual regions, so when you provide phrases like “horse racing” it can detect matching objects and return bounding boxes that Autodistill converts into labels.

What is the purpose of the ontology in Autodistill?

The ontology defines the list of classes and their associated prompts, ensuring that all detections are mapped to consistent class IDs and names for YOLOv8 training.

How do I check if the automatic labels are reliable?

You can use the visualization code to overlay bounding boxes on random images and visually inspect whether the objects and labels align with what you see in the frame.

Why is YOLOv8 used for training instead of another model?

YOLOv8 offers a simple training API, strong performance, and easy deployment to images and videos, making it a practical choice once your automatic image annotation dataset is ready.

Can I adapt this pipeline to detect different types of objects?

Yes, you can change the ontology prompts, adjust the dataset paths, and retrain YOLOv8 to detect any objects that Grounding DINO can reliably identify from your images.

Do I need to modify the data.yaml file when I change classes?

Whenever you change the set of classes or dataset folder structure, you should update the data.yaml file so that YOLOv8 knows the correct paths, class count, and class names.

Is a GPU required for training this model efficiently?

A modern GPU is highly recommended because it significantly speeds up both automatic labeling and YOLOv8 training, especially when working with many video frames.

What are some good next steps after completing this tutorial?

After you complete this tutorial, you can scale up to more videos, experiment with different YOLOv8 model sizes, or integrate segmentation-based auto-labeling to capture richer object details.

Wrapping up the automatic image annotation pipeline

By the time you reach the end of this tutorial, you have walked through a complete pipeline that turns raw horse-race videos into a working object detector.
Automatic image annotation with Autodistill and Grounding DINO replaced hours of manual labeling with a fast, scriptable process that still leaves room for human review where it matters most.

You saw how to extract frames with Supervision, define an ontology for your classes, and let Grounding DINO generate high-quality YOLO labels.
You validated those labels visually, trained a YOLOv8 model on the resulting dataset, and then watched the detector pick out horses frame by frame on a new race video.

The same pattern scales to other domains with only minor changes.
By updating your prompts, dataset paths, and data.yaml file, you can reuse this automatic image annotation workflow for traffic analysis, sports analytics, industrial inspection, or any other problem where labeled images are hard to obtain.

Most importantly, you now have a repeatable system.
Whenever you collect new video, you can run it through the same steps to grow your dataset, retrain YOLOv8, and keep your model aligned with the real world.
That is the real power of combining automatic image annotation with a modern detector like YOLOv8.

Connect :

☕ Buy me a coffee — https://ko-fi.com/eranfeit

🖥️ Email : feitgemel@gmail.com

🌐 https://eranfeit.net

🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb

Enjoy,

Eran