Introduction
In this tutorial, you will build a voice-controlled, real-time object detection app in Python using OpenCV DNN and YOLOv4-tiny, with the SpeechRecognition library handling speech-to-text.
The goal is to let you speak the name of an object and instantly highlight matching detections on the live camera feed.
This approach combines real-time object detection in Python with speech recognition to create an interactive computer vision experience that feels natural and responsive.
You will learn how to load YOLOv4-tiny with OpenCV, capture and transcribe audio, parse class names, and conditionally draw bounding boxes when a spoken keyword matches a detected label.
The code is divided into three clean parts for setup, voice recording, and the main detection loop so you can copy, paste, and run quickly.
You can find more similar tutorials in my blog posts page here : https://eranfeit.net/blog/
You can find the full code here : https://ko-fi.com/s/90dee146e6
You can find the video here : https://www.youtube.com/watch?v=fd1msoIpM5Q
Instructions :
conda create -n DetectObejctByAudio python=3.8
conda activate DetectObejctByAudio
pip install opencv-python
pip install opencv-contrib-python
pip install pandas
pip install sounddevice
pip install scipy
pip install soundfile
pip install SpeechRecognition

# Download the YOLOv4-tiny model
# cfg file :
# https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-tiny.cfg
# weights file :
# https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.weights
# classes.txt is in my repo. It is based on the COCO dataset.
Here is the file classes.txt
person
bicycle
car
motorbike
aeroplane
bus
train
truck
boat
traffic light
fire hydrant
stop sign
parking meter
bench
bird
cat
dog
horse
sheep
cow
elephant
bear
zebra
giraffe
backpack
umbrella
handbag
tie
suitcase
frisbee
skis
snowboard
sports ball
kite
baseball bat
baseball glove
skateboard
surfboard
tennis racket
bottle
wine glass
cup
fork
knife
spoon
bowl
banana
apple
sandwich
orange
broccoli
carrot
hot dog
pizza
donut
cake
chair
sofa
pottedplant
bed
diningtable
toilet
tvmonitor
laptop
mouse
remote
keyboard
cell phone
microwave
oven
toaster
sink
refrigerator
book
clock
vase
scissors
teddy bear
hair drier
toothbrush
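Before running the app, it can help to confirm that the three downloaded files are where the code expects them. The snippet below is a minimal sketch, not part of the original code: the helper name missing_model_files and the REQUIRED_FILES tuple are hypothetical, and the folder name is an assumption you should adjust to your own layout.

```python
from pathlib import Path

# Hypothetical helper, not part of the original tutorial code.
# The three file names come from the download instructions above.
REQUIRED_FILES = ("yolov4-tiny.cfg", "yolov4-tiny.weights", "classes.txt")


def missing_model_files(folder):
    """Return the names of the required files that are not present in folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # "DetectByAudio" is an assumed folder name; point it at your own copy.
    missing = missing_model_files("DetectByAudio")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All model files found.")
```

If the list comes back empty, the model and class files are in place and you can move on to Part 1.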
Part 1 : OpenCV, YOLOv4-tiny, and Class List
This section prepares all dependencies, loads the YOLOv4-tiny model into OpenCV’s DNN module, configures input size and scale, and reads the classes.txt file to map detections to human-readable labels.
It establishes the foundation for real-time object detection in Python and ensures class names are available for your voice filter.
### Import OpenCV for computer vision operations.
import cv2
### Import pandas to conveniently read the class names file as a table.
import pandas as pd
### Import sounddevice for recording audio from the microphone.
import sounddevice as sd # for the record
### Import write from scipy.io.wavfile to save recorded audio as WAV.
from scipy.io.wavfile import write # to save the file
### Import NumPy for fast numerical array operations.
import numpy as np
### Import soundfile to convert audio encodings when needed.
import soundfile # for converting the audio format
### Import SpeechRecognition to transcribe recorded audio to text.
import speech_recognition as sr # for speech to text

### Load the YOLOv4-tiny model weights and config into OpenCV’s DNN.
net = cv2.dnn.readNet("C:/GitHub/Open-CV/DetectByAudio/yolov4-tiny.weights", "C:/GitHub/Open-CV/DetectByAudio/yolov4-tiny.cfg")
### Wrap the network in DetectionModel for simple detect() calls.
model = cv2.dnn_DetectionModel(net)
### Set input size and scale so frames are preprocessed correctly for YOLOv4-tiny.
model.setInputParams(size=(416, 416), scale=1/255)

### Prepare a list to hold class names in the same order as the model’s outputs.
classesNames = []
### Read the classes file (one class per line) using pandas.
df = pd.read_csv("DetectByAudio/classes.txt", header=None, names=["ClassName"])
### Iterate over rows to append each class name to the list.
for index, row in df.iterrows():
    ### Fetch the current class name by index from the DataFrame.
    ClassName = df.iloc[index]['ClassName']
    ### Store the class name so we can label detections later.
    classesNames.append(ClassName)
### Optionally inspect the loaded classes during development.
# print(classesNames)

### Open the default camera for real-time video capture.
cap = cv2.VideoCapture(0)
### Set desired capture width for the live stream window.
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
### Set desired capture height for the live stream window.
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

### Define top-left corner and bottom-right corner for a clickable “record” button.
x1 = 20
y1 = 20
x2 = 570
y2 = 90

### Sampling rate in Hz for audio recording.
fs = 44100 # audio rate
### Duration in seconds for the voice snippet to record.
secods = 3 # duration
### Path where the raw recorded audio file will be saved.
audioFileName = "c:/temp/output.wav"
### Flag that indicates whether we should highlight matches based on voice command.
ButtonFlag = False
### Stores the latest transcribed text from the microphone (“what to look for”).
LookForThisClassName = ""
You imported all libraries, loaded YOLOv4-tiny into OpenCV’s DNN, read class labels, opened the webcam, and defined UI and audio parameters.
The pipeline is now ready for fast YOLO object detection with OpenCV, with voice control added in the next part.
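The class list can also be read with plain Python instead of pandas. The sketch below is one possible alternative, not the method used in this tutorial: load_class_names is a hypothetical helper name. It keeps the lines in file order, which matters because each class ID returned by the model is used as an index into this list.

```python
from pathlib import Path


def load_class_names(path):
    """Read one class name per line, in file order, skipping blank lines.

    Hypothetical helper, shown as an alternative to the pandas loop.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example usage (path as used in Part 1):
# classesNames = load_class_names("DetectByAudio/classes.txt")
```

Multi-word labels such as "traffic light" stay intact because each line is kept whole.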
Part 2 : Voice Command Recording via Mouse Click
Here you build an interactive button overlay.
When the user left-clicks inside the button area, the app records a short audio clip, saves it, and converts it to a SpeechRecognition-friendly format.
The recognized text becomes the filter term for highlighting detections.
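The "click inside the button" test is a simple point-in-rectangle check. Here is a small sketch of that idea in isolation. The Button class and RECORD_BUTTON name are hypothetical and only for illustration; the coordinates are the x1, y1, x2, y2 values defined in Part 1.

```python
from typing import NamedTuple


class Button(NamedTuple):
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    x1: int
    y1: int
    x2: int
    y2: int

    def contains(self, x, y):
        """Return True when the point (x, y) lies inside or on the border."""
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


# Same coordinates as the record button defined in Part 1.
RECORD_BUTTON = Button(20, 20, 570, 90)

print(RECORD_BUTTON.contains(100, 50))   # a click inside the button
print(RECORD_BUTTON.contains(600, 50))   # a click to the right of the button
```

Keeping the check in one place makes it easier to move or resize the button later without touching the callback logic.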
### Define a mouse callback that records audio upon clicking inside the button area.
def recordAudioByMouseClick(event, x, y, flags, params):
    ### Declare that we will modify the global flags inside this function.
    global ButtonFlag
    global LookForThisClassName

    ### If the left mouse button was pressed, check whether it is inside the button region.
    if event == cv2.EVENT_LBUTTONDOWN:
        ### Verify the click lies within the button’s bounding box.
        if x1 <= x <= x2 and y1 <= y <= y2:
            ### Provide console feedback for debugging.
            print("Click inside the button")
            ### Record a stereo audio snippet for the configured duration at the given sampling rate.
            myrecording = sd.rec(int(secods * fs), samplerate=fs, channels=2)
            ### Block until recording is complete so we can safely save the file.
            sd.wait() # wait until the recording is finished
            ### Write the recorded audio to a WAV file for later processing.
            write(audioFileName, fs, myrecording) # save the audio file
            ### Run speech-to-text on the newly recorded audio and store the transcribed text.
            LookForThisClassName = getTextFromAudio()
            ### Turn on the filter flag so matches will be highlighted during detection.
            if ButtonFlag is False:
                ButtonFlag = True
        ### If the click is outside the button, disable the filtering behavior for clarity.
        else:
            print("Click outside the button")
            ButtonFlag = False
You created a mouse-driven recorder that captures audio on demand and updates the global state with the transcribed phrase to search for.
The transcribed phrase is what the detection loop uses as its filter in the next part.
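The conversion step in this tutorial re-encodes the recording as 16-bit PCM, the format the code comments describe as the one SpeechRecognition expects. To show what that encoding means, here is a minimal sketch that writes a 16-bit PCM WAV using only the Python standard library. It is an illustration under stated assumptions, not the tutorial's method (which uses soundfile): write_pcm16_wav is a hypothetical helper, and it assumes the samples are floats in the range -1.0 to 1.0.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate, channels=1):
    """Write float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    samples is a flat sequence; for multi-channel audio the values
    are expected to be interleaved (left, right, left, right, ...).
    Hypothetical helper for illustration only.
    """
    # Clip out-of-range values, then scale to the signed 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        # "<h" packs each value as a little-endian signed 16-bit integer.
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))
```

In the tutorial itself, soundfile.write with subtype='PCM_16' does this job in a single call.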
Part 3 : Real-Time Detection Loop and Speech-Driven Highlighting
This part converts audio to text, registers the mouse handler, and runs the main detection loop.
Each frame is fed to YOLOv4-tiny.
When a detection’s class name appears in your spoken text, the code highlights that object with a bounding box and label.
A filled button labeled “Click for record - 3 seconds” is drawn onto the frame for intuitive interaction.
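Matching a class name against the spoken text has a few edge cases worth knowing. Python's str.find returns -1 when nothing is found and 0 when the match starts at the first character, so "car".find("car") is 0, not a positive number. A plain substring test also matches "car" inside "carrot". The sketch below shows one possible stricter matcher. It is an assumption-based illustration, not the tutorial's code: spoken_class_names is a hypothetical helper that compares case-insensitively and only accepts whole words.

```python
import re


def spoken_class_names(transcript, class_names):
    """Return the class names that occur as whole words in the transcript.

    Hypothetical helper: case-insensitive, whole-word matching.
    """
    text = transcript.lower()
    matches = []
    for name in class_names:
        # \b anchors the match to word boundaries, so "car" does not match "carrot".
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, text):
            matches.append(name)
    return matches


print(spoken_class_names("Show me the traffic light and a dog",
                         ["traffic light", "dog", "hot dog", "car"]))
```

With a helper like this, the loop could check whether a detection's label is in the returned list instead of comparing string positions.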
### Convert recorded audio into a 16-bit PCM WAV and transcribe it using Google’s recognizer.
def getTextFromAudio():
    ### Read the recorded audio file with soundfile to inspect data and sample rate.
    data, samplerate = soundfile.read(audioFileName)
    ### Re-encode the audio as 16-bit PCM which SpeechRecognition expects for best compatibility.
    soundfile.write('c:/temp/outputNew.wav', data, samplerate, subtype='PCM_16')
    ### Create a recognizer instance to handle speech-to-text inference.
    recognizer = sr.Recognizer()
    ### Wrap the converted WAV in an AudioFile so the recognizer can read it.
    jackhammer = sr.AudioFile('c:/temp/outputNew.wav')
    ### Open the audio source context and load the entire clip into memory.
    with jackhammer as source:
        audio = recognizer.record(source)
    ### Use the default Google Web Speech API backend to recognize spoken words.
    result = recognizer.recognize_google(audio)
    ### Print the transcription for visibility and debugging.
    print(result)
    ### Return the recognized text to the caller so the UI can use it as a filter.
    return result

### Create a named window that will serve as the display target for frames and UI overlays.
cv2.namedWindow("Frame") # set the same name
### Attach the mouse callback so clicks over the window trigger recording logic.
cv2.setMouseCallback("Frame", recordAudioByMouseClick)

### Start the main application loop to process frames until the user exits.
while True:
    ### Read a frame from the capture device; rtn indicates success.
    rtn, frame = cap.read()
    ### Stop the loop if the camera did not return a frame.
    if not rtn:
        break
    ### Run object detection on the current frame to obtain class IDs, confidence scores, and boxes.
    (class_ids, scores, bboxes) = model.detect(frame)
    ### You can inspect raw results during development if needed.
    # print("Class ids:", class_ids)
    # print("Scores :", scores)
    # print("Bboxes :", bboxes)

    ### Iterate over parallel lists of detections to draw and label regions of interest.
    for class_id, score, bbox in zip(class_ids, scores, bboxes):
        ### Unpack the bounding box as x, y for top-left and width, height for size.
        x, y, width, height = bbox # x, y is the left upper corner
        ### Retrieve the human-readable class label for the detected object.
        name = classesNames[class_id]
        ### Check if the spoken text contains this class label as a substring.
        ### find() returns -1 when the label is absent and 0 when it starts the text.
        index = LookForThisClassName.find(name) # look for the text inside a string
        ### If filtering is enabled and the label appears in the transcription, highlight the box.
        if ButtonFlag is True and index >= 0:
            ### Draw a rectangle around the matched detection with a custom color and thickness.
            cv2.rectangle(frame, (x, y), (x + width, y + height), (130, 50, 50), 3)
            ### Put the class name just above the box for readability.
            cv2.putText(frame, name, (x, y - 10), cv2.FONT_HERSHEY_COMPLEX, 1, (120, 50, 50), 2)

    ### Draw a filled UI button prompting the user to click and record a 3-second snippet.
    cv2.rectangle(frame, (x1, y1), (x2, y2), (153, 0, 0), -1) # -1 is a filled rectangle
    ### Render readable button text to guide the user interaction flow.
    cv2.putText(frame, "Click for record - 3 seconds", (40, 60), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 255, 255), 2) # white color
    ### Show the annotated frame in the display window named “Frame”.
    cv2.imshow("Frame", frame)
    ### Allow the user to quit the loop by pressing 'q' on the keyboard.
    if cv2.waitKey(1) == ord('q'):
        break

### Release the camera resource once the loop ends to free the device.
cap.release()
### Destroy any OpenCV windows that were created during execution.
cv2.destroyAllWindows()
You converted the audio to a compatible format, transcribed the text, and ran the live detection loop.
When your speech includes a class label, the app draws a box and label around matching objects, giving you an engaging real-time object detection demo in Python that is driven by your voice.
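One detail you may want to refine is where the label is drawn. The detector returns each box as x, y, width, height, and the label is placed 10 pixels above the top edge, which can push the text off screen when an object touches the top of the frame. The sketch below shows one way to handle that. It is only an illustration: box_corners and label_origin are hypothetical helpers, and the text_height of 25 pixels is an assumed value, not a measured one.

```python
def box_corners(bbox):
    """Convert (x, y, width, height) to top-left and bottom-right corners."""
    x, y, width, height = (int(v) for v in bbox)
    return (x, y), (x + width, y + height)


def label_origin(bbox, margin=10, text_height=25):
    """Pick a text origin above the box, or inside it near the top edge.

    text_height is an assumed approximate glyph height in pixels.
    """
    x, y, _width, _height = (int(v) for v in bbox)
    above = y - margin
    if above >= text_height:
        # Enough room above the box: keep the label above it.
        return (x, above)
    # Too close to the top of the frame: draw the label inside the box.
    return (x, y + text_height + margin)


print(box_corners((10, 20, 100, 50)))
print(label_origin((10, 5, 50, 50)))
```

The two tuples from box_corners can be passed directly to cv2.rectangle, and the result of label_origin to cv2.putText.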
Connect :
☕ Buy me a coffee — https://ko-fi.com/eranfeit
🖥️ Email : feitgemel@gmail.com
🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb
Enjoy,
Eran