Real-Time Object Detection in Python with Voice Commands (OpenCV + YOLOv4-tiny)

Last Updated on 23/10/2025 by Eran Feit

You can find the video here : https://www.youtube.com/watch?v=fd1msoIpM5Q

What you’ll build

A Python app that listens for your voice command (for example, “person”, “bottle”, “dog”) and highlights only those objects in the webcam stream. We’ll use:

OpenCV DNN to run YOLOv4-tiny in real time
SpeechRecognition + sounddevice to record and transcribe audio
A simple UI overlay button to record a 3-second clip on click

By the end, you’ll speak a class name and the app will box just those detections.

You can find more similar tutorials in my blog posts page here : https://eranfeit.net/blog/

You can find the full code here : https://ko-fi.com/s/90dee146e6

Prerequisites

Python 3.8+ (Conda recommended)
Webcam
Basic familiarity with OpenCV and Python

1) Environment & dependencies

# Create and activate environment
conda create -n DetectObjectByAudio python=3.8 -y
conda activate DetectObjectByAudio

# Core libraries
# opencv-contrib-python already contains the main OpenCV modules, so install only one OpenCV package
pip install opencv-contrib-python numpy pandas
pip install sounddevice soundfile scipy SpeechRecognition
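
Before moving on, it can help to confirm that the environment actually contains everything installed above. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_modules` is a hypothetical one chosen for this illustration, and the list holds the import names (not the pip names) of the packages used in this tutorial.

```python
import importlib.util

# Import names (not pip names) of the packages this tutorial relies on.
REQUIRED_MODULES = [
    "cv2",                 # provided by the OpenCV pip package
    "numpy",
    "pandas",
    "sounddevice",
    "soundfile",
    "scipy",
    "speech_recognition",  # installed with pip as "SpeechRecognition"
]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated Conda environment. If anything is reported as missing, re-run the matching pip command.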

Download YOLOv4-tiny files (cfg & weights) and keep them together:

yolov4-tiny.cfg:
https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-tiny.cfg
yolov4-tiny.weights:
https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/yolov4-tiny.weights

Create classes.txt with COCO labels (one per line). You can paste the full list below.

2) COCO classes file (`classes.txt`)

person
bicycle
car
motorbike
aeroplane
bus
train
truck
boat
traffic light
fire hydrant
stop sign
parking meter
bench
bird
cat
dog
horse
sheep
cow
elephant
bear
zebra
giraffe
backpack
umbrella
handbag
tie
suitcase
frisbee
skis
snowboard
sports ball
kite
baseball bat
baseball glove
skateboard
surfboard
tennis racket
bottle
wine glass
cup
fork
knife
spoon
bowl
banana
apple
sandwich
orange
broccoli
carrot
hot dog
pizza
donut
cake
chair
sofa
pottedplant
bed
diningtable
toilet
tvmonitor
laptop
mouse
remote
keyboard
cell phone
microwave
oven
toaster
sink
refrigerator
book
clock
vase
scissors
teddy bear
hair drier
toothbrush
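
Because the label order must match the model's output indices, a quick sanity check of `classes.txt` can save debugging time later. The following is a sketch under the assumption that the file should contain exactly the 80 labels listed above. The helper names `load_class_names` and `check_class_names` are hypothetical and are not part of the tutorial's code.

```python
from pathlib import Path


def load_class_names(path):
    """Read one label per line, trimming whitespace and skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def check_class_names(names, expected_count=80):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    if len(names) != expected_count:
        problems.append(f"expected {expected_count} labels, found {len(names)}")
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate labels: " + ", ".join(duplicates))
    return problems
```

A typical use would be `print(check_class_names(load_class_names("classes.txt")))`, which prints an empty list when the file has 80 unique labels.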

3) Part A — Model & camera setup

This section prepares all dependencies, loads the YOLOv4-tiny model into OpenCV’s DNN module, configures input size and scale, and reads the classes.txt file to map detections to human-readable labels.
It establishes the foundation for real-time object detection in Python and ensures class names are available for your voice filter.

### Import OpenCV for computer vision operations.
import cv2

### Import pandas to conveniently read the class names file as a table.
import pandas as pd

### Import sounddevice for recording audio from the microphone.
import sounddevice as sd  # for the record

### Import write from scipy.io.wavfile to save recorded audio as WAV.
from scipy.io.wavfile import write  # to save the file

### Import NumPy for fast numerical array operations.
import numpy as np

### Import soundfile to convert audio encodings when needed.
import soundfile  # for converting the audio format

### Import SpeechRecognition to transcribe recorded audio to text.
import speech_recognition as sr  # for speech to text



### Load the YOLOv4-tiny model weights and config into OpenCV’s DNN.
net = cv2.dnn.readNet("C:/GitHub/Open-CV/DetectByAudio/yolov4-tiny.weights", "C:/GitHub/Open-CV/DetectByAudio/yolov4-tiny.cfg")

### Wrap the network in DetectionModel for simple detect() calls.
model = cv2.dnn_DetectionModel(net)

### Set input size and scale so frames are preprocessed correctly for YOLOv4-tiny.
model.setInputParams(size=(416, 416), scale=1/255)


### Prepare a list to hold class names in the same order as the model’s outputs.
classesNames = []

### Read the classes file (one class per line) using pandas.
df = pd.read_csv("DetectByAudio/classes.txt", header=None, names=["ClassName"])

### Convert the ClassName column to a plain Python list, keeping the file order.
classesNames = df["ClassName"].tolist()

### Optionally inspect the loaded classes during development.
# print(classesNames)


### Open the default camera for real-time video capture.
cap = cv2.VideoCapture(0)

### Set desired capture width for the live stream window.
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)

### Set desired capture height for the live stream window.
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)


### Define top-left corner and bottom-right corner for a clickable “record” button.
x1 = 20
y1 = 20
x2 = 570
y2 = 90

### Sampling rate in Hz for audio recording.
fs = 44100  # audio rate

### Duration in seconds for the voice snippet to record.
secods = 3  # duration

### Path where the raw recorded audio file will be saved.
audioFileName = "c:/temp/output.wav"

### Flag that indicates whether we should highlight matches based on voice command.
ButtonFlag = False

### Stores the latest transcribed text from the microphone (“what to look for”).
LookForThisClassName = ""

You imported all libraries, loaded YOLOv4-tiny into OpenCV’s DNN, read class labels, opened the webcam, and defined UI and audio parameters.
This sets up the pipeline for fast YOLO object detection with OpenCV; voice control is added in the next parts.

4) Part B — Click-to-record voice command (3 seconds)

Here you build an interactive button overlay.
When the user left-clicks inside the button area, the app records a short audio clip, saves it as a WAV file, and passes it to the speech-to-text step, which re-encodes it as 16-bit PCM for SpeechRecognition.
The recognized text becomes the filter term for highlighting detections.

### Define a mouse callback that records audio upon clicking inside the button area.
def recordAudioByMouseClick(event, x, y, flags, params):

    ### Declare that we will modify the global flags inside this function.
    global ButtonFlag
    global LookForThisClassName

    ### If the left mouse button was pressed, check whether it is inside the button region.
    if event == cv2.EVENT_LBUTTONDOWN:
        ### Verify the click lies within the button’s bounding box.
        if x1 <= x <= x2 and y1 <= y <= y2:
            ### Provide console feedback for debugging.
            print("Click inside the button")

            ### Record a stereo audio snippet for the configured duration at the given sampling rate.
            myrecording = sd.rec(int(secods * fs), samplerate=fs, channels=2)

            ### Block until recording is complete so we can safely save the file.
            sd.wait()  # wait until the recording is finished

            ### Write the recorded audio to a WAV file for later processing.
            write(audioFileName, fs, myrecording)  # save the audio file
            
            ### Run speech-to-text on the newly recorded audio and store the transcribed text.
            LookForThisClassName = getTextFromAudio()

            ### Turn on the filter flag so matches will be highlighted during detection.
            ButtonFlag = True

        ### If the click is outside the button, disable the filtering behavior for clarity.
        else:
            print("Click outside the button")
            ButtonFlag = False

You created a mouse-driven recorder that captures audio on demand and updates the global state with the transcribed phrase to search for.
This powers the voice command object detection experience.
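
One side effect worth knowing: `sd.wait()` and the call to the recognizer run inside the mouse callback, so the video window will likely stop updating while the clip is recorded and transcribed. A possible refinement is to move that work to a background thread. The sketch below is not the tutorial's method. `VoiceFilter` and `capture_phrase` are hypothetical names, and `capture_phrase` stands in for any function that records the clip and returns the transcribed text.

```python
import threading


class VoiceFilter:
    """Holds the latest spoken phrase and runs the slow capture step off the GUI thread."""

    def __init__(self, capture_phrase):
        # capture_phrase: any zero-argument callable that returns the transcribed text.
        self._capture_phrase = capture_phrase
        self._lock = threading.Lock()
        self._worker = None
        self.phrase = ""
        self.enabled = False

    def start(self):
        """Begin a capture unless one is already running. Returns True if started."""
        with self._lock:
            if self._worker is not None and self._worker.is_alive():
                return False
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()
            return True

    def _run(self):
        # Slow part (recording + transcription) happens outside the lock.
        text = self._capture_phrase()
        with self._lock:
            self.phrase = text
            self.enabled = bool(text)

    def wait(self, timeout=None):
        """Block until the current capture finishes (mainly useful in tests)."""
        worker = self._worker
        if worker is not None:
            worker.join(timeout)
```

In this design the mouse callback would only call `start()`, and the detection loop would read `phrase` and `enabled` on every frame, so the video keeps playing while you speak.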

5) Part C — Main detection loop with voice filter

This part converts audio to text, registers the mouse handler, and runs the main detection loop.
Each frame is fed to YOLOv4-tiny.
When a detection’s class name appears in your spoken text, the code highlights that object with a bounding box and label.
A visual “Record 3 seconds” button is drawn onto the frame for intuitive interaction.

### Convert recorded audio into a 16-bit PCM WAV and transcribe it using Google’s recognizer.
def getTextFromAudio():
    ### Read the recorded audio file with soundfile to inspect data and sample rate.
    data, samplerate = soundfile.read(audioFileName)

    ### Re-encode the audio as 16-bit PCM which SpeechRecognition expects for best compatibility.
    soundfile.write('c:/temp/outputNew.wav', data, samplerate, subtype='PCM_16')

    ### Create a recognizer instance to handle speech-to-text inference.
    recognizer = sr.Recognizer()

    ### Wrap the converted WAV in an AudioFile so the recognizer can read it.
    audio_file = sr.AudioFile('c:/temp/outputNew.wav')

    ### Open the audio source context and load the entire clip into memory.
    with audio_file as source:
        audio = recognizer.record(source)
    
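    ### Use the default Google Web Speech API backend to recognize spoken words.
    ### Fall back to an empty string if the speech is unintelligible or the service is unreachable.
    try:
        result = recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError) as error:
        print("Speech recognition failed:", error)
        result = ""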

    ### Print the transcription for visibility and debugging.
    print(result)

    ### Return the recognized text to the caller so the UI can use it as a filter.
    return result



### Create a named window that will serve as the display target for frames and UI overlays.
cv2.namedWindow("Frame")  # set the same name

### Attach the mouse callback so clicks over the window trigger recording logic.
cv2.setMouseCallback("Frame", recordAudioByMouseClick) 
 

### Start the main application loop to process frames until the user exits.
while True:
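    ### Read a frame from the capture device; rtn indicates success.
    rtn, frame = cap.read()

    ### Stop if the camera did not return a frame (for example, it was disconnected).
    if not rtn:
        break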

    ### Run object detection on the current frame to obtain class IDs, confidence scores, and boxes.
    (class_ids, scores, bboxes) = model.detect(frame)
    ### You can inspect raw results during development if needed.
    # print("Class ids:", class_ids)
    # print("Scores :", scores)
    # print("Bboxes :", bboxes)

    ### Iterate over parallel lists of detections to draw and label regions of interest.
    for class_id, score, bbox in zip(class_ids, scores, bboxes):
        ### Unpack the bounding box as x, y for top-left and width, height for size.
        x, y, width, height = bbox  # x, y is the left upper corner

        ### Retrieve the human-readable class label for the detected object.
        name = classesNames[class_id]

        ### Check if the spoken text contains this class label as a substring (case-insensitive).
        index = LookForThisClassName.lower().find(name)  # find() returns -1 when the text is not found

        ### If filtering is enabled and the label appears in the transcription, highlight the box.
        ### Index 0 is a valid match (the label is the first word spoken), so compare with >= 0.
        if ButtonFlag is True and index >= 0:
            ### Draw a rectangle around the matched detection with a custom color and thickness.
            cv2.rectangle(frame, (x, y), (x + width, y + height), (130, 50, 50), 3)

            ### Put the class name just above the box for readability.
            cv2.putText(frame, name, (x, y - 10), cv2.FONT_HERSHEY_COMPLEX, 1, (120, 50, 50), 2)

    ### Draw a filled UI button prompting the user to click and record a 3-second snippet.
    cv2.rectangle(frame, (x1, y1), (x2, y2), (153, 0, 0), -1)  # -1 draws a filled rectangle

    ### Render readable button text to guide the user interaction flow.
    cv2.putText(frame, "Click to record - 3 seconds", (40, 60), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 255, 255), 2)  # white color

    ### Show the annotated frame in the display window named “Frame”.
    cv2.imshow("Frame", frame)

    ### Allow the user to quit the loop by pressing 'q' on the keyboard.
    if cv2.waitKey(1) == ord('q'):
        break


### Release the camera resource once the loop ends to free the device.
cap.release()

### Destroy any OpenCV windows that were created during execution.
cv2.destroyAllWindows()

You converted the audio to a compatible format, transcribed the text, and ran the live detection loop.
When your speech includes a class label, the app draws a box and label around matching objects, giving you a real-time object detection demo in Python that is filtered by your voice.
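
The loop matches labels by substring, which means a phrase such as "carrot" also contains the label "car". If that becomes a problem, one possible refinement is whole-word matching. The sketch below is an illustration under that assumption. `labels_in_speech` is a hypothetical helper, and the optional "s"/"es" suffix is a simple heuristic for plurals rather than real linguistic handling.

```python
import re


def labels_in_speech(spoken_text, class_names):
    """Return the class names that occur as whole words or phrases in spoken_text."""
    text = spoken_text.lower()
    found = set()
    for name in class_names:
        # \b keeps "car" from matching inside "carrot";
        # (?:s|es)? also accepts simple plurals such as "dogs" or "buses".
        pattern = r"\b" + re.escape(name.lower()) + r"(?:s|es)?\b"
        if re.search(pattern, text):
            found.add(name)
    return found
```

You could compute this set once after each transcription and then test `name in wanted` inside the detection loop, instead of searching the string for every detection on every frame.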

Troubleshooting

Nothing gets highlighted after I speak.
Try a class that’s definitely in view (e.g., person). Confirm your microphone input and check that the transcription printed to the console contains the term you expect.
Slow detections.
Reduce frame size (e.g., 960×540) or switch to a smaller input (320×320) for the DNN.
Permissions / audio errors.
On macOS and Windows, allow microphone access for your terminal/IDE.
Weights/config paths.
Make sure the cfg and weights paths are correct and accessible.
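
For the path problem in particular, a small pre-flight check gives a clearer message than a failure deep inside the model loader. This is a minimal sketch using only the standard library. `missing_files` is a hypothetical helper, and it also treats empty files as missing on the assumption that a zero-byte download is as unusable as an absent one.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point to an existing, non-empty file."""
    missing = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(str(p))
    return missing
```

Call it with the cfg, weights, and classes paths before loading the network, and stop with a printed message if the returned list is not empty.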

FAQs

Is YOLOv4-tiny fast enough for real-time on CPU?
Yes, on many machines at 416×416. If it’s borderline, lower the DNN input size or the frame size.
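
To see where your own machine stands, you can measure the frame rate instead of guessing. The class below is a small sketch of a rolling FPS counter. `FpsMeter` is a hypothetical helper, not part of OpenCV, and the optional `now` argument exists only so the logic can be exercised with fixed timestamps.

```python
import time
from collections import deque


class FpsMeter:
    """Average frames per second over the most recent `window` frames."""

    def __init__(self, window=30):
        self._stamps = deque(maxlen=window)

    def tick(self, now=None):
        """Record one frame and return the FPS estimate (0.0 until two frames exist)."""
        self._stamps.append(time.perf_counter() if now is None else now)
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        return (len(self._stamps) - 1) / elapsed if elapsed > 0 else 0.0
```

Calling `tick()` once per loop iteration and drawing the value on the frame makes it easy to compare input sizes such as 416×416 and 320×320.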

Can I use another recognizer?
Yes — Vosk (offline) or Azure/Google Cloud STT (keys required) are common alternatives.

Connect :

☕ Buy me a coffee — https://ko-fi.com/eranfeit

🖥️ Email : feitgemel@gmail.com

🌐 https://eranfeit.net

🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb

Enjoy,

Eran