How to Extract Text from Scanned Books with Python and OpenCV

OpenCV Word Segmentation

Introduction

This tutorial shows how to perform OpenCV word segmentation in Python to prepare scanned pages for OCR.
You will learn to isolate text with thresholding, smooth noise with Gaussian blur, merge characters into words using dilation, and detect text regions with contours.
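Before the full script, here is a minimal NumPy-only sketch of what the inverse-threshold step computes. The pixel values and the 170 cutoff are illustrative; `cv2.threshold` with `cv2.THRESH_BINARY_INV` performs the same per-pixel test in the real pipeline.

```python
import numpy as np

# Toy 1-D "scanline": ink pixels are dark (low intensity), paper is bright.
gray = np.array([250, 40, 35, 245, 30, 240], dtype=np.uint8)

# Inverse binary threshold at 170: pixels at or below 170 become 255,
# everything brighter becomes 0 -- so dark text flips to white on black.
mask = np.where(gray <= 170, 255, 0).astype(np.uint8)

print(mask.tolist())  # [0, 255, 255, 0, 255, 0]
```

This white-on-black convention matters because dilation and contour detection both treat white pixels as foreground.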


We first segment lines, then segment words within each line, and finally draw bounding boxes on the original page.
This approach is fast, reproducible, and ideal as an OCR preprocessing stage for Tesseract or any OCR engine.
By the end, you will have a full, copy-paste Python script and understand every command that makes it work.
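Dilation is the step that fuses separate characters into word-sized (or line-sized) blobs. The toy function below mimics what `cv2.dilate` does along a single row with a 1 x width kernel; `dilate_1d` is a hypothetical helper written for illustration, not part of OpenCV.

```python
import numpy as np

def dilate_1d(mask, width):
    # A pixel turns white if any pixel within width // 2 columns of it
    # is white -- the 1-D analogue of dilating with a 1 x width kernel.
    out = np.zeros_like(mask)
    r = width // 2
    for i in range(len(mask)):
        lo, hi = max(0, i - r), min(len(mask), i + r + 1)
        out[i] = mask[lo:hi].max()
    return out

# Two "characters" separated by a small gap of background pixels.
row = np.array([255, 255, 0, 0, 255, 255], dtype=np.uint8)

# A kernel wider than the gap bridges it, fusing both characters into
# one connected blob -- exactly why the line kernel (5 x 40) is much
# wider than the word kernel (7 x 7) in the script below.
print(dilate_1d(row, 5).tolist())  # [255, 255, 255, 255, 255, 255]
```

The same reasoning explains the two kernel shapes in the full script: a wide, flat kernel merges whole lines, while a small square kernel merges only adjacent characters.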

Check out our video here : https://youtu.be/c61w6H8pdzs&list=UULFTiWJJhaH6BviSWKLJUM9sg

You can find the code here : https://ko-fi.com/s/d621f2eb2c

You can find more similar tutorials in my blog posts page here : https://eranfeit.net/blog/


Full Python Code with Step-by-Step Explanation

### Import OpenCV for image processing and NumPy for array operations.
import cv2
import numpy as np

### Read the input document image from disk.
### Provide a proper path to a scanned page or book photo.
img = cv2.imread('Open-CV/Words-Segmentation/book2.jpg')

### Define a function that prepares a binary mask where text is white on black.
### Steps: convert to grayscale, invert-binary threshold to flip text to white,
### smooth with Gaussian blur, and re-threshold to clean edges.
def thresholding(image):
    ### Convert the BGR image to grayscale to simplify intensity operations.
    img_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    ### Apply inverse binary threshold so darker text becomes white.
    ### Values at or below 170 turn to 255 (white), others to 0 (black).
    # We need the words in white
    ret, thresh = cv2.threshold(img_gray, 170, 255, cv2.THRESH_BINARY_INV)

    ### Smooth the mask to close small gaps between characters into stronger word blobs.
    # We need to merge the characters into single words
    thresh = cv2.GaussianBlur(thresh, (11, 11), 0)

    ### Re-threshold the blurred mask to get a crisp binary image again.
    ret, thresh = cv2.threshold(thresh, 130, 255, cv2.THRESH_BINARY)

    ### Return the clean binary mask where words are white.
    return thresh

### Create the thresholded mask of the input image for downstream segmentation.
thresh_img = thresholding(img)

### Visualize the binary mask to verify the text is white and the background is black.
cv2.imshow("thresh_img", thresh_img)

### Prepare a container and a rectangular kernel to detect whole text lines.
# line segmentation
linesArray = []
kernelRows = np.ones((5, 40), np.uint8)

### Dilate horizontally to connect characters and words into long line components.
# We will use dilation for the line segmentation
dilated = cv2.dilate(thresh_img, kernelRows, iterations=1)

### Preview the dilated image to see the connected line regions.
cv2.imshow("dilated", dilated)

### Extract external contours, which should correspond to line blobs after dilation.
# find contours
(contoursRows, hierarchy) = cv2.findContours(dilated.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

### Iterate over the detected contours, filter by area, and store bounding boxes for candidate lines.
#cv2.drawContours(img, contoursRows, -1, (0, 255, 0), 2)
# loop over the contours and collect a rectangle for each row
for row in contoursRows:
    ### Compute the contour area to filter out small noise.
    area = cv2.contourArea(row)

    ### Keep only sufficiently large components, which likely represent text lines.
    if area > 500:
        ### Compute the bounding rectangle of the contour.
        x, y, w, h = cv2.boundingRect(row)

        ### Optionally draw the line rectangle on the original image for inspection.
        #cv2.rectangle(img, (x, y), (x + w, y + h), (40, 100, 250), 2)

        ### Save the line bounding box for sorted processing.
        linesArray.append([x, y, w, h])

### Print how many line candidates were found as a quick sanity check.
print(len(linesArray))  # 33 lines

### Sort the lines from top to bottom using the y coordinate.
sortedLinesArray = sorted(linesArray, key=lambda line: line[1])

### Prepare structures for word extraction and indexing.
# words segmentation
lineNumber = 0
allWords = []

### Create a square kernel to connect characters into words without bridging adjacent lines.
# kernel for words
kernelWords = np.ones((7, 7), np.uint8)

### Dilate the thresholded image with the word kernel to merge letters into word-sized blobs.
dilateWordsImg = cv2.dilate(thresh_img, kernelWords, iterations=1)

### Show the dilation result used for word segmentation.
cv2.imshow("dilate Words Img", dilateWordsImg)

### For each detected line, focus on its region and find the word contours inside it.
for line in sortedLinesArray:
    ### Unpack the line rectangle.
    x, y, w, h = line

    ### Crop the region of interest for the current line to avoid cross-line interference.
    roi_line = dilateWordsImg[y:y + h, x:x + w]

    ### Optionally draw the line on the original image for debugging.
    #cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    ### Find external contours within the line ROI; they correspond to the merged words.
    (contoursWords, hierarchy) = cv2.findContours(roi_line.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    ### Collect this line's word boxes in a fresh list so lines do not mix.
    words = []

    ### For each word contour, compute its bounding box and draw it on the original image.
    for word in contoursWords:
        x1, y1, w1, h1 = cv2.boundingRect(word)

        ### Draw the word rectangle in yellow to visualize segmentation quality.
        cv2.rectangle(img, (x + x1, y + y1), (x + x1 + w1, y + y1 + h1), (255, 255, 0), 2)

        ### Save absolute coordinates for later sorting and indexing.
        words.append([x + x1, y + y1, x + x1 + w1, y + y1 + h1])

    ### Sort this line's words by the left x coordinate so reading order is preserved.
    # sort the words by the X position
    sortedWords = sorted(words, key=lambda word: word[0])  # (x1, y1, x2, y2)

    ### Build a list of (line_index, word_bbox) pairs representing the reading order.
    # build a full array of lines and words
    for word in sortedWords:
        allWords.append((lineNumber, word))

    ### Increment the line index for the next iteration.
    lineNumber = lineNumber + 1

### Optionally print the structure for debugging or downstream processing.
#print(allWords)

### Select an example word box from the combined list and print its coordinates.
# show the fourth word in reading order
chooseWord = allWords[3][1]
print(chooseWord)

### Extract the chosen word as a ROI from the original image to verify segmentation.
roiWord = img[chooseWord[1]:chooseWord[3], chooseWord[0]:chooseWord[2]]
cv2.imshow("Show a word", roiWord)

### Show the original image with all drawn word rectangles for final inspection.
cv2.imshow("Show the words", img)

### Save the visualized segmentation to disk for documentation or OCR handoff.
cv2.imwrite("c:/temp/segmentedBook.png", img)

### Wait for a key press to keep the windows open, then close all GUI windows cleanly.
cv2.waitKey(0)
cv2.destroyAllWindows()
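The (line index, bounding box) pairs built by the script encode reading order: lines top to bottom, words left to right within each line. The same idea can be sketched standalone, without OpenCV; the boxes below are made up and `reading_order` is a hypothetical helper, not part of the script above.

```python
# Word boxes as (x1, y1, x2, y2) in absolute page coordinates.
boxes = [(200, 10, 260, 30), (20, 10, 90, 30), (110, 10, 180, 30),
         (25, 50, 95, 70), (120, 50, 190, 70)]

def reading_order(boxes, tol=15):
    # Group boxes whose top edges lie within `tol` pixels into one line,
    # then emit each line's boxes sorted by their left x coordinate.
    lines = {}
    for b in sorted(boxes, key=lambda b: b[1]):
        key = next((k for k in lines if abs(k - b[1]) <= tol), b[1])
        lines.setdefault(key, []).append(b)
    ordered = []
    for y in sorted(lines):
        ordered.extend(sorted(lines[y], key=lambda b: b[0]))
    return ordered

print(reading_order(boxes))
# First line left to right, then the second line left to right.
```

Cropping the boxes in this order and feeding each crop to an OCR engine yields text in natural reading sequence.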

You converted the image to a clean binary mask, merged lines and words via dilation, extracted contours, and drew bounding boxes.
This is a practical, fast baseline for OpenCV word segmentation that boosts OCR accuracy with minimal dependencies.

Connect :

☕ Buy me a coffee — https://ko-fi.com/eranfeit

🖥️ Email : feitgemel@gmail.com

🌐 https://eranfeit.net

🤝 Fiverr : https://www.fiverr.com/s/mB3Pbb

Enjoy,

Eran
