In this tutorial, we integrate object detection with YOLO and Optical Character Recognition (OCR) using Tesseract. The goal is to identify specific regions in images (e.g., invoices) with YOLO and extract text from those regions with OCR. We'll go step by step, from setting up the environment to implementing the solution.
To start, ensure you have the necessary libraries and tools installed:
- Install YOLOv8: YOLOv8 is part of the `ultralytics` package, which provides powerful tools for object detection and segmentation.

  ```shell
  pip install ultralytics
  ```

- Install OpenCV: OpenCV is used for image processing.

  ```shell
  pip install opencv-python
  ```

- Install Pytesseract: Pytesseract acts as a Python wrapper for Tesseract OCR.

  ```shell
  pip install pytesseract
  ```

- Install Tesseract OCR: Download and install Tesseract OCR from the official repository if you're on Windows, or use your package manager on Linux:

  ```shell
  sudo apt install tesseract-ocr  # For Ubuntu
  ```

  On Windows, add the Tesseract installation path (e.g., `C:\Program Files\Tesseract-OCR`) to your system's PATH.
Assume you've trained a YOLO model to detect regions of interest in your images, such as specific fields in invoices.
Here’s how to load the trained YOLO model:
```python
from ultralytics import YOLO

# Load the best-trained YOLO model
model = YOLO('path/to/your/best.pt')
```
The model is now ready to detect objects in images.
Define a function to crop detected regions and extract text using Tesseract OCR.
```python
import cv2
import pytesseract

def perform_ocr(image, detections):
    """
    Perform OCR on cropped regions from the detected bounding boxes
    and report the class name for each region.
    """
    for detection in detections:
        # Extract bounding box and class name
        x1, y1, x2, y2, class_name = detection
        cropped_image = image[int(y1):int(y2), int(x1):int(x2)]

        # Preprocess for better OCR results
        gray = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

        # Perform OCR using Tesseract
        text = pytesseract.image_to_string(binary, lang='eng')
        print(f"Class '{class_name}' detected: {text.strip()}")

        # Optional: display the cropped region
        cv2.imshow(f"Region - {class_name}", binary)
        cv2.waitKey(0)
    cv2.destroyAllWindows()
```
This function:
- Crops detected regions.
- Converts them to grayscale and applies thresholding for better OCR results.
- Extracts text using Tesseract.
Process each image in a folder, run the YOLO model, and apply the OCR function.
```python
import os

# Path to images folder
images_folder = 'path/to/images/folder'

# Iterate through images
for file_name in os.listdir(images_folder):
    if file_name.lower().endswith(('.jpg', '.jpeg', '.png')):  # Check for image files
        file_path = os.path.join(images_folder, file_name)

        # Load the image
        image = cv2.imread(file_path)

        # Run the YOLO model
        results = model(file_path)

        # Extract bounding boxes and class names
        detections = []
        for box in results[0].boxes.data.tolist():
            x1, y1, x2, y2, conf, cls = box[:6]
            class_name = results[0].names[int(cls)]  # Map class index to class name
            detections.append((x1, y1, x2, y2, class_name))
        print(f"Detected objects for {file_name}: {detections}")

        # Perform OCR
        perform_ocr(image, detections)
```
This script:
- Iterates over all image files in a directory.
- Detects objects using YOLO.
- Passes detected regions to the OCR function.
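Note that the loop above reads each box's confidence score (`conf`) but then discards it. A simple extension is to drop low-confidence detections before they reach the OCR step. Here is a minimal sketch; the threshold value, the sample boxes, and the `filter_detections` helper are illustrative assumptions, not part of the pipeline above:

```python
# Hypothetical raw YOLO boxes in (x1, y1, x2, y2, conf, cls) form,
# as returned by results[0].boxes.data.tolist()
raw_boxes = [
    (50, 100, 200, 150, 0.91, 0),
    (300, 400, 500, 450, 0.88, 1),
    (10, 10, 40, 30, 0.23, 0),  # low-confidence detection
]
names = {0: 'invoice_number', 1: 'date'}

CONF_THRESHOLD = 0.5  # tune for your dataset

def filter_detections(boxes, names, threshold=CONF_THRESHOLD):
    """Keep only boxes above the confidence threshold, mapped to class names."""
    return [
        (x1, y1, x2, y2, names[int(cls)])
        for x1, y1, x2, y2, conf, cls in boxes
        if conf >= threshold
    ]

print(filter_detections(raw_boxes, names))
```

Filtering this way avoids wasting OCR time on spurious boxes that would mostly yield garbage text.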
Preprocessing is crucial to improve OCR accuracy. Use techniques like:
- Grayscale Conversion: converts the image to grayscale for easier text recognition.

  ```python
  gray = cv2.cvtColor(cropped_image, cv2.COLOR_BGR2GRAY)
  ```

- Thresholding: enhances text visibility by binarizing the image.

  ```python
  _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)
  ```
Specify the language in Tesseract using the `lang` parameter (e.g., English: `lang='eng'`).
For each image, the script:
- Detects objects (e.g., `invoice_number`, `date`, etc.).
- Extracts text from the detected regions.
- Displays the extracted text along with the class name.
For an invoice image with detected regions, the output looks like:

```
Detected objects for invoice1.jpg: [(50, 100, 200, 150, 'invoice_number'), (300, 400, 500, 450, 'date')]
Class 'invoice_number' detected: INV-2024-00123
Class 'date' detected: 11/25/2024
```
This step-by-step pipeline combines the power of YOLO for object detection with Tesseract for OCR. The solution can be applied to various use cases, such as:
- Automating data extraction from invoices, receipts, or documents.
- Analyzing text within detected regions in images.
The flexibility of YOLO and Tesseract ensures this pipeline can adapt to diverse applications. Experiment with different preprocessing techniques and model configurations to optimize performance for your specific dataset.
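Printing results to the console is fine for a demo, but automating data extraction usually means producing structured output. Here is a minimal sketch that collects extracted fields into rows and writes a CSV; the output file name and the sample values are illustrative assumptions:

```python
import csv

# Hypothetical per-image OCR results (class name -> extracted text),
# as a pipeline like the one above could accumulate instead of printing
results = {
    'invoice1.jpg': {'invoice_number': 'INV-2024-00123', 'date': '11/25/2024'},
    'invoice2.jpg': {'invoice_number': 'INV-2024-00124', 'date': '11/26/2024'},
}

# Flatten into one row per extracted field and write a CSV
with open('extracted_fields.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['image', 'field', 'text'])
    for image_name, fields in results.items():
        for field, text in fields.items():
            writer.writerow([image_name, field, text])
```

From here, downstream tools (spreadsheets, databases, pandas) can consume the extracted fields directly.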