User Projects – 3rdParty

GUIs and Other Projects using Tesseract OCR

GUI

Name	Linux	Mac	Windows	License	Description
Free-Ocr-Windows-Desktop			X	GNU AGPL v3	Free OCR application for the Windows Desktop - Essentially a graphical user interface (GUI) for the Tesseract OCR engine. The application also includes support for reading and OCR'ing PDF files
YAGF	X			GPL v3	A graphical front-end for cuneiform and tesseract
gImageReader	X		X	GPL v3	A graphical GTK frontend to tesseract-ocr
SunnyPage OCR			X	Proprietary	A GUI frontend for Tesseract OCR engine with automatic adjustment of image brightness, image processing and PDF support.
VietOCR	X	X	X	Apache 2.0	A GUI frontend for Tesseract OCR engine. Supports optical character recognition for Vietnamese and other languages supported by Tesseract
OCRFeeder	X			GPL v3	OCRFeeder is a document layout analysis and optical character recognition system
PDF OCR X		X	X	Proprietary	PDF OCR is a simple drag-and-drop utility for Mac OS X and Windows, that converts your PDFs and images into text documents or searchable PDF files
Lector	X		X	GPL v2	A graphical OCR solution for GNU/Linux based on Python, Qt4 and Tesseract OCR
Tesseract-OCR QT4 gui	X			Apache 2.0	Tesseract-OCR QT4 gui is a simple GUI for tesseract
Lime OCR			X	GPL v3	A simple, free OCR software for Windows using tesseract-ocr engine
Ocrivist	X			GPL v3	Ocrivist is a utility which makes it possible to scan and OCR books and other printed documents to PDF or Djvu format
Tesseract-GUI	X			GPL v2	Tesseract-GUI is not a front-end for tesseract-ocr; it is just a graphical way to use it with simple image manipulation through ImageMagick
QTesseract	X			LGPL v3	QT GUI for the Tesseract OCR
TessOCR(KISI)		X		Apache 2.0	A free OCR tool
pmOCR	X			BSD	Batch OCR tool that can also run OCR with tesseract on file monitor events
tesseract4java	X	X	X	GPLv3	A cross-platform GUI for training and running Tesseract with advanced features like batch recognition and accuracy evaluation

Online OCR services

OCR.net: Powered by PDF OCR X in the back end. Converts PDFs and images to text or searchable PDF.
WeOCR: A platform for Web-enabled OCR (Optical Character Reader/Recognition) systems that enables people to use character recognition over networks
CustomOCR
Free OCR
i2OCR

Mobile

Android:
- tess-two - A fork of Tesseract Tools for Android (tesseract-android-tools) that adds some additional functions.
- textfairy - Android OCR app with source code at github.com
- Character Recognition - Android OCR app with source code at gitorious.org
- tesseract-android-tools: set of Android APIs
- Mobile OCR: The goal of Mobile OCR is to create an application for the Android platform that will recognize text from an image taken by the phone's camera. The application will be fully accessible to low vision and blind users
- Across India: An app which lets users take pictures of sign boards in Indian languages or English and transliterate them to the language that they can read.
iOS:
- Tesseract-OCR-iOS - Tesseract OCR iOS is a Framework for iOS7+, compiled also for armv7s and arm64.
- OCR-iOS-Example - a simple example of how to do optical character recognition (OCR) on iOS.
- Tesseract-iPhone-Demo - example based on tesseract 2.04.
More OS:
- ScanBizCards: Mobile solution for business card scanning. Requirements: iPhone 4/iPhone 3/Android 2.0

Others

ocr-fileformat - Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)
Tess4J - A Java JNA wrapper for Tesseract OCR API.
Traineddata inspector - to inspect some of the internals of traineddata files
TopOCR - high-quality OCR for cameras with tesseract-ocr support (paid product)
Simple OCR Web Server using python, flask, tesseract-ocr, and leptonica
Display OCR uses OpenCV-Python and python-tesseract for real-time image preprocessing and OCR of 7-segment fonts.
OpenOCR makes it simple to host your own OCR REST API.
tesseract-web-service (https://github.com/guitarmind/tesseract-web-service) is an implementation of a RESTful web service for tesseract-OCR using tornado
RasterEdge .NET Image SDK - OCR Recognition is a robust, high-performance recognition application with royalty-free distribution for desktop or server applications.
DevScope OCR SDK is an Optical Character Recognition toolkit engine based on Tesseract OCR v3 for developing applications with the Microsoft .NET framework
Paperwork - using OCR to grep dead trees the easy way (requires pyocr)
Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments
gscan2pdf a GUI to produce PDFs or DjVus from scanned documents
Audiveris is an open-source Optical Music Recognition software which processes the image of a music sheet to automatically provide symbolic music information in MusicXML standard.
Ocrivist is a utility which makes it possible to scan and OCR books and other printed documents to PDF or Djvu format.
thu-ipv6-login a python script for IPv6 authentication in Tsinghua University with support for OCR of authcode
Wolfram Mathematica 9.0 uses tesseract for recognizing text
node-dv is a node.js library for processing and understanding scanned documents
hocr-tools - Python tools for manipulating and evaluating the hOCR format, which represents multi-lingual OCR results by embedding them into HTML. They include the hocr-pdf tool for creating searchable PDFs.
PyPDFOCR - Tesseract-OCR based PDF filing
OCRmyPDF - adds OCR text layer to scanned PDF files, allowing them to be searched. Places OCRed text accurately below the image to ease copy/paste. Keeps exact resolution of original embedded images or, if requested, oversamples the images before OCRing so as to get better results. If requested, deskews and/or cleans the image before performing OCR. Validates generated file against PDF/A-1b specification using JHOVE. Debug mode for easy verification of OCR results. Processes pages in parallel on multi-core CPUs.
ChronoScan is a complete suite for document Scanning & Data Entry
speedy-ocr - a utility that simplifies scanning and OCR, focused on helping the blind and visually impaired community. It is part of the Vinux project.
Project VIRAL Varico Invoice Recognition with Assisted Learning
Bindery: A simple GUI for binding post processed scanned pages into digital documents
Clarify: Clarify helps you OCR 'image-only' PDFs. Your input is a PDF that you normally cannot extract text from. The output is text. Clarify is a python module that wraps up tesseract-ocr, xpdf and netpbm. Requirements: python, tesseract-ocr, xpdf, netpbm
hOcr2Pdf.NET: hOcr2Pdf.NET is a library that programmers can use to create highly compressed, searchable PDFs for applications. Requirements: .NET 2.0 or higher, Tesseract 3.0, JBig2.exe
PDFBeads: convert scanned images to a single searchable PDF file based on hOCR files. Requirements: ruby, RMagick, hpricot
ExactImage/hocr2pdf: creates a Searchable PDF from hOCR input. Requirements: libagg
HocrConverter: creates PDFs and plain text from hOCR documents. Requirements: python, reportlab
HocrToPdf.java: java source for very basic hOCR to PDF converter. Compiled version can be found at project modi2hocr. Requirements: java, jericho, iText2
hOcr2Pdf.NET: a .NET library that converts .hocr HTML produced by Tesseract or Cuneiform into searchable PDFs using HtmlAgilityPack and iTextSharp. Requirements: C#.
Tally-Ho: Tally-Ho is a screen reader intended for sites like Google Books
Mayan EDMS: Document management system with tesseract as its base
Olena: a generic and efficient image processing platform (tesseract is used in its part called scribo)
ocrodjvu is a wrapper for OCR systems, that allows you to perform OCR on DjVu files
PaRADIIT (Pattern Redundancy Analysis for Document Image Indexation & Transcription) is a project initiated and sponsored by 2 successive Google DH awards. It aims to turn ancient books, especially from the Renaissance, into accessible digital libraries.
The ISRI Analytic Tools consist of 17 tools for measuring the performance of and experimenting with OCR output.
pdf2pdfocr is a tool to OCR a PDF (or supported images) and add a text layer in the original file making it a searchable PDF. It is a python script that uses tesseract and other open source tools. Linux, macOS and Windows supported.

IMPACT related

IMPACT project
IMPACT Centre - a not-for-profit organisation founded to sustain IMPACT outcomes and foster community building
IMPACT data
IMPACT tools
Results of the IMPACT project by PSNC Digital Libraries Team
Virtual Transcription Laboratory by PSNC
IMPACT Interoperability Framework - interoperability layer supporting the loose coupling of software components developed during the IMPACT project.
Inventory-Extraction-Tool Prototype is a prototype with graphical user interface (GUI) that allows for the extraction of a complete list of characters from a document, without reference to a specific language dictionary or a library of fonts.
Post Correction Tool provides interactive post-correction of OCRed documents. Using the information obtained by the Text and Error Profiler, the whole correction process adapts to the document being processed. In this way, huge numbers of systematic errors can usually be corrected with just a few keystrokes.
OCR evaluation tool.
BlackLab is a corpus retrieval engine built on top of Apache Lucene. It allows fast, complex searches with accurate hit highlighting on large, tagged and annotated bodies of text. It was developed at the Institute of Dutch Lexicology (INL) to provide a fast and feature-rich search interface on its historical and contemporary text corpora.

For more information about the IMPACT project, see the discussion in the tesseract forum.

Old wiki - no longer maintained. The pages were moved, see the new documentation.

As of 02/02/2020

These wiki pages are no longer maintained.

All pages were moved to tesseract-ocr/tessdoc.

The latest documentation is available at https://tesseract-ocr.github.io/.
