c-mauderer/HocrConverter

 
 


HocrConverter

Create PDFs and plain text from hOCR documents

Originally from jbrinley. See https://github.com/jbrinley/HocrConverter and http://xplus3.net/2009/04/02/convert-hocr-to-pdf/

Changes by C.Holtermann

The original script didn't work for me, so I made some changes to get it running.

My configuration is ocropus 0.7 and tesseract 3.02.02.

I included some aspects from the fork at https://github.com/zw/HocrConverter:

Some command line arguments:

  • draw bounding boxes
  • draw text
  • invert height (tesseract and ocropus count differently)
  • multiple pages
  • include images (from hOCR or via command line)
  • verbosity
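The bounding box and height options deal with coordinates. As a rough sketch of the idea: hOCR stores each box in a `title` attribute as `bbox x0 y0 x1 y1`, measured from the top-left corner of the page, while PDF user space is measured from the bottom-left. The helper names `parse_bbox` and `flip_vertical` are made up for this illustration and are not taken from the script; whether the height option does exactly this is an assumption.

```python
import re

# hOCR stores each box in the element's title attribute, e.g. "bbox 100 50 300 80".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def flip_vertical(bbox, page_height):
    """Convert a box measured from the top edge into one measured from the bottom edge."""
    x0, y0, x1, y1 = bbox
    # The old bottom edge (y1) becomes the new lower bound, and vice versa.
    return (x0, page_height - y1, x1, page_height - y0)


box = parse_bbox("bbox 100 50 300 80; x_wconf 93")
print(box)                       # (100, 50, 300, 80)
print(flip_vertical(box, 1000))  # (100, 920, 300, 950)
```

The page height must be in the same units as the box (pixels in hOCR), so any scaling to PDF points would have to happen consistently for both.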

For command line parsing and validation I use some external libraries:

  • docopt
  • schema

As it stands, the script is mainly something for understanding the concept.

Maybe it's useful for others trying to understand OCR.

Changes by tristelune1

  • this script is for Python 3
  • text is searched recursively in span tags
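To illustrate what a recursive search through nested span tags can look like, here is a minimal sketch using only the standard library. The function name `collect_text` and the sample markup are assumptions made for this example; the script itself may walk the tree differently.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Gather the text of an element and all nested children, depth first."""
    parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.extend(collect_text(child))
        # Text that follows a child's closing tag belongs to the parent.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


# Hypothetical hOCR fragment: a line span with word spans, one wrapping <strong>.
SAMPLE = (
    '<span class="ocr_line" title="bbox 10 10 200 30">'
    '<span class="ocrx_word" title="bbox 10 10 80 30">Hello</span> '
    '<span class="ocrx_word" title="bbox 90 10 200 30"><strong>hOCR</strong></span>'
    '</span>'
)

line = ET.fromstring(SAMPLE)
print(" ".join(collect_text(line)))  # Hello hOCR
```

Recursing instead of reading only the direct text of a span means words wrapped in extra markup, such as `<strong>` or `<em>`, are still picked up.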

Work in progress.
