OCRmyPDF with img2pdf docker image (minidocks/ocrmypdf)

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.

Utilities

img2pdf is a tool for lossless conversion of raster images to PDF.
pdfminer is a tool for extracting information from PDF documents.
hocr tools

Usage

OCRmyPDF requires ghostscript, tesseract and unpaper. Each of these tools runs in its own container, so the containers must be connected via SSH. The easiest solution is to use docker compose.

Create a file compose.yaml with the following content:

x-base: &base
  volumes:
    - .:/app
    - ./tmp:/tmp
  working_dir: /app
  command: sshd

services:
  ocrmypdf:
    <<: *base
    image: minidocks/ocrmypdf
    links:
      - tesseract
      - unpaper
      - gs
    environment:
      ALIAS_TESSERACT: ssh tesseract tesseract
      ALIAS_UNPAPER: ssh unpaper unpaper
      ALIAS_GS: ssh gs gs

  gs:
    <<: *base
    image: minidocks/ghostscript

  tesseract:
    <<: *base
    image: minidocks/tesseract:4-eng
    environment:
      OMP_THREAD_LIMIT: 1

  unpaper:
    <<: *base
    image: minidocks/unpaper

Then, in the same directory, run:

docker compose run --rm ocrmypdf -j 2 -l eng --tesseract-pagesegmode 3 input.pdf output.pdf
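
To process several files, the command above can be wrapped in a small shell loop. The following is only a sketch: the function name ocr_all, the .ocr.pdf output suffix and the DOCKER override variable are assumptions made for illustration and are not part of the image. Input files are expected to sit in the current directory, since that is what compose.yaml mounts at /app.

```shell
#!/bin/sh
# Hypothetical helper (not provided by minidocks/ocrmypdf): OCR every PDF
# passed as an argument and write NAME.ocr.pdf next to each input.
# DOCKER is an assumed override so the loop can be dry-run with DOCKER=echo.
ocr_all() {
  for input in "$@"; do
    # Strip a trailing .pdf and append the illustrative .ocr.pdf suffix.
    output="${input%.pdf}.ocr.pdf"
    # Same flags as the single-file command; stop at the first failure.
    "${DOCKER:-docker}" compose run --rm ocrmypdf \
      -j 2 -l eng --tesseract-pagesegmode 3 "$input" "$output" || return 1
  done
}
```

After sourcing the function, ocr_all scan1.pdf scan2.pdf runs one container per file. Running DOCKER=echo ocr_all scan1.pdf prints the compose arguments instead of executing them, which is a cheap way to check the generated file names first.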
