OCRmyPDF with img2pdf docker image (minidocks/ocrmypdf)
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched.
- img2pdf is a tool for lossless conversion of raster images to PDF.
- pdfminer is a tool for extracting information from PDF documents.
- hocr-tools is a set of tools for working with OCR results in the hOCR format.
OCRmyPDF requires ghostscript, tesseract and unpaper. Each of them runs in its own container, so we must connect the containers via the SSH protocol. The easiest solution is to use Docker Compose.
Create a file named compose.yaml with the following content:

    x-base: &base
      volumes:
        - .:/app
        - ./tmp:/tmp
      working_dir: /app
      command: sshd
    services:
      ocrmypdf:
        <<: *base
        image: minidocks/ocrmypdf
        links:
          - tesseract
          - unpaper
          - gs
        environment:
          ALIAS_TESSERACT: ssh tesseract tesseract
          ALIAS_UNPAPER: ssh unpaper unpaper
          ALIAS_GS: ssh gs gs
      gs:
        <<: *base
        image: minidocks/ghostscript
      tesseract:
        <<: *base
        image: minidocks/tesseract:4-eng
        environment:
          OMP_THREAD_LIMIT: 1
      unpaper:
        <<: *base
        image: minidocks/unpaper

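Because the file relies on a YAML anchor (`&base`) merged into every service, it can be worth checking that Compose parses it as intended before the first run. The following is a minimal sketch; `check_compose` is a hypothetical helper name, and it assumes a Docker Compose version that provides the `config` subcommand.

```shell
# Hypothetical helper: validate compose.yaml in the current directory.
check_compose() {
    # Parse and validate the file; print nothing unless there is an error.
    docker compose config --quiet || return 1
    # List the service names Compose found in the file.
    docker compose config --services
}
```

Calling `check_compose` in the directory that holds compose.yaml should list the four services defined above. Running `docker compose config` without options prints the fully resolved file, which shows what each `<<: *base` merge expands to.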
Then, in the same directory, run:

    docker compose run --rm ocrmypdf -j 2 -l eng --tesseract-pagesegmode 3 input.pdf output.pdf
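To process several files, the same command can be wrapped in a loop. This is only a sketch: `ocr_all`, the `DRY_RUN` variable and the `ocr` output directory are hypothetical names chosen for illustration, and the OCR options are copied from the command above.

```shell
# Hypothetical helper: OCR every PDF in a directory via the compose service.
# Usage: ocr_all [source_dir] [output_dir]
ocr_all() {
    src_dir=${1:-.}
    out_dir=${2:-ocr}
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when no PDF matches.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf")
        if [ -n "${DRY_RUN:-}" ]; then
            # Print the command instead of running it.
            echo docker compose run --rm ocrmypdf -j 2 -l eng \
                --tesseract-pagesegmode 3 "$pdf" "$out_dir/$name"
        else
            # Stop at the first file that fails.
            docker compose run --rm ocrmypdf -j 2 -l eng \
                --tesseract-pagesegmode 3 "$pdf" "$out_dir/$name" || return 1
        fi
    done
}
```

For example, `DRY_RUN=1 ocr_all . ocr` prints the commands without running them. Both directories should be relative paths below the directory that holds compose.yaml, because that directory is what the containers see as /app.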
Tag | Size |
---|---|
latest |