dhdaines/benchmarks

 
 

PDF Library Benchmarks

This benchmark is about reading pure PDF files, not scanned documents and not documents to which OCR has been applied.

Benchmarking machine

Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz

Input Documents

# Name File Size Pages
1 2201.00214 2.4MiB 22
2 GeoTopo-book 5.1MiB 117
3 2201.00151 1.5MiB 12
4 1707.09725 7.0MiB 134
5 2201.00021 2.6MiB 10
6 2201.00037 2.9MiB 33
7 2201.00069 14.7MiB 15
8 2201.00178 2.3MiB 16
9 2201.00201 1.3MiB 9
10 1602.06541 2.9MiB 16
11 2201.00200 284.8KiB 7
12 2201.00022 1.2MiB 14
13 2201.00029 797.6KiB 12
14 1601.03642 1004.9KiB 8

Libraries

| Name | Last PyPI Release | License | Version | Dependencies |
| --- | --- | --- | --- | --- |
| Borb | 2024-08-03 | AGPL/Commercial | 2.1.16 | |
| pypdfium2 | 2024-12-19 | Apache-2.0 or BSD-3-Clause | 4.30.1 | PDFium (Foxit/Google) |
| pdfminer.six | 2024-07-06 | MIT/X | 20231228 | |
| pdfplumber | 2025-01-01 | MIT | 0.11.5 | pdfminer.six |
| pdfrw | 2017-09-18 | MIT | 0.4 | |
| pdftotext | - | GPL | 0.86.1 | build-essential libpoppler-cpp-dev pkg-config python3-dev |
| playa | 2025-02-20 | MIT | 0.3.0 | |
| PyMuPDF | 2025-02-06 | GNU AFFERO GPL 3.0 / Commercial | 1.25.3 | MuPDF |
| pypdf | 2025-02-09 | BSD 3-Clause | 5.3.0 | |
| Tika | 2023-01-01 | Apache v2 | 2.6.0 | Apache Tika |

Text Extraction Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pypdfium2 0.1s 0.8s 0.3s 0.2s 0.2s 0.0s 0.1s 0.1s 0.1s 0.0s 0.1s 0.0s 0.1s 0.0s 0.0s
2 PyMuPDF 0.2s 1.3s 0.4s 0.7s 0.3s 0.1s 0.2s 0.1s 0.1s 0.1s 0.1s 0.1s 0.1s 0.0s 0.0s
3 pdftotext 0.3s 1.0s 1.1s 0.3s 0.8s 0.1s 0.3s 0.2s 0.1s 0.1s 0.1s 0.1s 0.1s 0.0s 0.1s
4 playa 2.5s 17.2s 5.3s 4.4s 2.2s 0.7s 1.1s 0.6s 0.6s 0.4s 0.7s 0.5s 0.6s 0.4s 0.2s
5 pypdf 4.1s 28.7s 8.1s 8.1s 3.9s 1.2s 2.0s 0.8s 1.0s 0.8s 1.0s 0.9s 0.8s 0.6s 0.4s
6 pdfminer.six 9.0s 55.9s 23.7s 16.8s 8.9s 2.3s 4.0s 1.8s 2.2s 1.5s 2.7s 1.8s 2.0s 1.1s 0.9s
7 pdfplumber 13.0s 86.4s 22.7s 23.4s 14.2s 4.2s 7.1s 3.3s 3.2s 2.9s 4.4s 3.3s 3.5s 1.9s 1.7s
8 Tika 24.4s 17.8s 100.1s 0.6s 23.4s 47.3s 48.3s 31.5s 34.5s 0.1s 13.2s 0.1s 24.2s 0.1s 0.1s
9 Borb 50.5s 188.4s 149.1s 2.3s 113.6s 28.4s 11.7s 112.3s 23.7s 27.1s 8.4s 5.7s 27.7s 4.9s 2.9s
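
The table above reports per-document wall-clock times and their average, but this page does not show how they were measured. As a rough illustration only, a timing harness along the following lines could produce numbers of that shape. The function names `time_extraction`, `average_seconds` and `extract_with_pypdf` are hypothetical and are not taken from the benchmark code; the pypdf wrapper is just one example of an extractor that could be plugged in.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def time_extraction(extract: Callable[[str], str], paths: Iterable[str]) -> Dict[str, float]:
    """Return the wall-clock seconds `extract` spends on each path."""
    timings: Dict[str, float] = {}
    for path in paths:
        start = time.perf_counter()
        extract(path)  # the extracted text is discarded; only the time matters here
        timings[path] = time.perf_counter() - start
    return timings


def average_seconds(timings: Dict[str, float]) -> float:
    """Arithmetic mean over all documents, as in an 'Average' column."""
    return mean(timings.values())


def extract_with_pypdf(path: str) -> str:
    """Example extractor (assumption: pypdf is installed)."""
    # Imported lazily so the harness itself needs only the standard library.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() for page in reader.pages)
```

Calling `time_extraction(extract_with_pypdf, ["2201.00214.pdf"])` would time a single document; the file name here is a placeholder. A single run per document is noisy, so a real harness would probably repeat each measurement.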

Image Extraction Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 PyMuPDF 0.6s 0.3s 0.7s 0.0s 2.2s 0.6s 0.0s 3.3s 0.5s 0.5s 0.1s 0.0s 0.4s 0.3s 0.0s
2 pypdfium2 1.3s 1.5s 2.3s 0.0s 4.3s 1.2s 0.2s 5.7s 0.9s 0.9s 0.3s 0.1s 0.7s 0.3s 0.0s
3 pypdf 5.2s 24.6s 7.0s 6.6s 18.9s 1.7s 0.7s 7.6s 1.5s 1.5s 0.9s 0.2s 1.3s 0.3s 0.2s
4 pdfminer.six 12.3s 69.2s 24.6s 20.6s 36.6s 2.6s 4.1s 2.4s 2.3s 1.5s 2.7s 2.0s 2.1s 1.1s 0.9s

Watermarking Speed

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pdfrw 0.1s 0.1s 0.5s 0.1s 0.4s 0.1s 0.1s 0.2s 0.1s 0.1s 0.1s 0.1s 0.2s 0.0s 0.0s
2 PyMuPDF 0.2s 0.5s 0.7s 0.2s 0.5s 0.1s 0.1s 0.1s 0.1s 0.1s 0.1s 0.0s 0.1s 0.0s 0.0s
3 pypdf 0.6s 0.7s 2.3s 0.5s 1.7s 0.3s 0.4s 0.5s 0.4s 0.2s 0.5s 0.2s 0.6s 0.1s 0.1s

Watermarking File Size

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pypdf 3.4MB 2.5MB 5.6MB 1.6MB 7.2MB 2.7MB 3.1MB 15.4MB 2.4MB 1.3MB 3.0MB 0.3MB 1.2MB 0.8MB 1.0MB
2 pdfrw 3.5MB 2.5MB 5.7MB 1.6MB 7.3MB 2.7MB 3.1MB 15.4MB 2.4MB 1.3MB 3.0MB 0.3MB 1.2MB 0.8MB 1.0MB
3 PyMuPDF 3.7MB 2.7MB 6.9MB 1.7MB 8.5MB 2.8MB 3.4MB 15.5MB 2.5MB 1.4MB 3.2MB 0.3MB 1.3MB 0.9MB 1.1MB

Text Extraction Quality

# Library Average 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 pypdfium2 97% 99% 97% 94% 99% 98% 96% 99% 99% 99% 99% 98% 78% 99% 99%
2 pypdf 96% 99% 95% 93% 98% 99% 96% 97% 99% 99% 99% 99% 78% 100% 99%
3 PyMuPDF 96% 98% 96% 93% 97% 98% 95% 99% 98% 98% 98% 97% 77% 98% 99%
4 playa 96% 98% 93% 93% 98% 98% 95% 97% 97% 98% 99% 98% 77% 96% 99%
5 pdfplumber 93% 96% 89% 89% 98% 92% 94% 93% 95% 93% 97% 94% 76% 99% 98%
6 pdftotext 92% 96% 94% 91% 95% 92% 96% 96% 96% 97% 83% 94% 77% 96% 79%
7 pdfminer.six 89% 95% 79% 86% 92% 86% 93% 95% 93% 92% 92% 93% 71% 98% 86%
8 Tika 83% 99% 0% 92% 95% 77% 86% 81% 82% 98% 88% 98% 67% 98% 96%
9 Borb 45% 70% 79% 0% 40% 48% 92% 0% 64% 51% 41% 55% 41% 0% 53%
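
This page does not state how the quality percentages were computed. One plausible approach, shown here purely as an assumption, is to compare each library's output with a hand-checked ground-truth text and report a string similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library; the benchmark itself may use a different metric, and `normalize` and `extraction_quality` are hypothetical names.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are not penalized."""
    return " ".join(text.split())


def extraction_quality(extracted: str, ground_truth: str) -> float:
    """Similarity between extracted and reference text, as a percentage."""
    a = normalize(extracted)
    b = normalize(ground_truth)
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long sequences, which would distort scores on full documents.
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Under this metric, an extractor that returns nothing for a non-empty document scores 0%, which would be consistent with the 0% entries in the table. `SequenceMatcher` can be slow on long texts, so a dedicated edit-distance library may be preferable for whole books.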
