Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Rewrite PSBaseParser and add an optimized in-memory version #1041

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dhdaines
Copy link
Contributor

@dhdaines dhdaines commented Sep 19, 2024

The buffering in PSBaseParser (and associated code contortions like the mess that was PDFContentParser) was fragile but also unnecesary when doing file I/O - cPython's BufferedReader implementation is considerably faster, so we can just reimplement the "parser" (really a lexer) as a character-based state machine.

But this actually would lead to an overall slowdown, because in reality, most of the time we aren't parsing PDF data from a buffered file, but from a BytesIO wrapped around an in-memory buffer. In this case, the buffering is redundant but nonetheless faster in practice since it avoids the overhead of calling BytesIO.read repeatedly.

The obvious solution is to create a separate "parser" (really a lexer) using good old regular expressions the way our ancestors intended, and simply use this one when passed an in-memory buffer.

Also means that there is a bit less inheritance abuse in the code, as PSStackParser needs to delegate to the appropriate implementation.

Fixes: #885 and #1025

Also there were some details of the PDF parsing that were incorrect. Most notably, hex strings with odd length are supposed to be padded in big-endian fashion (i.e. <abcde> is equivalent to <abcde0>) but this was not the case in the existing code (which treated this as <abcd0e> instead).

Tested on the usual test suite with nox, profiled with cProfile and time.time.

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant