[BUG] Inserts extraneous spaces in PDF #41
Comments
Update: I have found a workaround for this issue. If I load the PDF into Xournal++ and then export it as a PDF from there, Text Extractor suddenly has no problems. I suspect the issue has something to do with how the original app that produced the PDF encoded the text, and that Xournal++ converts it to a different format, but this is just speculation.
In principle, the way Xournal++ exports the PDF could be replicated in an Obsidian plugin, which could be a very hacky way to work around this issue semi-automatically. I have also encountered the issue in #7, in which some PDFs fail to extract altogether; re-exporting such PDFs with Xournal++ also seems to allow their text to be extracted for me (at least after a couple of "Clear cache for this file" -> "Extract text to clipboard" cycles).
Could be linked to this: https://bugs.ghostscript.com/show_bug.cgi?id=696116 and:
Nope, PDFs are not OCR'ed, just processed by https://github.com/jrmuizel/pdf-extract
Ah, of course, thanks! I'm getting this problem for 2 things:
Yes, Omnisearch uses data from Text Extractor, so that's expected.
Problem description:
I have a few PDFs with embedded text that Text Extractor fails to extract correctly. For instance, see this PDF. On many of the slides, Text Extractor inserts numerous extraneous spaces. On page 6, for example, it extracts this line:
N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o
whereas copying and pasting directly from the PDF within Obsidian or another PDF viewer gives:
Network sockets are file descriptors too
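To make the symptom easier to spot in bulk, here is a minimal sketch of a heuristic that flags lines fragmented like the one above. It is not part of Text Extractor or pdf-extract; the function names (`mean_token_length`, `looks_fragmented`, `same_letters`) and the thresholds are assumptions chosen for illustration, and the check will likely misfire on lines that legitimately consist of very short tokens.

```python
def mean_token_length(line: str) -> float:
    """Average length of the whitespace-separated tokens in a line."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


def looks_fragmented(line: str, max_mean_len: float = 2.0, min_tokens: int = 5) -> bool:
    """Hypothetical heuristic: many tokens that are almost all 1-2 characters long.

    Both thresholds are guesses, not values taken from any real tool.
    """
    tokens = line.split()
    return len(tokens) >= min_tokens and mean_token_length(line) < max_mean_len


def same_letters(a: str, b: str) -> bool:
    """True if two lines are identical once all whitespace is removed."""
    return "".join(a.split()) == "".join(b.split())


# The two lines quoted in this report.
extracted = "N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o"
expected = "Network sockets are file descriptors too"

print(looks_fragmented(extracted))        # garbled extraction
print(looks_fragmented(expected))         # text as copied from a PDF viewer
print(same_letters(extracted, expected))  # only the spacing differs
```

For the two lines from this report, the sketch prints `True`, `False`, and `True`: the extracted line is flagged, the copied line is not, and the two differ only in whitespace.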
Your environment: