[BUG] Inserts extraneous spaces in PDF #41

Closed
L1Z3 opened this issue Sep 20, 2023 · 6 comments

Comments

L1Z3 commented Sep 20, 2023

Problem description:

A few of my PDFs contain embedded text that Text Extractor fails to extract correctly. For instance, see this PDF. On many of the slides, Text Extractor inserts numerous extraneous spaces. For example, on page 6 of this PDF, Text Extractor extracts this line:
N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o
whereas copying and pasting directly from the PDF within Obsidian or another PDF viewer gives:
Network sockets are file descriptors too
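
A rough way to flag this kind of output automatically is sketched below. It is not part of Text Extractor or pdf-extract: the function names and the 2.0 threshold are assumptions chosen for illustration. The idea is that fragmented lines consist almost entirely of one- and two-character tokens, and that stripping all whitespace shows the characters themselves were extracted correctly.

```python
def mean_token_length(line: str) -> float:
    """Average length of the whitespace-separated tokens in a line."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


def looks_fragmented(line: str, threshold: float = 2.0, min_tokens: int = 5) -> bool:
    """Heuristic (assumed threshold): many tokens, almost all very short."""
    tokens = line.split()
    return len(tokens) >= min_tokens and mean_token_length(line) < threshold


def same_ignoring_spaces(a: str, b: str) -> bool:
    """True if two lines contain the same characters once whitespace is removed."""
    return "".join(a.split()) == "".join(b.split())


extracted = "N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o"
expected = "Network sockets are file descriptors too"

print(looks_fragmented(extracted))                # fragmented output is flagged
print(looks_fragmented(expected))                 # normal text is not
print(same_ignoring_spaces(extracted, expected))  # only the spacing differs
```

A check like this only detects the symptom. It cannot restore the correct word boundaries, because the information about where the real spaces belong is lost in the extracted line.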

Your environment:

  • Plugin version: 0.4.6
  • Obsidian version: 1.4.13
  • Operating system: Fedora 38
  • Number of images/PDFs in your vault (approx.): 6
  • Other plugins that may be related to the issue: N/A

L1Z3 commented Sep 27, 2023

Update: I have found a workaround for this issue. If I load the PDF into Xournal++ and then export it as a PDF from there, Text Extractor suddenly has no problems. I suspect the issue has something to do with how the app that originally produced the PDF encoded the text, and that Xournal++ converts it to a different format, but this is just speculation.

L1Z3 commented Sep 27, 2023

In principle, the way Xournal++ exports the PDF could be replicated in an Obsidian plugin, which would be a very hacky way to work around this issue in a semi-automated way. I have also encountered the issue in #7, in which some PDFs fail to extract altogether. Using Xournal++ to re-export such PDFs also seems to allow their text to be extracted for me (at least after a couple of "Clear cache for this file" -> "Extract text to clipboard" cycles).

jh0274 commented Oct 12, 2023

@scambier (Owner)

Nope, PDFs are not OCR'ed, just processed by https://github.com/jrmuizel/pdf-extract

jh0274 commented Oct 12, 2023

ah! of course.. thanks

I'm seeing this problem in two places:

  • when I extract text from a PDF
  • when I search using Omnisearch (which uses Text Extractor, I believe?)

@scambier (Owner)

Yes, Omnisearch uses data from Text Extractor, so that's expected.

scambier closed this as not planned Dec 16, 2023