[BUG] Inserts extraneous spaces in PDF #41

L1Z3 · 2023-09-20T13:10:17Z

Problem description:

I have a few PDFs with embedded text that Text Extractor fails to extract correctly. For instance, see this PDF. On many of its slides, Text Extractor inserts numerous extraneous spaces. For example, on page 6 of this PDF, Text Extractor extracts this line:
N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o
whereas copying and pasting directly from the PDF within Obsidian or another PDF viewer gives:
Network sockets are file descriptors too

Your environment:

Plugin version: 0.4.6
Obsidian version: 1.4.13
Operating system: Fedora 38
Number of images/PDFs in your vault (approx.): 6
Other plugins that may be related to the issue: N/A

L1Z3 · 2023-09-27T15:31:15Z

Update: I have found a workaround to this issue. If I load the PDF into Xournal++ and then export it as a PDF from there, Text Extractor suddenly has no problems. I suspect the issue has something to do with how the original app that produced the PDF encoded the text and that Xournal++ converts it to some different format, but this is just speculation.

L1Z3 · 2023-09-27T15:39:49Z

In principle, the behavior Xournal++ uses to export the PDF could be replicated in an Obsidian plugin, which could be a very hacky way to work around this issue in a semi-automated way. I have also encountered the issue in #7, in which some PDFs fail to extract altogether, and using Xournal++ to re-export such PDFs also seems to allow their text to be extracted for me (at least after a couple of "Clear cache for this file" -> "Extract text to clipboard" cycles).
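
One piece of a semi-automated workaround would be deciding which PDFs need re-exporting at all. The following is only a rough sketch in Python, not anything Text Extractor does: it flags extracted text in which most whitespace-separated tokens are single letters, as in the page 6 line quoted in the problem description. The function names and the 0.5 threshold are assumptions chosen for illustration, and the heuristic could misfire on text that legitimately contains many one-letter tokens.

```python
# Hypothetical helper names; the threshold is an illustrative assumption.

def single_letter_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one letter long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    single = sum(1 for token in tokens if len(token) == 1 and token.isalpha())
    return single / len(tokens)


def looks_fragmented(text: str, threshold: float = 0.5) -> bool:
    """Guess whether extracted text suffers from extraneous spaces."""
    return single_letter_ratio(text) > threshold


# The two lines quoted in the problem description.
GARBLED = "N e tw o rk s o c k e ts a r e f i l e d e s c ri p to rs to o"
CLEAN = "Network sockets are file descriptors too"

if __name__ == "__main__":
    print(looks_fragmented(GARBLED))  # True
    print(looks_fragmented(CLEAN))    # False
```

A file flagged this way could then be queued for the manual Xournal++ re-export described above. Note that this only detects the symptom; it does not attempt to repair the text, since the original word boundaries cannot be recovered reliably from the garbled output.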

jh0274 · 2023-10-12T14:00:28Z

could be linked to this:

https://bugs.ghostscript.com/show_bug.cgi?id=696116

and:

tesseract-ocr/tesseract#373 ?

scambier · 2023-10-12T14:04:26Z

Nope, PDFs are not OCR'ed, just processed by https://github.com/jrmuizel/pdf-extract

jh0274 · 2023-10-12T14:39:42Z

ah! of course.. thanks

i'm getting this problem for 2 things:

when i extract text from pdf
when i search using Omnisearch (which uses text-extractor i believe?)

scambier · 2023-10-12T15:00:11Z

Yes, Omnisearch uses data from Text Extractor so that's expected

scambier closed this as not planned on Dec 16, 2023
