[BUG] Omnisearch Fails to Extract from 15% of PDFs #7

LloydThinks · 2023-01-10T10:29:54Z

Problem description:

Omnisearch appears to fail on many PDFs. I understand that some PDFs, particularly those from decades ago, cannot be indexed. However, I am finding that ~15% of my PDFs are not being indexed, which concerns me.

I am well aware that this is likely not a bug with Omnisearch itself, but rather a limitation of the PDF indexing library, or something similar. However, I would like to know if there is anything I can do to understand what is going on. The Developer Console logs provide no additional information.

Your environment:

Omnisearch version: 1.9.1
Obsidian version: 1.1.9
Operating system: macOS Ventura 13.0
Number of notes in your vault (approx.): 134 MD files, 481 PDFs in a Resources folder. Omnisearch says it has 689 files total before indexing.
Other plugins that may be related to the issue: Full plugin list: Advanced Tables, Linter, and Omnisearch

Thank you,
Lloyd

scambier · 2023-01-10T18:12:49Z

(I'm in the process of moving all the text extraction stuff to https://github.com/scambier/obsidian-text-extractor)

So, the PDFs are processed by https://github.com/jrmuizel/pdf-extract, and there's indeed little information in case of an error, because the feedback given by the lib is mostly useless. There are 2 possibilities if it returns an empty text:

either something inside the file triggered an error in the Rust lib
or the 2-minute timeout was reached. 99% of the time, that just means the file is too large. That case is logged in the console.
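
To make the two cases concrete, here is a minimal Python sketch of how a caller could tell them apart. It is not the plugin's actual code: `extract_with_timeout`, the `extract` callable, and the reason strings are hypothetical names, and only the 2-minute default comes from the comment above.

```python
import concurrent.futures
from typing import Callable, Tuple


def extract_with_timeout(
    extract: Callable[[str], str],
    path: str,
    timeout_s: float = 120.0,  # the 2-minute limit mentioned above
) -> Tuple[str, str]:
    """Run a (hypothetical) extractor and report why the text is empty.

    Returns (text, reason), where reason is one of
    "ok", "error", "timeout", or "empty".
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(extract, path)
    try:
        text = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The extractor did not finish in time (often a very large file).
        return "", "timeout"
    except Exception:
        # Something inside the file made the extractor raise.
        return "", "error"
    finally:
        # Do not block on a still-running worker; a thread cannot be killed,
        # so a real implementation would likely use a separate process.
        pool.shutdown(wait=False)
    return (text, "ok") if text.strip() else ("", "empty")
```

Returning a reason alongside the text is one way to surface more than "empty text" to the user, for example by logging the reason next to the file path.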

The solution I'd like to implement is to fall back to using OCR on failed PDFs.
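
A rough sketch of that fallback idea in Python, assuming two injected callables: `extract_text` and `run_ocr` are placeholders for illustration, not functions from Omnisearch, Text Extractor, or pdf-extract.

```python
from typing import Callable


def extract_pdf_text(
    path: str,
    extract_text: Callable[[str], str],  # hypothetical plain text extractor
    run_ocr: Callable[[str], str],       # hypothetical OCR pass over the PDF
) -> str:
    """Try normal text extraction first; use OCR only when it yields nothing."""
    try:
        text = extract_text(path)
    except Exception:
        # Treat an extractor error the same as an empty result.
        text = ""
    if text.strip():
        return text
    # Extraction failed or returned only whitespace: fall back to OCR.
    return run_ocr(path)
```

Since OCR is typically much slower than text extraction, running it only on the failed files keeps the cost proportional to the failure rate rather than to the whole vault.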

LloydThinks · 2023-01-17T10:36:46Z

Ah okay. A limitation for now I will learn to live with. Better to index 85% of the files instead of 0% of the files.

Thank you for the great plugin!

gwillcox-r7 · 2023-03-15T19:05:14Z

@scambier Just to confirm: would this mean that if a PDF were made up mostly of images, it would be able to parse the text from those images? I'm not sure whether this ties into this request or should be a separate feature request; if it should, let me know and I can remove this comment and raise one.

Edit: This can get confusing. If a PDF consists of images but a separate tool or process (such as Adobe) has already applied OCR to turn those images into text, this tool, and in turn Omnisearch, finds that text fine. However, say such a PDF also contains an image with the word "XLS" in it, and you can't select that text in a PDF viewer the way you can the other text. In that case, the text in that image can't be extracted.

scambier · 2023-03-15T20:20:16Z

Image files (.png, .jpg) go through OCR, and PDFs go through the text extraction (no OCR involved at all). Image data in PDFs is completely ignored. I've had conflicting reports on "pre-OCRed PDFs", I guess it depends on the tool and the OS.
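
The routing described above could be pictured like this. It is only an illustrative Python sketch: `pick_pipeline` and the returned labels are made-up names, and the "unsupported" branch is an assumption about other file types.

```python
from pathlib import Path

# Extensions named in the comment above; the real list may be longer.
IMAGE_EXTENSIONS = {".png", ".jpg"}


def pick_pipeline(filename: str) -> str:
    """Return which (hypothetical) pipeline a file would go through."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "ocr"  # standalone images are OCRed
    if suffix == ".pdf":
        return "text-extraction"  # no OCR; embedded images are ignored
    return "unsupported"  # assumption: everything else is skipped here
```

Under this model, an image embedded in a PDF never reaches the OCR branch, because the decision is made per file, not per object inside the file.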

L1Z3 · 2023-09-27T15:52:33Z

I mentioned this in another issue, but I feel like I should bring it up here too since it might help someone. I have found that for numerous PDFs that Text Extractor fails to extract, it is possible to get it to consistently extract text using the following workaround:

Load the PDF into Xournal++
Immediately export the PDF (File -> Export as PDF).
Load the modified PDF into Obsidian
"Right click -> Clear cache for this file" on the exported PDF
"Right click -> Extract text to clipboard" on the exported PDF

In my testing, this consistently fixes the issue. It is currently a fairly manual process, but I imagine it could be scripted and/or introduced into Text Extractor or a companion plugin if we can reproduce Xournal++'s PDF export behavior (and confirm that this fixes the issue more broadly, not just for the problem PDFs that I tested).
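
As a starting point for scripting the first two steps, here is an untested Python sketch. It relies on Xournal++'s `--create-pdf` command-line option, and it assumes that the option accepts a PDF as its input file and produces the same output as File -> Export as PDF; neither assumption has been verified here. The function names and the `binary` parameter are hypothetical.

```python
import subprocess
from typing import List


def build_reexport_command(src: str, dst: str, binary: str = "xournalpp") -> List[str]:
    """Build the command line that asks Xournal++ to export `src` to `dst`.

    Assumption: passing a PDF as the input file works the same way as
    opening it in the GUI and exporting it again.
    """
    return [binary, f"--create-pdf={dst}", src]


def reexport_pdf(src: str, dst: str) -> None:
    """Run the export; raises CalledProcessError if Xournal++ exits non-zero."""
    subprocess.run(build_reexport_command(src, dst), check=True)
```

For example, `reexport_pdf("problem.pdf", "problem-fixed.pdf")` would write the re-exported copy next to the original. The cache-clearing and extraction steps would still need to be done from within Obsidian.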

scambier · 2023-12-16T09:45:43Z

Issues related to PDF extraction are centralized here: #21

scambier transferred this issue from scambier/obsidian-omnisearch Jan 23, 2023

L1Z3 mentioned this issue Sep 27, 2023

[BUG] Inserts extraneous spaces in PDF #41

Closed

scambier closed this as not planned Dec 16, 2023

figadore mentioned this issue Jan 27, 2024

[Feature request] Search existing PDFs with their own embedded OCR data #49

Closed
