[BUG] Omnisearch Fails to Extract from 15% of PDFs #7

Closed
LloydThinks opened this issue Jan 10, 2023 · 6 comments
@LloydThinks

Problem description:

Omnisearch appears to fail on many PDFs. I understand that some PDFs, particularly from decades ago, will not be able to be indexed. However, I am finding that ~15% of my PDFs are not being indexed, which concerns me.

I am well aware that this is likely not a bug with Omnisearch itself, but rather a limitation of the PDF indexing library, or something similar. However, I would like to know if there is anything I can do to understand what is going on. The Developer Console logs provide no additional information.

Your environment:

  • Omnisearch version: 1.9.1
  • Obsidian version: 1.1.9
  • Operating system: macOS Ventura 13.0
  • Number of notes in your vault (approx.): 134 MD files, 481 PDFs in a Resources folder. Omnisearch says it has 689 files total before indexing.
  • Other plugins that may be related to the issue: Full plugin list: Advanced Tables, Linter, and Omnisearch

Thank you,
Lloyd

@scambier
Owner

(I'm in the process of moving all the text extraction stuff to https://github.com/scambier/obsidian-text-extractor)

So, the PDFs are processed by https://github.com/jrmuizel/pdf-extract, and there's indeed little information when an error occurs, because the feedback given by the lib is mostly useless. If it returns empty text, there are 2 possibilities:

  • either something inside the file triggered an error in the rust lib
  • or the timeout of 2 minutes has been reached. 99% of the time that just means the file is too large. That case is logged in the console.
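
To illustrate how those two cases could be told apart, here is a minimal sketch that races an extraction call against a timer. It is not Text Extractor's actual code: `extractWithTimeout`, `ExtractionOutcome`, and the `extract` callback are hypothetical names introduced only for this example, and the callback stands in for whatever wraps the Rust library.

```typescript
// Hypothetical outcome type; not part of Text Extractor's actual API.
type ExtractionOutcome =
  | { kind: "ok"; text: string }
  | { kind: "empty" }
  | { kind: "error"; message: string }
  | { kind: "timeout"; afterMs: number };

// Runs `extract` and races it against a timer, so the caller can tell a
// timeout apart from a thrown error or an empty result.
async function extractWithTimeout(
  extract: () => Promise<string>,
  timeoutMs: number
): Promise<ExtractionOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<ExtractionOutcome>((resolve) => {
    timer = setTimeout(
      () => resolve({ kind: "timeout", afterMs: timeoutMs }),
      timeoutMs
    );
  });

  // Promise.resolve().then(...) also captures synchronous throws.
  const work = Promise.resolve()
    .then(extract)
    .then(
      (text): ExtractionOutcome =>
        text.trim().length > 0 ? { kind: "ok", text } : { kind: "empty" },
      (err): ExtractionOutcome => ({ kind: "error", message: String(err) })
    );

  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With the 2-minute limit mentioned above, a call would look like `extractWithTimeout(() => myExtractor(path), 2 * 60 * 1000)`, where `myExtractor` is again a placeholder. Note that the race only stops waiting; it does not cancel the underlying work.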

The solution I'd like to implement is to fall back to OCR for PDFs that fail text extraction.
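
A rough sketch of what such a fallback could look like, assuming both extractors are passed in as functions. Everything here (`extractWithOcrFallback`, `Extractor`, `FallbackResult`) is a hypothetical name for illustration; it does not describe how the plugin is or will be implemented.

```typescript
// Hypothetical extractor signature: resolves to the extracted text.
type Extractor = (path: string) => Promise<string>;

interface FallbackResult {
  text: string;
  source: "text-extraction" | "ocr" | "none";
}

// Tries plain text extraction first; if it throws or yields only
// whitespace, tries OCR on the same file instead.
async function extractWithOcrFallback(
  path: string,
  extractText: Extractor,
  runOcr: Extractor
): Promise<FallbackResult> {
  let text = "";
  try {
    text = await extractText(path);
  } catch {
    // Assumption: a thrown error is treated the same as an empty result.
    text = "";
  }
  if (text.trim().length > 0) {
    return { text, source: "text-extraction" };
  }

  const ocrText = await runOcr(path);
  return ocrText.trim().length > 0
    ? { text: ocrText, source: "ocr" }
    : { text: "", source: "none" };
}
```

Recording the `source` alongside the text would make it possible to see afterwards which files needed OCR and which yielded nothing at all.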

@LloydThinks
Author

Ah okay. A limitation for now I will learn to live with. Better to index 85% of the files instead of 0% of the files.

Thank you for the great plugin!

@scambier transferred this issue from scambier/obsidian-omnisearch on Jan 23, 2023
@gwillcox-r7

gwillcox-r7 commented Mar 15, 2023

@scambier Just for confirmation: would this mean that if a PDF were made up mostly of images, it would be able to parse the text from those images? I'm not sure if this ties into this request or if it should be a separate feature request; if so, let me know and I can remove this comment and raise a separate feature request.

Edit: So apparently this can get confusing. When a PDF contains images but a separate tool or process such as Adobe has applied OCR to turn the images into text, this tool, and in turn Omnisearch, finds that text fine. However, let's say the PDF also has an image containing the word "XLS", and you can't open a PDF viewer and highlight that text (like you could with the OCRed text). In that case the text can't be extracted from the image.

@scambier
Owner

Image files (.png, .jpg) go through OCR, and PDFs go through the text extraction (no OCR involved at all). Image data in PDFs is completely ignored. I've had conflicting reports on "pre-OCRed PDFs", I guess it depends on the tool and the OS.
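
As a small illustration of that routing, here is a sketch that picks a pipeline from the file extension. The function name `pipelineFor` and the `Pipeline` labels are made up for this example, and only the extensions named above are handled; the plugin's real dispatch logic may differ.

```typescript
// Hypothetical labels for the two paths described above.
type Pipeline = "ocr" | "pdf-text-extraction" | "unsupported";

// Chooses a pipeline from the (case-insensitive) file extension.
function pipelineFor(fileName: string): Pipeline {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  switch (ext) {
    case "png":
    case "jpg":
      return "ocr"; // image files go through OCR
    case "pdf":
      return "pdf-text-extraction"; // no OCR; embedded images are ignored
    default:
      return "unsupported";
  }
}
```

Under this model a scanned PDF with no text layer takes the `pdf-text-extraction` path and comes back empty, which matches the behavior described in this thread.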

@L1Z3

L1Z3 commented Sep 27, 2023

I mentioned this in another issue, but I feel like I should bring it up here too since it might help someone. I have found that for numerous PDFs that Text Extractor fails to extract, it is possible to get it to consistently extract text using the following workaround:

  1. Load the PDF into Xournal++
  2. Immediately export the PDF (File -> Export as PDF).
  3. Load the modified PDF into Obsidian
  4. "Right click -> Clear cache for this file" on the exported PDF
  5. "Right click -> Extract text to clipboard" on the exported PDF

In my testing, this consistently fixes the issue. It is currently a fairly manual process, but I imagine it could be scripted and/or introduced into Text Extractor or a companion plugin if we can reproduce Xournal++'s PDF export behavior (and confirm that this fixes the issue more broadly instead of just for the problem PDFs that I tested).

@scambier
Owner

Issues related to PDF extraction are centralized here: #21

@scambier closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 16, 2023