[Feature request] Search existing PDFs with their own embedded OCR data #49

Closed
figadore opened this issue Dec 14, 2023 · 3 comments

@figadore

Is your feature request related to a problem? Please describe.
I already have many PDFs that seem to contain built-in OCR data, i.e. I can search within the PDF, and the matching text is highlighted and can be copied to the clipboard. I am not able to search this existing data in Obsidian. With Text Extractor, many PDFs get no OCR at all or get insufficient OCR results.

Describe the solution you'd like
Use existing embedded OCR data when available

Describe alternatives you've considered
None so far

Additional context
Some existing data is from Doxie, some is from a Canon printer/scanner

Caveats
Maybe I'm misunderstanding the fundamentals of how OCR works in my existing documents, or some other aspect of Omnisearch.

scambier transferred this issue from scambier/obsidian-omnisearch on Dec 16, 2023
@scambier
Owner

See #21

scambier closed this as not planned on Dec 16, 2023
@figadore
Author

figadore commented Jan 27, 2024

See #21

My understanding of #21 and #7 is that they are mostly about OCR of images, extracting the main(?) text from a PDF, or falling back on performing OCR on a PDF again. On the other hand, it sounds like pdf-extract may be the component that is expected to extract the pre-OCRed text and make it searchable?

@scambier
Owner

Text Extractor does not perform OCR on PDFs; it just uses the pdf-extract library to (try to) extract any existing text data that might be present. The problem is that this library doesn't work very well and often fails, even on PDFs that contain clean or OCRed text.
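
For anyone who wants to check whether a given PDF carries an embedded text layer at all, as opposed to being image-only, here is a rough sketch in Python. It is not how Text Extractor or pdf-extract works; the function name `has_text_layer` is made up for this example. It only looks for the PDF text-showing operators `Tj` and `TJ` inside content streams, assumes streams are either uncompressed or Flate-compressed, and will miss encrypted files or other filters, so treat a negative result as inconclusive.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# A string or array operand followed by a text-showing operator (Tj or TJ).
TEXT_OP_RE = re.compile(rb"[)\]>]\s*(?:Tj|TJ)\b")


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to draw text (hypothetical helper)."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # Most content streams use the FlateDecode (zlib) filter.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data: assume the stream is stored uncompressed.
            data = raw
        # Text is drawn between BT/ET using Tj or TJ.
        if b"BT" in data and TEXT_OP_RE.search(data):
            return True
    return False
```

Usage would be something like `has_text_layer(open("scan.pdf", "rb").read())`. If this returns `True` but the PDF still yields no text in Obsidian, that matches the extraction failure described above rather than a missing OCR layer.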
