[Feature request] Search existing PDFs with their own embedded OCR data #49

figadore · 2023-12-14T23:55:08Z

Is your feature request related to a problem? Please describe.
I already have tons of PDFs that appear to have OCR data built in, i.e. I can search within the PDF, and text is highlighted and can be copied to the clipboard. I am not able to search this existing data in Obsidian. With Text Extractor, many PDFs get no OCR at all or get insufficient OCR results.

Describe the solution you'd like
Use existing embedded OCR data when available

Describe alternatives you've considered
None so far

Additional context
Some existing data is from Doxie, some is from a Canon printer/scanner

Caveats
Maybe I'm misunderstanding the fundamentals of how OCR works in my existing documents, or some other aspect of Omnisearch.

scambier · 2023-12-16T09:36:07Z

See #21

figadore · 2024-01-27T06:48:23Z

> See #21

My understanding of #21 and #7 is that they are mostly related to OCR of images, extracting the main(?) text from a PDF, or falling back on performing OCR on a PDF again. On the other hand, it sounds like pdf-extract may be the component that is expected to extract the already-OCR'd text and help make it searchable?

scambier · 2024-01-27T08:10:47Z

Text Extractor does not perform OCR on PDFs, it just uses the pdf-extract library to (try to) extract existing text data that might be present. The problem is that this library doesn't work very well and often fails even on PDFs that contain clean or OCRed text.
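
For illustration only, the behaviour requested in this thread ("use existing embedded text when available") could be sketched roughly as below. This is not how Text Extractor or pdf-extract is implemented. The names `has_usable_text`, `text_for_indexing`, `extract_embedded` and `run_ocr` are hypothetical, the two extractors are passed in as plain callables rather than tied to any real library, and the 20-character threshold is an arbitrary assumption.

```python
from typing import Callable


def has_usable_text(text: str, min_chars: int = 20) -> bool:
    """Return True if text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def text_for_indexing(
    pdf_path: str,
    extract_embedded: Callable[[str], str],  # hypothetical: reads the PDF's existing text layer
    run_ocr: Callable[[str], str],           # hypothetical: re-runs OCR on the PDF
    min_chars: int = 20,                     # assumed threshold, not taken from the thread
) -> str:
    """Prefer the embedded text layer; fall back to OCR if it is missing or too short."""
    try:
        embedded = extract_embedded(pdf_path)
    except Exception:
        # Treat an extractor failure the same as "no embedded text".
        embedded = ""
    if has_usable_text(embedded, min_chars):
        return embedded
    return run_ocr(pdf_path)
```

With stub extractors, `text_for_indexing("scan.pdf", lambda p: "", lambda p: "text from OCR")` falls through to the OCR callable, while a PDF whose embedded layer returns a full sentence is indexed from that layer without OCR being run at all.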

scambier transferred this issue from scambier/obsidian-omnisearch Dec 16, 2023

scambier closed this as not planned Dec 16, 2023
