Is your feature request related to a problem? Please describe.
I already have tons of PDFs that seem to have OCR data built in, i.e. I can search within the PDF, and text can be highlighted and copied to the clipboard. I am not able to search this existing data in Obsidian. With Text Extractor, many PDFs do not get OCR at all or get insufficient OCR results.
Describe the solution you'd like
Use existing embedded OCR data when available
Describe alternatives you've considered
None so far
Additional context
Some existing data is from Doxie, some is from a Canon printer/scanner
Caveats
Maybe I'm misunderstanding the fundamentals of how OCR works in my existing documents, or some other aspect of Omnisearch.
My understanding of #21 and #7 is that they are mostly related to OCR of images, extracting the main(?) text from a PDF, or falling back on performing OCR on a PDF again. On the other hand, it sounds like pdf-extract may be the component that is expected to extract the pre-OCR'd text and help make it searchable?
Text Extractor does not perform OCR on PDFs; it just uses the pdf-extract library to (try to) extract existing text data that might be present. The problem is that this library doesn't work very well and often fails, even on PDFs that contain clean or OCRed text.
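To make the requested behavior concrete, here is a minimal sketch of "use the embedded text when it is usable, otherwise fall back to something else". It is not Text Extractor's or Omnisearch's actual code: the function names, the `ocr_fallback` callback, and the 20-character threshold are all hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical threshold: fewer non-whitespace characters than this
# is treated as "extraction failed or the PDF has no text layer".
MIN_CHARS = 20


def has_usable_text(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def text_for_indexing(embedded_text: str, ocr_fallback: Callable[[], str]) -> str:
    """Prefer the text already embedded in the PDF.

    ocr_fallback is a hypothetical callback; it is only invoked when the
    embedded text looks empty or too short to be useful.
    """
    if has_usable_text(embedded_text):
        return embedded_text
    return ocr_fallback()
```

Passing the fallback as a callback means the expensive step only runs for PDFs whose embedded text is missing or too short. In practice the threshold would need tuning, since a nearly empty page can legitimately contain only a few characters.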