
rails invalid byte sequence in UTF-8 #135

Open

ghost opened this issue Sep 25, 2015 · 1 comment

ghost commented Sep 25, 2015

Hello, I get this error when trying to OCR this PDF document: https://www.dropbox.com/s/ko76kalp5p59hwc/contrato%20de%20fianza%20prueba%2010.pdf?dl=0

The code that fails is:
Docsplit.extract_text(attachment.path, :output => output_dir, :language => 'spa')

I have tried using:

Docsplit.extract_text(attachment.path, :output => output_dir, :language => 'spa', :no_clean => true)
Docsplit.extract_text(attachment.path, :output => output_dir, :language => 'spa', :no_clean => false)
Docsplit.extract_text(attachment.path, :output => output_dir, :no_clean => true)
Docsplit.extract_text(attachment.path, :output => output_dir, :no_clean => false)

but none of the above helps; it still fails. Many other PDF documents work fine.

My environment:
Rails 4.2
Ruby 2.2
Docsplit 0.7.6
tesseract-ocr 3.03
tesseract-ocr-spa 3.02

Any help please?

tbk303 commented Jan 5, 2016

Check PR #134; it might fix your problem.
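
For anyone trying to reproduce this outside Docsplit: Ruby reports "invalid byte sequence in UTF-8" when a string tagged as UTF-8 contains bytes that are not valid UTF-8 and is then passed to a regex-based method such as gsub. The sketch below is plain Ruby with no Docsplit calls. The sample string and the assumption that the OCR output contains Latin-1 bytes are hypothetical, and this is not a description of what PR #134 changes.

```ruby
# Illustration only: plain Ruby (2.1+ for String#scrub), no Docsplit involved.
# Hypothetical sample: "ó" stored as the single Latin-1 byte 0xF3,
# but the string is tagged as UTF-8, so the byte sequence is invalid.
raw = "contrato de fianza n\xF3mero 10".b.force_encoding(Encoding::UTF_8)

# The string claims to be UTF-8 but is not valid UTF-8.
is_valid = raw.valid_encoding?

# Regex-based methods refuse to run on such a string.
error_message = begin
  raw.gsub(/\s+/, " ")
  nil
rescue ArgumentError => e
  e.message
end

# Option 1: replace each invalid byte with a placeholder character.
scrubbed = raw.scrub("?")

# Option 2: if the bytes are known to be Latin-1 (an assumption here),
# reinterpret them and transcode to real UTF-8.
recovered = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts is_valid
puts error_message.inspect
puts scrubbed
puts recovered
```

Scrubbing loses the original character, while transcoding keeps it but only works when the source encoding is actually known, so which one is appropriate depends on what the OCR step really emitted for this PDF.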
