UnicodeDecodeError while parsing the PDF files. #17

Open
adityardesai opened this issue Apr 23, 2016 · 4 comments

Comments

@adityardesai

adityardesai commented Apr 23, 2016

Hi

I am using the NLTKRest server to parse a few of the PDF files from the Polar TREC data and extract the required NER quantities. For most of the PDF files, however, I see the following error from the REST server:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128) // Werkzeug Debugger "

The command used is:
curl -X POST -d "PDF TEXT in STRING" http://localhost:8888/nltk

Error file is attached as well.
nltkrest.txt
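
The error quoted above can be reproduced in isolation, which helps show what it means. Byte 0xc3 is the lead byte of a two-byte UTF-8 sequence (for example, "é" is encoded as 0xc3 0xa9), so the message indicates that UTF-8 encoded bytes were decoded with the ASCII codec somewhere along the way. In Python 2, which the server is likely running, this decode can happen implicitly whenever a byte string is mixed with unicode text. The following is a minimal sketch written for Python 3, where the same failure has to be triggered explicitly; the sample string is an assumption and is not taken from the Polar TREC files.

```python
# Hypothetical sample text; the real request body came from extracted PDF text.
sample = "café au lait"
raw = sample.encode("utf-8")  # the bytes a client would send: b'caf\xc3\xa9 au lait'

try:
    # ASCII only covers byte values 0-127, so 0xc3 cannot be decoded.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the codec that produced the bytes round-trips cleanly.
decoded = raw.decode("utf-8")
print(decoded == sample)
```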

@manalishah
Collaborator

manalishah commented Apr 23, 2016

Yes, that's true, @adityardesai.
You might want to use this patch until it's merged: chrismattmann#7
Alternatively, you could simply build the 'encoding-issue' branch from source.

@adityardesai
Author

Thanks for letting us know, @manalishah. I tried the given patch, but I still see the same error. Am I missing any steps apart from adding
tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py? Are there any specific build commands to run?
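
For reference, the one-line patch quoted above decodes the request body as UTF-8 before tokenizing. A slightly more defensive variant is sketched below, assuming Python 3. The helper name `to_text` and the stand-in `tokenize` function are hypothetical and are not part of NLTKRest; `tokenize` only splits on whitespace so that the sketch runs without NLTK or its tokenizer data installed.

```python
def to_text(content, encoding="utf-8"):
    """Return text for a request body that may be bytes or already text.

    Hypothetical helper for illustration; not part of NLTKRest.
    """
    if isinstance(content, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return content.decode(encoding, errors="replace")
    return content


def tokenize(text):
    # Stand-in for nltk.word_tokenize so the sketch has no external dependency.
    return text.split()


body = "Tromsø station 78°N".encode("utf-8")  # assumed sample request body
tokens = tokenize(to_text(body))
print(tokens)
```

Whether this removes the error in the server depends on where the failing decode actually happens, which the thread does not show; if the traceback points at a different line than the tokenizer call, that line would need the same treatment.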

@manalishah
Collaborator

Can you upload one such PDF file that gives you this error? I can then try to replicate the issue and resolve it. @adityardesai

@adityardesai
Author

adityardesai commented Apr 24, 2016

Sure, @manalishah. Attached is the sample file. I just added tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py and re-ran the REST server, but I get the same error again.
Sample.pdf
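
When replicating an error like this from a sample file, it can help to find exactly which bytes in the extracted text fall outside ASCII, since the error message only reports the first offending position. The sketch below assumes Python 3 (where iterating over `bytes` yields integers) and uses a made-up byte string rather than text from the attached PDF; the function name `non_ascii_context` is hypothetical.

```python
def non_ascii_context(raw, width=10):
    """Yield (position, hex byte value, surrounding bytes) for each non-ASCII byte."""
    for pos, byte in enumerate(raw):
        if byte >= 0x80:  # anything above 0x7f is outside the ASCII range
            start = max(0, pos - width)
            yield pos, hex(byte), raw[start:pos + width]


raw = "Polar dé".encode("utf-8")  # assumed sample, not from Sample.pdf
hits = list(non_ascii_context(raw))
for pos, value, context in hits:
    print(pos, value, context)
```

Running the same scan over the exact string passed to curl would show whether byte 0xc3 really sits at the position named in the server's error message.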
