Facing Error for type: [[ application/octet-stream ]] while extracting from URL #5

Open
Manas-E opened this issue Feb 19, 2022 · 3 comments


Manas-E commented Feb 19, 2022

I want to extract text from a URL that points to a PDF file hosted on Firebase.
The error occurs only when extracting from the URL; extraction works correctly with a local PDF file.

Here are the logs:
2022-02-19T16:31:26+0530 <error> processing.js:97 () Error: Error for type: [[ application/octet-stream ]], file: [[ C:\Users\lenovo\AppData\Local\Temp\60980467322.pdf ]]
    at extract (C:\Users\lenovo\Desktop\React\Open-Source\linkin_port-\node_modules\textract\lib\extract.js:151:15)
    at Timeout._onTimeout (C:\Users\lenovo\Desktop\React\Open-Source\linkin_port-\node_modules\textract\lib\extract.js:159:7)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7) {
  typeNotFound: true
}

Version: "@nosferatu500/textract": "^3.0.3"

Please suggest what can be done, @nosferatu500.

@nosferatu500
Owner

Can you check if this works with the original library?
https://github.com/dbashford/textract

I only updated a few dependencies in this fork to fix the CVEs that were found.

If it works with the original library but not with this fork, I can take a look in a few days.

Author

Manas-E commented Feb 19, 2022

@nosferatu500 No, the original library gives the same error. I tried your version hoping it would fix that error, but it still persists.
If you could tell me what this error means, I can try debugging it.

Author

Manas-E commented Feb 19, 2022

@nosferatu500 I debugged it and found that the file type was being passed as a generic binary file (application/octet-stream), although the file is actually a PDF. So for URLs we have to pass the typeOverride option and provide the file's MIME type, like this:

textract.fromUrl(FIREBASE_URL, { typeOverride: "application/pdf" }, function (error, text) {
  console.log("started");
  console.log(text, error);
});
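
The workaround above hard-codes "application/pdf". If the URLs can point to more than one kind of file, one possible approach is to derive the override from the file extension in the URL path. The following is only a rough sketch: the helper name `optionsForUrl`, the `MIME_BY_EXTENSION` table, and the example URL are hypothetical and not part of textract. It also assumes the URL path still ends with the original file name, which may not hold for every host.

```javascript
const path = require("path");

// Hypothetical lookup table (not part of textract); extend it for the
// file types you actually serve.
const MIME_BY_EXTENSION = {
  ".pdf": "application/pdf",
  ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".txt": "text/plain",
};

// Hypothetical helper: builds the options object for textract.fromUrl.
// Returns { typeOverride: <mime> } when the extension is in the table,
// otherwise an empty object so textract uses the server's content type.
function optionsForUrl(fileUrl) {
  // Storage URLs often percent-encode the object path (e.g. "%2F"),
  // so decode the pathname before reading the extension.
  const pathname = decodeURIComponent(new URL(fileUrl).pathname);
  const extension = path.posix.extname(pathname).toLowerCase();
  const mime = MIME_BY_EXTENSION[extension];
  return mime ? { typeOverride: mime } : {};
}

// Usage sketch (requires the textract package and network access):
// const textract = require("@nosferatu500/textract");
// textract.fromUrl(fileUrl, optionsForUrl(fileUrl), function (error, text) {
//   console.log(text, error);
// });
```

The query string (for example an access token) is ignored because only the URL's pathname is inspected. When the extension is unknown, no override is set and the behavior is the same as calling fromUrl without typeOverride.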
