Incorrectly Parsed Object on Microsoft invoice PDF #119
https://www.pdfen.com/pdf-a-validator gives no errors for the file. https://www.datalogics.com/products/pdftools/pdf-checker/ gives the following output for it, suggesting that the only error is some missing fonts?
Hello @DanielJackson-Oslo! I ran the
Two of the stream objects contained in this file are corrupt. This is why pdf-lib throws an error when trying to parse it. That being said, I think it would be possible to adapt pdf-lib's parser to tolerate these specific stream errors. I'll look into this and get back to you.
I just cut version You can install this prerelease with npm:
It's also available on unpkg:
Please try it out and let me know if it works for you!
@DanielJackson-Oslo I'd like to add the It looks like it might be a test billing statement? But I can't tell for sure since it's not written in English.
@Hopding Feel free to use it! It's a bill for my own Office 365, presumably the same one they generate for all customers. Thanks for the quick follow-up. Looking forward to the 0.6.4 release. How stable is the RC?
@Hopding Since this isn't the first time this sort of problem has come up, I'd imagine there are hundreds, if not thousands, of different ways that PDFs can be malformed but still render in most PDF readers, and thus exist in the wild.
I don't know much about the technical nature of PDFs, but for my use case, I'd really only need pdf-lib to recognize where the PDF pages start, and then copy those into a new PDF without further validating them. (All I want to do is merge two PDFs; I don't need any control over or understanding of the contents.)
I see that there's a "copy" function in the library. Is that what that function does? If not, could I somehow help write a "merge blindly" function?
@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it's working well for you, then there shouldn't be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes.)
It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference), it is still necessary to parse some objects.
Implementing this sort of "lazy parsing" would take more than just writing a function, though. It would be necessary to modify some of pdf-lib's parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!
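To make the "lazy parsing" idea above concrete, here is a minimal sketch. It is not pdf-lib's code, and the names `ObjectSpan`, `indexObjects`, and `reachableFrom` are hypothetical. It assumes the file is available as a plain string, ignores the cross-reference table and object streams, and uses a deliberately naive keyword scan. The point is only that object boundaries can be recorded first, and bodies read later for just the objects reachable from a chosen root, so a corrupt object that nothing references is never inspected.

```typescript
// Location of one indirect object's body, recorded without parsing it.
interface ObjectSpan {
  start: number; // first offset after the "N G obj" header
  end: number;   // offset of the closing "endobj" keyword
}

// Pass 1: record where each object lives. Naive on purpose: a stream whose
// bytes happen to contain "endobj" would fool this scan.
function indexObjects(pdf: string): Record<string, ObjectSpan> {
  const index: Record<string, ObjectSpan> = {};
  const header = /(\d+)\s+(\d+)\s+obj\b/g;
  let m: RegExpExecArray | null;
  while ((m = header.exec(pdf)) !== null) {
    const start = m.index + m[0].length;
    const end = pdf.indexOf("endobj", start);
    if (end === -1) break; // truncated file: stop indexing
    index[`${m[1]} ${m[2]}`] = { start, end };
    header.lastIndex = end + "endobj".length; // skip the body entirely
  }
  return index;
}

// Pass 2: starting from one object, follow "N G R" references breadth-first
// and return the keys of the objects that are actually reachable.
function reachableFrom(
  pdf: string,
  index: Record<string, ObjectSpan>,
  root: string
): string[] {
  const seen: Record<string, boolean> = {};
  const queue: string[] = [root];
  const order: string[] = [];
  while (queue.length > 0) {
    const key = queue.shift() as string;
    if (seen[key] || !(key in index)) continue;
    seen[key] = true;
    order.push(key);
    const body = pdf.slice(index[key].start, index[key].end);
    const ref = /(\d+)\s+(\d+)\s+R\b/g;
    let r: RegExpExecArray | null;
    while ((r = ref.exec(body)) !== null) queue.push(`${r[1]} ${r[2]}`);
  }
  return order;
}

// Tiny made-up document: object 4 is garbage, but nothing references it.
const sample = [
  "%PDF-1.4",
  "1 0 obj",
  "<< /Type /Catalog /Pages 2 0 R >>",
  "endobj",
  "2 0 obj",
  "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
  "endobj",
  "3 0 obj",
  "<< /Type /Page /Parent 2 0 R >>",
  "endobj",
  "4 0 obj",
  "<< /Length (oops stream garbage",
  "endobj",
].join("\n");

const index = indexObjects(sample);
const reachable = reachableFrom(sample, index, "1 0");
console.log(reachable); // object "4 0" is indexed but never read
```

A real implementation would more likely start from the trailer's `/Root` entry and the cross-reference table rather than a keyword scan, and would still need to parse dictionaries properly to tell references apart from look-alike bytes inside streams.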
@Hopding 0.6.4rc1 fixes the issue on my end! 🎉 Should I close this thread?
I'll read up a bit on PDF structure, and open a new issue for it. Thank you so much for the active help!
Hi!
Thanks for a really welcome module.
I'm encountering thousands of different kinds of PDFs generated by other people, and got into some trouble with one specific one from Microsoft, getting the following error:
Incorrectly parsed object contents
These are the PDFs I'm trying to combine. I think the offending one is the top one, as it's the only one not generated by Puppeteer:
Din_Microsoft-fakturaoversikt.pdf
3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf
Presumably the PDF doesn't follow the standards, though there's little I can do about that.
My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don't really need to parse it any more than what's needed to append it to my PDF.
My code looks as follows: