Incorrectly Parsed Object on Microsoft invoice PDF #119

Closed
DanielJackson-Oslo opened this issue Jun 5, 2019 · 8 comments

DanielJackson-Oslo commented Jun 5, 2019

Hi!

Thanks for a really welcome module.

I'm encountering thousands of different kinds of PDFs generated by other people, and got into some trouble with one specific one from Microsoft, getting the following error:

Incorrectly parsed object contents

These are the PDFs I'm trying to combine. I think the offending one is the first, as it's the only one not generated by Puppeteer:
Din_Microsoft-fakturaoversikt.pdf
3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf

Presumably the PDF doesn't follow the standards, though there's little I can do about that.

My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don't really need to parse it any more than what's needed to append it to my PDF.

My code looks as follows:

const fs = require('fs');
const { PDFDocumentFactory, PDFDocumentWriter } = require('pdf-lib');

// pdfsToMerge is an array of file paths; `logger` is the application's own logger
async function mergePdfs(pdfsToMerge, filePath) {
  const mergedPdf = PDFDocumentFactory.create();
  pdfsToMerge.forEach(pdfFilePath => {
    const pdf = fs.readFileSync(pdfFilePath);
    // load() parses the input PDF before its pages can be read
    const pagesToMerge = PDFDocumentFactory.load(pdf).getPages();
    pagesToMerge.forEach(page => {
      mergedPdf.addPage(page);
    });
  });
  const mergedPdfFile = await PDFDocumentWriter.saveToBytes(mergedPdf);
  // writeFileSync is synchronous, so it needs no await
  fs.writeFileSync(filePath, mergedPdfFile);
  logger.verbose("Merged PDFs", { mergedPdfs: pdfsToMerge, filePath });
}
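
To confirm which input is the offending one rather than guessing, each file can be parsed on its own and the failures collected. The following is only a sketch: `findUnparseable` and its `parse` callback are hypothetical names, not part of pdf-lib. The callback is expected to throw on a file it cannot parse.

```javascript
const fs = require('fs');

// Hypothetical helper (not part of pdf-lib): try to parse each file
// separately and report the ones that throw, along with the error message.
function findUnparseable(filePaths, parse) {
  const failures = [];
  for (const filePath of filePaths) {
    try {
      parse(fs.readFileSync(filePath));
    } catch (err) {
      failures.push({ filePath, message: err.message });
    }
  }
  return failures;
}
```

With the pdf-lib version used above, the callback could be `bytes => PDFDocumentFactory.load(bytes)`, so a call would look like `findUnparseable(pdfsToMerge, bytes => PDFDocumentFactory.load(bytes))`.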

DanielJackson-Oslo commented Jun 5, 2019

https://www.pdfen.com/pdf-a-validator gives no errors for the file.

https://www.datalogics.com/products/pdftools/pdf-checker/ gives the following output for it, suggesting that the only error is some missing fonts?

PDF Checker 1.4.1  Copyright 2018-2019 Datalogics, Inc. All Rights Reserved

Wed Jun  5 04:31:19 2019

JSON Profile: everything.json

Input Document: DinMicrosoft-fakturaoversikt.pdf

<<=CHECKER_SUMMARY_START=>>
fonts:uses-base14fonts-not-embedded
<<=CHECKER_SUMMARY_END=>>

General Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        claims-pdfa-conformance
        contains-owner-password
        contains-signature
        damaged
        password-protected
        pdf-v2
        unable-to-open
        xfa-type

Userdata Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-annots
        contains-annots-not-for-printing
        contains-annots-not-for-viewing
        contains-annots-without-normal-appearances
        contains-embedded-files
        contains-metadata
        contains-optional-content
        contains-private-data
        contains-transparency

Fonts Results
    Errors:
        Uses Base 14 fonts not embedded in document: 
            Helvetica (1 instance)
            Helvetica-Bold (1 instance)
    Information:
        None
    Checks Completed:
        fontdescriptor-missing-capheight
        fontdescriptor-missing-fields
        uses-base14fonts-not-embedded
        uses-fonts-fully-embedded
        uses-fonts-not-embedded

Objects Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-javascript-actions
        contains-thumbnails

Cleanup Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        suboptimal-compression

Image Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        alternate-images

    Color Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        image-depth
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Grayscale Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Monochrome Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jbig2-compression



Hopding commented Jun 6, 2019

Hello @DanielJackson-Oslo!

I ran the Din_Microsoft-fakturaoversikt.pdf file you shared through qpdf (a very useful PDF validation tool). It turns out the file is technically invalid:

$ qpdf --check ~/Din_Microsoft-fakturaoversikt.pdf
checking ~/Din_Microsoft-fakturaoversikt.pdf
PDF Version: 1.3
File is not encrypted
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 1971): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 6 0, offset 121): recovered stream length: 1854
File is not linearized
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): stream keyword not followed by proper line terminator
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 15835): expected endstream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): attempting to recover stream length
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (object 9 0, offset 14325): recovered stream length: 1514
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 121): error decoding stream data for object 6 0: stream inflate: inflate: data: incorrect header check
page 1: content stream (content stream object 6 0): errors while decoding content stream
WARNING: /Users/user/Din_Microsoft-fakturaoversikt.pdf (offset 14325): error decoding stream data for object 9 0: stream inflate: inflate: data: incorrect header check
page 2: content stream (content stream object 9 0): errors while decoding content stream

Two of the stream objects contained in this file are corrupt. This is why pdf-lib throws an error when trying to parse it.
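
To illustrate the kind of recovery qpdf reports above ("attempting to recover stream length"), here is a minimal sketch of one way a parser might tolerate a missing line terminator after `stream` and an untrustworthy length: skip whatever end-of-line bytes are present, then scan forward for the `endstream` keyword. `recoverStreamRange` is a hypothetical helper, not pdf-lib's API, and this is not necessarily how qpdf or the pdf-lib fix works. It assumes the input is a Node.js `Buffer`.

```javascript
// Hypothetical helper (not part of pdf-lib): given a Buffer and the offset of
// a `stream` keyword, guess the byte range of the stream data by scanning for
// the next `endstream` keyword instead of trusting the declared /Length.
function recoverStreamRange(bytes, streamKeywordOffset) {
  const CR = 0x0d;
  const LF = 0x0a;

  // Skip past the `stream` keyword itself.
  let start = streamKeywordOffset + 'stream'.length;

  // The keyword should be followed by CRLF or LF. Tolerate a bare CR,
  // or no line terminator at all.
  if (bytes[start] === CR && bytes[start + 1] === LF) start += 2;
  else if (bytes[start] === LF || bytes[start] === CR) start += 1;

  // Heuristic: the data ends at the next `endstream`. Binary data that
  // happens to contain this keyword would be cut short.
  const end = bytes.indexOf(Buffer.from('endstream'), start);
  if (end === -1) return null;

  // Drop one end-of-line marker (LF, CRLF, or CR) before `endstream`.
  let dataEnd = end;
  if (dataEnd > start && bytes[dataEnd - 1] === LF) dataEnd -= 1;
  if (dataEnd > start && bytes[dataEnd - 1] === CR) dataEnd -= 1;

  return { start, length: dataEnd - start };
}
```

Recovering the byte range does not repair the contents. In the qpdf output above, the recovered streams still fail to inflate, so a tolerant parser would most likely have to carry such streams through as opaque bytes.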

That being said, I think it would be possible to adapt pdf-lib's parser to tolerate these specific stream errors. I'll look into this and get back to you.


Hopding commented Jun 7, 2019

I just cut version 0.6.4-rc1 of pdf-lib. It contains a fix for this issue.

You can install this prerelease with npm:

npm install pdf-lib@0.6.4-rc1

It's also available on unpkg:

Please try it out and let me know if it works for you!


Hopding commented Jun 7, 2019

@DanielJackson-Oslo I'd like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?

It looks like it might be a test billing statement? But I can't tell for sure since it's not written in English.

Hopding added a commit that referenced this issue Jun 8, 2019

DanielJackson-Oslo commented

> @DanielJackson-Oslo I'd like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?

@Hopding Feel free to use it! It's a bill for my own Office 365, presumably the same one they generate for all customers.

Thanks for the quick follow up. Looking forward to 0.6.4 releasing. How stable is the rc?


DanielJackson-Oslo commented Jun 9, 2019

@Hopding Since this isn't the first time this sort of problem has come up, I'd imagine there are hundreds, if not thousands, of different ways that PDFs can be malformed but still render in most PDF readers, and thus exist in the wild.

I don't know much about the technical nature of PDFs, but for my use case, I'd really only need pdf-lib to recognize where the PDF pages start, and then copy those into a new PDF without further validating them. (All I want to do is merge two PDFs, I don't need any control or understanding of the contents).

I see that there's a "copy" function in the library, is that what that function does? If not, could I somehow help write a "merge blindly" function?


Hopding commented Jun 9, 2019

@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it's working well for you, then there shouldn't be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes).

It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference) it is still necessary to parse some objects.

Implementing this sort of "lazy parsing" would take more than just writing a function, though. It would be necessary to modify some of pdf-lib's parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
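
As a rough illustration of where a lazy parser would begin: a PDF file ends with a `startxref` keyword followed by the byte offset of its cross-reference data, which in turn records where each object lives. A parser can read that offset and jump to the objects it needs instead of scanning the file front to back. The sketch below only finds that offset. `findStartXrefOffset` is a hypothetical helper, not part of pdf-lib, and it assumes the input is a Node.js `Buffer` with the keyword somewhere in the last 1024 bytes.

```javascript
// Hypothetical helper (not part of pdf-lib): read the byte offset that
// follows the last `startxref` keyword near the end of a PDF file.
function findStartXrefOffset(bytes) {
  // Only inspect the tail of the file; the 1024-byte window is an assumption.
  const tail = bytes.slice(Math.max(0, bytes.length - 1024)).toString('latin1');

  // Use the last occurrence, since incrementally updated files
  // contain more than one `startxref`.
  const idx = tail.lastIndexOf('startxref');
  if (idx === -1) return null;

  const match = /^startxref\s+(\d+)/.exec(tail.slice(idx));
  return match ? Number(match[1]) : null;
}
```

This trades one failure mode for another: malformed files can also carry wrong cross-reference offsets, so a lazy parser would probably still need a full scan as a fallback.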

If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!

DanielJackson-Oslo commented

@Hopding 0.6.4-rc1 fixes the issue on my end! 🎉 Should I close this thread?

> If this is something you'd be interested in working on, I'd be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you'd like to continue the discussion further!

I'll read up a bit on PDF structure, and open a new issue for it. Thank you so much for the active help!
