Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Take the dictionary, and not just the image data, into account when caching inline images (issue 9398) #9420

Merged
merged 1 commit into from
Feb 13, 2018
Merged

Conversation

Snuffleupagus
Copy link
Collaborator

The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces.

There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using MurmurHash3_64 to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be way too slow in practice, especially for PDF files with a huge number of inline images; in particular issue #2618 would regresses quite badly with this solution.

The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it:
If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of not caching given that too aggressive caching can easily lead to rendering bugs.
One small, but somewhat annoying, complication is that by the time Parser.makeInlineImage is called, we no longer know the exact stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current Lexer instance.[1]

With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue #2618.

Fixes #9398; also improves/fixes the issue8823 reference test.


[1] Obviously I'd have preferred if this patch could be limited to Parser.makeInlineImage, without the need for this "hack", but I'm not sure what that'd look like here.

@mozilla mozilla deleted a comment from pdfjsbot Feb 1, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 1, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 1, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 1, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 3, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 3, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 3, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 3, 2018
@brendandahl
Copy link
Contributor

Do we need the test case if issue 8823 seems to test it as well?

@Snuffleupagus
Copy link
Collaborator Author

Do we need the test case if issue 8823 seems to test it as well?

I suppose not; it's been removed now :-)

@mozilla mozilla deleted a comment from pdfjsbot Feb 7, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 7, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 7, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 7, 2018
…aching inline images (issue 9398)

The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces.

There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be *way* too slow in practice, especially for PDF files with a huge number of inline images; in particular issue 2618 would regresses quite badly with this solution.

The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it:
If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of *not* caching given that too aggressive caching can easily lead to rendering bugs.

One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the *exact* stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current `Lexer` instance.[1]

With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue 2618.

Fixes 9398; also improves/fixes the `issue8823` reference test.

---

[1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.
@mozilla mozilla deleted a comment from pdfjsbot Feb 12, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 12, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 12, 2018
@mozilla mozilla deleted a comment from pdfjsbot Feb 12, 2018
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f3579ea43512af0/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/d7f2b931b25c49c/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/f3579ea43512af0/output.txt

Total script time: 24.36 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/f3579ea43512af0/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/d7f2b931b25c49c/output.txt

Total script time: 39.26 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/d7f2b931b25c49c/reftest-analyzer.html#web=eq.log

@Snuffleupagus
Copy link
Collaborator Author

@brendandahl Ping; your comment in #9420 (comment) has been addressed, any chance that you have time to review this again?

@brendandahl
Copy link
Contributor

/botio makeref

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_makeref from @brendandahl received. Current queue size: 0

Live output at: http://54.215.176.217:8877/78ad8f8a3034b43/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @brendandahl received. Current queue size: 0

Live output at: http://54.67.70.0:8877/7b072baebe204ca/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/78ad8f8a3034b43/output.txt

Total script time: 22.04 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/7b072baebe204ca/output.txt

Total script time: 37.13 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@timvandermeij timvandermeij merged commit 2e780d4 into mozilla:master Feb 13, 2018
@timvandermeij
Copy link
Contributor

Nice work!

@Snuffleupagus Snuffleupagus deleted the makeInlineImage-dict branch February 14, 2018 09:17
movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018
Take the dictionary, and not just the image data, into account when caching inline images (issue 9398)
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in issue #8496 with canvas backend
4 participants