Take the dictionary, and not just the image data, into account when caching inline images (issue 9398) #9420

Snuffleupagus · 2018-01-30T13:15:43Z

The reason for the bug is that we only compute a checksum of the image data itself and completely ignore the inline image dictionary. The latter is important, since in practice it's not uncommon for inline images to have identical data but use e.g. different ColourSpaces.

There are obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be *way* too slow in practice, especially for PDF files with a huge number of inline images; in particular issue #2618 would regress quite badly with this solution.

The solution that is instead implemented in this patch is to compute a checksum of the dictionary contents as they appear in the stream. While this is a much simpler, not to mention a lot more efficient, solution, there's one drawback associated with it:
If the contents of two inline image dictionaries are ordered differently, they will not be considered equal with this approach, which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of *not* caching, given that too aggressive caching can easily lead to rendering bugs.
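
To illustrate the idea and its drawback, here is a minimal standalone sketch; it is not the code from the patch. It assumes Adler-32 as the checksum, and `adler32`, `inlineImageCacheKey`, and the sample dictionaries are hypothetical names/values chosen for the example.

```javascript
// Hypothetical helper: Adler-32 checksum over a byte array.
function adler32(bytes) {
  const MOD = 65521;
  let a = 1;
  let b = 0;
  for (const byte of bytes) {
    a = (a + byte) % MOD;
    b = (b + a) % MOD;
  }
  // Combine both halves into one unsigned 32-bit value.
  return ((b << 16) | a) >>> 0;
}

const encode = (str) => new TextEncoder().encode(str);

// Hypothetical cache key: checksum of the image data *and* of the raw
// dictionary bytes, so that neither is ignored.
function inlineImageCacheKey(dictBytes, imageBytes) {
  return `${adler32(imageBytes)}_${adler32(dictBytes)}`;
}

// Identical image data, used with different (illustrative) dictionaries.
const imageData = new Uint8Array([0x00, 0x7f, 0xff, 0x7f]);
const grayDict = encode("/W 2 /H 2 /BPC 8 /CS /G");
const otherCsDict = encode("/W 2 /H 2 /BPC 8 /CS /Cs1");
// Same entries as grayDict, but written in a different order.
const reorderedGrayDict = encode("/CS /G /W 2 /H 2 /BPC 8");

// A key based on the image data alone cannot tell the images apart.
const dataOnlyKey = String(adler32(imageData));

const keyGray = inlineImageCacheKey(grayDict, imageData);
const keyOtherCs = inlineImageCacheKey(otherCsDict, imageData);
const keyReordered = inlineImageCacheKey(reorderedGrayDict, imageData);

console.log(dataOnlyKey, keyGray, keyOtherCs, keyReordered);
```

With the dictionary included, the two colour spaces get different keys. The reordered dictionary also gets a different key even though it describes the same image, which is the missed-caching case described above.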
One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the *exact* stream position where the inline image dictionary starts. Having access to that information is crucial for computing the checksum, and the easiest solution I could come up with is to track this position in the current `Lexer` instance.[1]
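
Roughly, the position tracking could look like the following simplified sketch. `ByteStream`, `TinyLexer`, `inlineDictStart`, and `readInlineDictBytes` are hypothetical stand-ins for illustration, not the actual `Lexer`/`Parser` code, and the tokenizer only splits on spaces.

```javascript
// Hypothetical, simplified byte stream with a read position.
class ByteStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0;
  }

  getBytes(length) {
    const chunk = this.bytes.subarray(this.pos, this.pos + length);
    this.pos += chunk.length;
    return chunk;
  }
}

// Hypothetical, simplified lexer that remembers where the inline image
// dictionary begins, i.e. the stream position right after the `BI` operator.
class TinyLexer {
  constructor(stream) {
    this.stream = stream;
    this.inlineDictStart = -1;
  }

  nextToken() {
    const stream = this.stream;
    const bytes = stream.bytes;
    // Skip separating spaces.
    while (stream.pos < bytes.length && bytes[stream.pos] === 0x20) {
      stream.pos++;
    }
    const start = stream.pos;
    while (stream.pos < bytes.length && bytes[stream.pos] !== 0x20) {
      stream.pos++;
    }
    const token = String.fromCharCode(...bytes.subarray(start, stream.pos));
    if (token === "BI") {
      this.inlineDictStart = stream.pos;
    }
    return token; // An empty string signals the end of the stream.
  }
}

// Consume tokens up to and including `ID`, then re-read the raw dictionary
// bytes using the position recorded by the lexer.
function readInlineDictBytes(lexer) {
  const stream = lexer.stream;
  let token;
  do {
    token = lexer.nextToken();
  } while (token !== "ID" && token !== "");

  if (token !== "ID" || lexer.inlineDictStart === -1) {
    return null; // No complete inline image dictionary was found.
  }
  const endPos = stream.pos;
  stream.pos = lexer.inlineDictStart;
  const dictBytes = stream.getBytes(endPos - lexer.inlineDictStart);
  stream.pos = endPos; // Restore the position so that parsing can continue.
  return dictBytes;
}

const source = "q BI /W 4 /H 1 /BPC 8 /CS /G ID abcd EI Q";
const lexer = new TinyLexer(new ByteStream(new TextEncoder().encode(source)));
const dictBytes = readInlineDictBytes(lexer);
const dictText = new TextDecoder().decode(dictBytes);
console.log(JSON.stringify(dictText));
```

The important detail is that the stream position is restored after the dictionary bytes have been fetched, so the image data that follows is still read from the right place.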

With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue #2618.

Fixes #9398; also improves/fixes the `issue8823` reference test.

[1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.

brendandahl · 2018-02-07T01:18:09Z

Do we need the test case if issue 8823 seems to test it as well?

Snuffleupagus · 2018-02-07T09:11:46Z

> Do we need the test case if issue 8823 seems to test it as well?

I suppose not; it's been removed now :-)

Snuffleupagus · 2018-02-12T22:13:05Z

/botio test

pdfjsbot · 2018-02-12T22:13:06Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f3579ea43512af0/output.txt

pdfjsbot · 2018-02-12T22:13:06Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/d7f2b931b25c49c/output.txt

pdfjsbot · 2018-02-12T22:37:28Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/f3579ea43512af0/output.txt

Total script time: 24.36 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/f3579ea43512af0/reftest-analyzer.html#web=eq.log

pdfjsbot · 2018-02-12T22:52:22Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/d7f2b931b25c49c/output.txt

Total script time: 39.26 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/d7f2b931b25c49c/reftest-analyzer.html#web=eq.log

Snuffleupagus · 2018-02-13T20:46:52Z

@brendandahl Ping; your earlier review comment on this PR has been addressed. Any chance that you have time to review this again?

brendandahl · 2018-02-13T22:14:54Z

/botio makeref

pdfjsbot · 2018-02-13T22:14:55Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @brendandahl received. Current queue size: 0

Live output at: http://54.215.176.217:8877/78ad8f8a3034b43/output.txt

pdfjsbot · 2018-02-13T22:14:55Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @brendandahl received. Current queue size: 0

Live output at: http://54.67.70.0:8877/7b072baebe204ca/output.txt

pdfjsbot · 2018-02-13T22:36:58Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/78ad8f8a3034b43/output.txt

Total script time: 22.04 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2018-02-13T22:52:04Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/7b072baebe204ca/output.txt

Total script time: 37.13 mins

Lint: Passed
Make references: Passed
Check references: Passed

timvandermeij · 2018-02-13T23:05:48Z

Nice work!

Snuffleupagus added the core label Jan 30, 2018

mozilla deleted a comment from pdfjsbot Feb 1, 2018

mozilla deleted a comment from pdfjsbot Feb 3, 2018

Snuffleupagus requested a review from brendandahl February 6, 2018 19:37

mozilla deleted a comment from pdfjsbot Feb 7, 2018

mozilla deleted a comment from pdfjsbot Feb 12, 2018

brendandahl approved these changes Feb 13, 2018

timvandermeij merged commit 2e780d4 into mozilla:master Feb 13, 2018

Snuffleupagus deleted the makeInlineImage-dict branch February 14, 2018 09:17

movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018

Merge pull request mozilla#9420 from Snuffleupagus/makeInlineImage-dict

a029596
