-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Take the dictionary, and not just the image data, into account when caching inline images (issue 9398) #9420
Take the dictionary, and not just the image data, into account when caching inline images (issue 9398) #9420
Conversation
Do we need the test case if issue 8823 seems to test it as well? |
I suppose not; it's been removed now :-) |
…aching inline images (issue 9398) The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces. There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary. Initially I tried using `MurmurHash3_64` to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be *way* too slow in practice, especially for PDF files with a huge number of inline images; in particular issue 2618 would regresses quite badly with this solution. The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it: If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of *not* caching given that too aggressive caching can easily lead to rendering bugs. One small, but somewhat annoying, complication is that by the time `Parser.makeInlineImage` is called, we no longer know the *exact* stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the current `Lexer` instance.[1] With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue 2618. Fixes 9398; also improves/fixes the `issue8823` reference test. --- [1] Obviously I'd have preferred if this patch could be limited to `Parser.makeInlineImage`, without the need for this "hack", but I'm not sure what that'd look like here.
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/f3579ea43512af0/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/d7f2b931b25c49c/output.txt |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/f3579ea43512af0/output.txt Total script time: 24.36 mins
Image differences available at: http://54.215.176.217:8877/f3579ea43512af0/reftest-analyzer.html#web=eq.log |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/d7f2b931b25c49c/output.txt Total script time: 39.26 mins
Image differences available at: http://54.67.70.0:8877/d7f2b931b25c49c/reftest-analyzer.html#web=eq.log |
@brendandahl Ping; your comment in #9420 (comment) has been addressed, any chance that you have time to review this again? |
/botio makeref |
From: Bot.io (Windows)ReceivedCommand cmd_makeref from @brendandahl received. Current queue size: 0 Live output at: http://54.215.176.217:8877/78ad8f8a3034b43/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_makeref from @brendandahl received. Current queue size: 0 Live output at: http://54.67.70.0:8877/7b072baebe204ca/output.txt |
From: Bot.io (Windows)SuccessFull output at http://54.215.176.217:8877/78ad8f8a3034b43/output.txt Total script time: 22.04 mins
|
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/7b072baebe204ca/output.txt Total script time: 37.13 mins
|
Nice work! |
Take the dictionary, and not just the image data, into account when caching inline images (issue 9398)
The reason for the bug is that we're only computing a checksum of the image data itself, but completely ignore the inline dictionary. The latter is important, since in practice it's not uncommon for inline images to be identical but use e.g. different ColourSpaces.
There's obviously a couple of different ways that we could compute a hash/checksum of the dictionary.
Initially I tried using
MurmurHash3_64
to compute a hash of the keys/values in the dictionary. Unfortunately this approach turned out to be way too slow in practice, especially for PDF files with a huge number of inline images; in particular issue #2618 would regresses quite badly with this solution.The solution that is instead implemented in this patch, is to compute a checksum of the dictionary contents. While this is a much simpler, not to mention a lot more efficient, solution there's one drawback associated with it:
If the contents of inline image dictionaries are ordered differently, they will not be considered equal with this approach which could thus lead to failures to cache repeated inline images. In practice this doesn't seem to be a problem in any of the PDF files I've tested, and generally I'd rather err on the side of not caching given that too aggressive caching can easily lead to rendering bugs.
One small, but somewhat annoying, complication is that by the time
Parser.makeInlineImage
is called, we no longer know the exact stream position where the inline image dictionary starts. Having access to that information is crucial here, and the easiest solution I could come up with is to track this in the currentLexer
instance.[1]With the patch, we're thus able to fix the referenced issues without incurring large regressions in problematic cases such as issue #2618.
Fixes #9398; also improves/fixes the
issue8823
reference test.[1] Obviously I'd have preferred if this patch could be limited to
Parser.makeInlineImage
, without the need for this "hack", but I'm not sure what that'd look like here.