
Use MurmurHash3_64 when computing the cacheKey for inline images (bug 1799927) #15677

Closed
wants to merge 3 commits from the bug-1799927 branch

Conversation

@Snuffleupagus (Collaborator) commented Nov 9, 2022

*Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However, I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---

[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue #2618 shows at least a 10 percent regression. Edit: My first attempt at a patch showed a 20 percent regression, which definitely seemed too much, but some additional clean-up led to the current version that feels more acceptable.
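For illustration, the gist of the change is roughly the following (a minimal sketch, not the actual patch; the import path, the function name and the `inlineBytes` variable are assumptions):

```js
// Minimal sketch, assuming pdf.js' MurmurHash3_64 class with its `update` and
// `hexdigest` methods; `computeInlineImageCacheKey` is an illustrative name,
// not the code in this PR.
import { MurmurHash3_64 } from "./murmurhash3.js";

function computeInlineImageCacheKey(inlineBytes) {
  const hash = new MurmurHash3_64();
  // `inlineBytes` covers the inline-image dictionary *and* its data, so the
  // digest distinguishes images that a simple Adler-32 checksum may conflate.
  hash.update(inlineBytes);
  return hash.hexdigest();
}
```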

Commit messages for the two follow-up commits:

- This helps to *partially* offset the performance regression from the previous commit, when testing e.g. with the PDF document from issue 2618.
  Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).
- ….hexdigest`: These variables are left-over from the initial implementation, back when `String.prototype.padStart` didn't exist and we thus had to pad manually (using a loop).
@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/3283059d682b28b/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt

Total script time: 25.58 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 11

Image differences available at: http://54.241.84.105:8877/14ea6c70c2e23f1/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/3283059d682b28b/output.txt

Total script time: 34.73 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 1

Image differences available at: http://54.193.163.58:8877/3283059d682b28b/reftest-analyzer.html#web=eq.log

- // ... and fetch the bytes of the *entire* dictionary.
- const dictBytes = stream.getBytes(dictLength);
+ // ... and fetch the bytes of the dictionary *and* the inline image.
+ const inlineBytes = stream.getBytes(dictLength + length);
Contributor:

Good catch!

// Finally, don't forget to reset the stream position.
stream.pos = initialStreamPos;

- cacheKey = computeAdler32(imageBytes) + "_" + computeAdler32(dictBytes);
+ const hash = new MurmurHash3_64();
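Presumably the new code then feeds those bytes into the hash and uses its digest as the key, along these lines (a sketch inferred from the `hash.hexdigest()` call quoted further down in this thread, not the verbatim diff):

hash.update(inlineBytes);
cacheKey = hash.hexdigest();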
Contributor:

When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:

const MAPPING = [...Array(256).keys()].map(n =>
  String.fromCharCode(n)
);
function hash(bytes) {
  return bytes.reduce((a, x) => a + MAPPING[x], "");
}
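
(For instance, `hash(new Uint8Array([72, 101, 108, 108, 111]))` would yield `"Hello"`, so the resulting string can be used directly as a `Map` key.)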

If the size is greater than 64 we could use two caches:

  • cache: key == hash(bytes.slice(0, 64)); value: another cache: key == murmur; value: stream

This way we compute murmur only when there is a diff on the first 64 bytes.

Anyway, whatever the caching strategy is, I think we must check from stream equality at some point else we could have some other hash collisions in the future.

To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.

Collaborator Author:

> When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:

Honestly I kind of prefer the simplicity of the current patch, with just the one code-path to worry about.
For most documents this simply doesn't matter performance-wise, and then the second commit improves things for the one really bad case we know about.

> Anyway, whatever the caching strategy is, I think we must check from stream equality at some point else we could have some other hash collisions in the future.

I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.

> To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.

Given that the inline image cache is bound to the current `Parser` instance, which is limited to e.g. each `getOperatorList` invocation, this should already be going away when parsing is done; unless I'm misunderstanding what you mean here.

Contributor:

> I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.

I don't think so. I just changed `cacheKey = hash.hexdigest();` to `cacheKey = "123";` to create some collisions and all the images are the same (my sentence above should have read "... check *for* stream equality ...", not "from").
With a better hasher you just decrease the probability of a collision, but you don't avoid them.
That's why I proposed doing it this way for small images: we have a hash which is unique, hence we don't need to check for equality (the key equality check is done on the native code's side).

Collaborator Author:

Sorry, but I don't really understand what you mean here.

Please keep in mind that since `Stream`s are never cached on the `XRef` instance, for a number of reasons[1], running e.g. `makeSubStream` will always create a new `Stream`-instance and you thus can't directly compare them.
Hence we need to compute some kind of hash from the data relevant to each inline image.

[1] These include: memory usage, since `DecodeStream`s can be very large once decoded, and the fact that concurrently accessing the same `Stream`-instance from e.g. multiple `getOperatorList` invocations wouldn't work (given that parsing is async).

Contributor:

By "comparing stream" I meant comparing the "comparing inlineBytes", sorry about not having be more clear.
We could have two different inlineBytes with the same murmur or whatever hash and in this case it'll lead to render the same image for both, it's what it's happening when replacing hash.digest() with "123".

Contributor @calixteman commented Nov 10, 2022:

I agree with you that there is a very low probability of the same hash for two different images.
But if, in 1 day, 1 month, 1 year, 1 decade, ... someone files a bug because the rendering is not correct, what will we do? Find another way to hash which fixes the bug and doesn't introduce any regressions? That's fine, but I'll write exactly the same comment as this one.
We're lucky because someone noticed the bug, but I'm pretty sure that some people read some PDFs without noticing it, because they just don't know what the correct rendering is, and hence they're maybe reading some wrong information. As said, the probability is likely very low, but we've got a ton of users/PDFs in the world, so it's maybe not negligible.

So I'm fine with taking this patch, because it's fixing something and because it's an improvement (decreasing the probability of a collision is an improvement), but we really must write a follow-up to fix the real issue (which is only about checking byte equality).

Collaborator Author:

Do you think it would suffice to only check the bytes for "short" images, since you used 64 in your example above?
If so, the overall performance impact of better checks would probably be considerably lower in general.

Contributor:

Ideally we should check equality for all images we cache: the hash would just be there to avoid useless comparisons.
I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128, ... (we can make some stats to try to figure out a good number from the PDFs in our test suite). Then generate a string from the `inlineBytes` (the shorter the better) and use this string as a key. This way the hashing stuff and the key comparison will be done in the JS engine, and we'll only have to pay for the byte-to-string conversion and the memory overhead.

We can even keep 1000 (I'd prefer 1024 ;)) and see how it behaves IRL, and just verify that we correctly get rid of the `Parser` object to be sure we don't keep some memory around. I'd tend to think that having multiple identical 1K images should be pretty rare.
We could have memory inflation in case we've got a lot of unique small images: to avoid that we can limit the cache size, keep some stats on how many dups an image has, and then when we reach the cache limit just remove an entry with no dups.
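
A rough sketch of that direction (illustrative only; `createImage`, the `Map`-based `inlineImageCache` and keeping `MAX_LENGTH_TO_CACHE` at 1000 are assumptions, not the actual follow-up patch):

```js
// Rough sketch: `createImage` is a hypothetical callback standing in for the
// existing inline-image parsing code.
const MAX_LENGTH_TO_CACHE = 1000;
const inlineImageCache = new Map();

function getOrCreateInlineImage(inlineBytes, createImage) {
  if (inlineBytes.length > MAX_LENGTH_TO_CACHE) {
    return createImage(inlineBytes); // Too large, don't attempt to cache.
  }
  // Stringify the bytes so the JS engine itself performs an exact key
  // comparison, which rules out hash collisions entirely.
  let key = "";
  for (const byte of inlineBytes) {
    key += String.fromCharCode(byte);
  }
  let image = inlineImageCache.get(key);
  if (!image) {
    image = createImage(inlineBytes);
    inlineImageCache.set(key, image);
  }
  return image;
}
```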

Collaborator Author:

> I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128,

Unfortunately those values are way too low in practice, and could render the caching pointless.
Note that according to the PDF specification, inline images are allowed to be up to 4KB in size.

I've got another patch which simply caches using the stringified inline image data, and it seems to perform well enough in practice with the PDF documents I've tested.

> We could have memory inflation in case we've got a lot of unique small images: to avoid that we can limit the cache size, keep some stats on how many dups an image has, and then when we reach the cache limit just remove an entry with no dups.

Do you consider this a blocker, or can we implement that if the need ever arises (e.g. a report about bad memory usage)?
As-is the new approach should at least prevent any rendering errors because of hash collisions.

Contributor @calixteman commented Nov 10, 2022:

No, this isn't a blocker, and yes, we'll wait for either a bug report or until we ourselves notice something wrong when profiling a PDF. Or until one of us has a good idea for implementing a better caching strategy ...

@Snuffleupagus deleted the bug-1799927 branch November 10, 2022 21:55