
Use MurmurHash3_64 when computing the cacheKey for inline images (bug 1799927) #15677

Closed
wants to merge 3 commits from the bug-1799927 branch

Conversation

@Snuffleupagus (Collaborator) commented Nov 9, 2022

*Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However, I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---

[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue #2618 shows at least a 10 percent regression. Edit: My first attempt at a patch showed a 20 percent regression, which definitely seemed too much, but some additional clean-up led to the current version that feels more acceptable.
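For illustration, the gist of the change is roughly the following (a minimal sketch, not the actual patch; the import path, the function name and the `inlineBytes` variable are assumptions):

```js
// Minimal sketch, assuming pdf.js' MurmurHash3_64 class with its `update` and
// `hexdigest` methods; `computeInlineImageCacheKey` is an illustrative name,
// not the code in this PR.
import { MurmurHash3_64 } from "./murmurhash3.js";

function computeInlineImageCacheKey(inlineBytes) {
  const hash = new MurmurHash3_64();
  // `inlineBytes` covers the inline-image dictionary *and* its data, so the
  // digest distinguishes images that a simple Adler-32 checksum may conflate.
  hash.update(inlineBytes);
  return hash.hexdigest();
}
```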

Commit messages for the two follow-up commits:

- This helps to *partially* offset the performance regression from the previous commit, when testing e.g. with the PDF document from issue 2618.
  Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).
- ….hexdigest`: These variables are left-over from the initial implementation, back when `String.prototype.padStart` didn't exist and we thus had to pad manually (using a loop).
@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/3283059d682b28b/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt

Total script time: 25.58 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 11

Image differences available at: http://54.241.84.105:8877/14ea6c70c2e23f1/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/3283059d682b28b/output.txt

Total script time: 34.73 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 1

Image differences available at: http://54.193.163.58:8877/3283059d682b28b/reftest-analyzer.html#web=eq.log

- // ... and fetch the bytes of the *entire* dictionary.
- const dictBytes = stream.getBytes(dictLength);
+ // ... and fetch the bytes of the dictionary *and* the inline image.
+ const inlineBytes = stream.getBytes(dictLength + length);
Contributor:

Good catch!

// Finally, don't forget to reset the stream position.
stream.pos = initialStreamPos;

- cacheKey = computeAdler32(imageBytes) + "_" + computeAdler32(dictBytes);
+ const hash = new MurmurHash3_64();
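Presumably the new code then feeds those bytes into the hash and uses its digest as the key, along these lines (a sketch inferred from the `hash.hexdigest()` call quoted further down in this thread, not the verbatim diff):

hash.update(inlineBytes);
cacheKey = hash.hexdigest();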
Contributor:

When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:

const MAPPING = [...Array(256).keys()].map(n =>
  String.fromCharCode(n)
);
function hash(bytes) {
  return bytes.reduce((a, x) => a + MAPPING[x], "");
}
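
(For instance, `hash(new Uint8Array([72, 101, 108, 108, 111]))` would yield `"Hello"`, so the resulting string can be used directly as a `Map` key.)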

If the size is greater than 64 we could use two caches:

  • cache: key == hash(bytes.slice(0, 64)); value: another cache: key == murmur; value: stream

This way we compute murmur only when there is a diff on the first 64 bytes.

Anyway, whatever the caching strategy is, I think we must check from stream equality at some point else we could have some other hash collisions in the future.

To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.

Collaborator Author:

> When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:

Honestly I kind of prefer the simplicity of the current patch, with just the one code-path to worry about.
For most documents this simply doesn't matter performance-wise, and then the second commit improves things for the one really bad case we know about.

> Anyway, whatever the caching strategy is, I think we must check from stream equality at some point else we could have some other hash collisions in the future.

I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.

> To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.

Given that the inline image cache is bound to the current `Parser` instance, which is limited to e.g. each `getOperatorList` invocation, this should already be going away when parsing is done; unless I'm misunderstanding what you mean here.

Contributor:

> I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.

I don't think so. I just changed `cacheKey = hash.hexdigest();` to `cacheKey = "123";` to create some collisions and all the images are the same (my sentence above should have read "... check *for* stream equality ...", not "from").
With a better hasher you just decrease the probability of a collision, but you don't avoid them.
That's why I proposed doing it this way for small images: we have a hash which is unique, hence we don't need to check for equality (the key equality check is done on the native code's side).

Collaborator Author:

Sorry, but I don't really understand what you mean here.

Please keep in mind that since `Stream`s are never cached on the `XRef` instance, for a number of reasons[1], running e.g. `makeSubStream` will always create a new `Stream`-instance and you thus can't directly compare them.
Hence we need to compute some kind of hash from the data relevant to each inline image.

[1] These include: memory usage, since `DecodeStream`s can be very large once decoded, and the fact that concurrently accessing the same `Stream`-instance from e.g. multiple `getOperatorList` invocations wouldn't work (given that parsing is async).

Contributor:

By "comparing stream" I meant comparing the "comparing inlineBytes", sorry about not having be more clear.
We could have two different inlineBytes with the same murmur or whatever hash and in this case it'll lead to render the same image for both, it's what it's happening when replacing hash.digest() with "123".

Contributor @calixteman commented Nov 10, 2022:

I agree with you that there is a very low probability of the same hash for two different images.
But if, in 1 day, 1 month, 1 year, 1 decade, ... someone files a bug because the rendering is not correct, what will we do? Find another way to hash which fixes the bug and doesn't introduce any regressions? That's fine, but I'll write exactly the same comment as this one.
We're lucky because someone noticed the bug, but I'm pretty sure that some people read some PDFs without noticing it, because they just don't know what the correct rendering is, and hence they're maybe reading some wrong information. As said, the probability is likely very low, but we've got a ton of users/PDFs in the world, so it's maybe not negligible.

So I'm fine with taking this patch, because it's fixing something and because it's an improvement (decreasing the probability of a collision is an improvement), but we really must write a follow-up to fix the real issue (which is only about checking byte equality).

Collaborator Author:

Do you think it would suffice to only check the bytes for "short" images, since you used 64 in your example above?
If so, the overall performance impact of better checks would probably be considerably lower in general.

Contributor:

Ideally we should check equality for all images we cache: the hash would just be there to avoid useless comparisons.
I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128, ... (we can make some stats to try to figure out a good number from the PDFs in our test suite). Then generate a string from the `inlineBytes` (the shorter the better) and use this string as a key. This way the hashing stuff and the key comparison will be done in the JS engine, and we'll only have to pay for the byte-to-string conversion and the memory overhead.

We can even keep 1000 (I'd prefer 1024 ;)) and see how it behaves IRL, and just verify that we correctly get rid of the `Parser` object to be sure we don't keep some memory around. I'd tend to think that having multiple identical 1K images should be pretty rare.
We could have memory inflation in case we've got a lot of unique small images: to avoid that we can limit the cache size, keep some stats on how many dups an image has, and then when we reach the cache limit just remove an entry with no dups.
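
A rough sketch of that direction (illustrative only; `createImage`, the `Map`-based `inlineImageCache` and keeping `MAX_LENGTH_TO_CACHE` at 1000 are assumptions, not the actual follow-up patch):

```js
// Rough sketch: `createImage` is a hypothetical callback standing in for the
// existing inline-image parsing code.
const MAX_LENGTH_TO_CACHE = 1000;
const inlineImageCache = new Map();

function getOrCreateInlineImage(inlineBytes, createImage) {
  if (inlineBytes.length > MAX_LENGTH_TO_CACHE) {
    return createImage(inlineBytes); // Too large, don't attempt to cache.
  }
  // Stringify the bytes so the JS engine itself performs an exact key
  // comparison, which rules out hash collisions entirely.
  let key = "";
  for (const byte of inlineBytes) {
    key += String.fromCharCode(byte);
  }
  let image = inlineImageCache.get(key);
  if (!image) {
    image = createImage(inlineBytes);
    inlineImageCache.set(key, image);
  }
  return image;
}
```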

Collaborator Author:

> I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128,

Unfortunately those values are way too low in practice, and could render the caching pointless.
Note that according to the PDF specification, inline images are allowed to be up to 4KB in size.

I've got another patch which simply caches using the stringified inline image data, and it seems to perform well enough in practice with the PDF documents I've tested.

> We could have memory inflation in case we've got a lot of unique small images: to avoid that we can limit the cache size, keep some stats on how many dups an image has, and then when we reach the cache limit just remove an entry with no dups.

Do you consider this a blocker, or can we implement that if the need ever arises (e.g. a report about bad memory usage)?
As-is the new approach should at least prevent any rendering errors because of hash collisions.

Contributor @calixteman commented Nov 10, 2022:

No, this isn't a blocker, and yes, we'll wait for either a bug report or until we ourselves notice something wrong when profiling a PDF. Or until one of us has a good idea for implementing a better caching strategy ...

@Snuffleupagus deleted the bug-1799927 branch November 10, 2022 21:55