Use `MurmurHash3_64` when computing the cacheKey for inline images (bug 1799927) #15677
Conversation
*Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.

To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---

[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue #2618 shows at least a 10 percent regression. *Edit:* My first attempt at a patch showed a 20 percent regression, which definitely seemed too much, but some additional clean-up led to the current version that feels more acceptable.
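For context, the new key computation follows this general pattern (a simplified sketch rather than the verbatim patch; `computeCacheKey` is an illustrative name, and `inlineBytes` stands for the dictionary plus image data fetched from the stream):

```js
// Simplified sketch of the MurmurHash3_64-based cache key (not the exact
// pdf.js code). MurmurHash3_64 is the existing helper class from
// src/shared/murmurhash3.js; `inlineBytes` covers the inline-image
// dictionary *and* its data.
import { MurmurHash3_64 } from "../shared/murmurhash3.js";

function computeCacheKey(inlineBytes) {
  const hash = new MurmurHash3_64();
  hash.update(inlineBytes);
  return hash.hexdigest();
}
```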
05ea8b2 to 0eefc4a (Compare)
This helps to *partially* offset the performance regression from the previous commit, when testing e.g. with the PDF document from issue 2618. Given that we no longer create `Stream`-instances unconditionally, we also don't need `Dict`-instances for cached inline images (since we only access the filter).
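To illustrate the idea in this commit (a rough sketch with hypothetical names, not the actual `Parser` code): on a cache hit the previously parsed image is returned directly, so no `Stream` or `Dict` needs to be constructed for it.

```js
// Rough sketch only; names such as `imageCache` and `findCachedImage` are
// illustrative, not the actual pdf.js Parser implementation.
function findCachedImage(imageCache, cacheKey, stream, skipLength) {
  const cacheEntry = imageCache[cacheKey];
  if (cacheEntry !== undefined) {
    // Cache hit: just advance the stream past the inline-image data and
    // re-use the cached result, with no new Stream/Dict instances created.
    stream.pos += skipLength;
    return cacheEntry;
  }
  // Cache miss: the caller creates the Stream/Dict as before, parses the
  // image, and stores the result under `cacheKey`.
  return null;
}
```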
….hexdigest`
These variables are left-over from the initial implementation, back when `String.prototype.padStart` didn't exist and we thus had to pad manually (using a loop).
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/3283059d682b28b/output.txt

From: Bot.io (Linux m4)
Failed. Full output at http://54.241.84.105:8877/14ea6c70c2e23f1/output.txt
Total script time: 25.58 mins
Image differences available at: http://54.241.84.105:8877/14ea6c70c2e23f1/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Full output at http://54.193.163.58:8877/3283059d682b28b/output.txt
Total script time: 34.73 mins
Image differences available at: http://54.193.163.58:8877/3283059d682b28b/reftest-analyzer.html#web=eq.log
```diff
-// ... and fetch the bytes of the *entire* dictionary.
-const dictBytes = stream.getBytes(dictLength);
+// ... and fetch the bytes of the dictionary *and* the inline image.
+const inlineBytes = stream.getBytes(dictLength + length);
```
Good catch!
```diff
 // Finally, don't forget to reset the stream position.
 stream.pos = initialStreamPos;

-cacheKey = computeAdler32(imageBytes) + "_" + computeAdler32(dictBytes);
+const hash = new MurmurHash3_64();
```
When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:
```js
const MAPPING = [...Array(256).keys()].map(n => String.fromCharCode(n));

function hash(bytes) {
  return bytes.reduce((a, x) => a + MAPPING[x], "");
}
```
If the size is greater than 64 we could use two caches:

- cache: key == hash(bytes.slice(0, 64)); value: another cache: key == murmur; value: stream

This way we compute murmur only when there is a diff on the first 64 bytes (see the sketch below).
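For illustration, here is one possible reading of that two-cache idea (a sketch with hypothetical names such as `lookupImage`; the full digest is only computed once two images actually share the same 64-byte prefix):

```js
// Hypothetical sketch of the proposed two-level cache. Outer key: the raw
// string of the first 64 bytes; inner key: a MurmurHash3_64 digest of the
// full data, computed lazily on the first prefix collision.
import { MurmurHash3_64 } from "../shared/murmurhash3.js";

const PREFIX_LENGTH = 64;
const outerCache = new Map(); // prefix string -> { pending, byDigest }

function murmurOf(bytes) {
  const hash = new MurmurHash3_64();
  hash.update(bytes);
  return hash.hexdigest();
}

function prefixKey(bytes) {
  let key = "";
  for (const b of bytes.subarray(0, PREFIX_LENGTH)) {
    key += String.fromCharCode(b);
  }
  return key;
}

function lookupImage(inlineBytes, createImage) {
  const prefix = prefixKey(inlineBytes);
  let entry = outerCache.get(prefix);
  if (!entry) {
    // First image with this prefix: defer the murmur computation.
    const image = createImage(inlineBytes);
    outerCache.set(prefix, {
      pending: { bytes: inlineBytes, image },
      byDigest: new Map(),
    });
    return image;
  }
  if (entry.pending) {
    // A second image shares the prefix, so full digests are needed now.
    entry.byDigest.set(murmurOf(entry.pending.bytes), entry.pending.image);
    entry.pending = null;
  }
  const digest = murmurOf(inlineBytes);
  let image = entry.byDigest.get(digest);
  if (!image) {
    image = createImage(inlineBytes);
    entry.byDigest.set(digest, image);
  }
  return image;
}
```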
Anyway, whatever the caching strategy is, I think we must check from stream equality at some point, else we could have some other hash collisions in the future.
To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.
> When `inlineBytes` has a size smaller than 64 (or another number) we could just use something like:
Honestly I kind of prefer the simplicity of the current patch, with just the one code-path to worry about.
For most documents this simply doesn't matter performance-wise, and then the second commit improves things for the one really bad case we know about.
> Anyway, whatever the caching strategy is, I think we must check from stream equality at some point, else we could have some other hash collisions in the future.
I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.
> To avoid any memory issues we could just add an entry in the image dictionary to tell that this is a dup of another image, and then clear the cache once the parsing is done.
Given that the inline image cache is bound to the current `Parser` instance, which is limited to e.g. each `getOperatorList` instance, this should already be going away when parsing is done; unless I'm misunderstanding what you mean here.
> I think that's what we're finally doing in the code now, with this PR, unless I'm misunderstanding you.
I don't think so. I just changed `cacheKey = hash.hexdigest();` to `cacheKey = "123";` to create some collisions, and all the images are the same (it was "... for stream equality ...", not "from", in my sentence).
With a better hasher you just decrease the probability of a collision, but you don't avoid collisions entirely.
That's why I proposed doing it this way for small images: we have a hash which is unique, hence we don't need to check for equality (the key equality check is done on the native code's side).
Sorry, but I don't really understand what you mean here.
Please keep in mind that since `Stream`s are never cached on the `XRef` instance, for a number of reasons[1], running e.g. `makeSubStream` will always create a new `Stream`-instance and you thus can't directly compare them.
Hence why we need to compute some kind of hash from the data relevant to each inline image.

[1] These include: memory usage, since `DecodeStream`s can be very large once decoded, and concurrently accessing the same `Stream`-instance from e.g. multiple `getOperatorList` invocations wouldn't work (given that parsing is async).
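To illustrate the point with plain JavaScript (not pdf.js internals): two freshly created wrapper objects over equal data are never identical, so only the underlying bytes, or a digest computed from them, can be compared.

```js
// Two distinct wrapper objects over equal data are not the same object,
// so identity comparison cannot detect duplicate inline images.
const data = new Uint8Array([1, 2, 3]);
const streamA = { bytes: data, pos: 0 };
const streamB = { bytes: data.slice(), pos: 0 };

console.log(streamA === streamB); // false, even though the bytes are equal

// Comparing the data itself (or a hash computed from it) does work:
const equalBytes =
  streamA.bytes.length === streamB.bytes.length &&
  streamA.bytes.every((value, i) => value === streamB.bytes[i]);
console.log(equalBytes); // true
```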
By "comparing stream" I meant comparing the "comparing inlineBytes
", sorry about not having be more clear.
We could have two different inlineBytes
with the same murmur or whatever hash and in this case it'll lead to render the same image for both, it's what it's happening when replacing hash.digest()
with "123"
.
I agree with you about having a very low probability of getting the same hash for two different images.
But if in 1 day, 1 month, 1 year, 1 decade, ... someone files a bug because the rendering is not correct, what will we do? Find another way to hash which fixes the bug and doesn't introduce any regressions; that's fine, but I'll write exactly the same comment as this one.
We're lucky because someone noticed the bug, but I'm pretty sure that some people read some PDFs without noticing the bug, because they just don't know what the real correct rendering is, and hence they're maybe reading some wrong information. As said, the probability is likely very low, but we've got a ton of users/PDFs in the world, so it's maybe not negligible.
So I'm fine with taking this patch because it's fixing something and because it's an improvement (decreasing the probability of a collision is an improvement), but we really must write a follow-up to fix the real issue (which is only about checking byte equality).
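A minimal sketch of what such a follow-up could look like (hypothetical names; not an actual pdf.js patch): the hash only selects a bucket, and exact byte comparison decides whether a cached entry really is the same image.

```js
// Hypothetical follow-up sketch: the digest is only a bucket key; exact
// byte comparison decides whether two inline images truly match.
const buckets = new Map(); // digest string -> Array<{ bytes, image }>

function bytesEqual(a, b) {
  if (a.length !== b.length) {
    return false;
  }
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      return false;
    }
  }
  return true;
}

function lookupByDigest(digest, inlineBytes, createImage) {
  let bucket = buckets.get(digest);
  if (!bucket) {
    bucket = [];
    buckets.set(digest, bucket);
  }
  for (const entry of bucket) {
    if (bytesEqual(entry.bytes, inlineBytes)) {
      return entry.image; // a true duplicate, safe to re-use
    }
  }
  const image = createImage(inlineBytes);
  bucket.push({ bytes: inlineBytes, image });
  return image;
}
```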
Do you think it would suffice to only check the bytes for "short" images, since you used 64 in your example above?
If so, the overall performance impact of better checks would probably be considerably lower in general.
Ideally we should check equality for all images we cache: the hash would just be there to avoid useless comparisons.
I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128, ... (we can make some stats to try to figure out a good number from the PDFs in our test suite). Then generate a string from the `inlineBytes` (the shorter the better) and use this string as a key. This way the hashing stuff and the key comparison will be done in the JS engine, and we'll only have to bear the toString conversion and the memory overhead.
We can even keep 1000 (I'd prefer 1024 ;)) and see how it behaves irl, and just verify that we correctly get rid of the `Parser` object to be sure we don't keep some memory. I'd tend to think that having multiple identical 1K images should be pretty rare.
We could have memory inflation in case we have a lot of unique small images: to avoid that we can limit the cache size and keep some stats on how many dups an image has, and then when we reach the cache limit we just remove an entry with no dups.
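A rough sketch of that strategy (all names hypothetical): the stringified bytes are the key, so key equality implies byte equality, and a simple duplicate counter drives eviction once a size limit is reached.

```js
// Hypothetical sketch: key the cache on the stringified bytes themselves
// (so equal keys imply equal data) and evict never-duplicated entries when
// a size limit is hit. The limit of 100 entries is purely illustrative.
const MAX_CACHE_ENTRIES = 100;
const imageCache = new Map(); // stringified bytes -> { image, dups }

function stringKey(bytes) {
  let key = "";
  for (const b of bytes) {
    key += String.fromCharCode(b);
  }
  return key;
}

function lookupByKey(inlineBytes, createImage) {
  const key = stringKey(inlineBytes);
  const entry = imageCache.get(key);
  if (entry) {
    entry.dups++;
    return entry.image;
  }
  if (imageCache.size >= MAX_CACHE_ENTRIES) {
    // Drop one entry that never had a duplicate, if any exists.
    for (const [k, e] of imageCache) {
      if (e.dups === 0) {
        imageCache.delete(k);
        break;
      }
    }
  }
  const image = createImage(inlineBytes);
  imageCache.set(key, { image, dups: 0 });
  return image;
}
```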
> I don't know how `MAX_LENGTH_TO_CACHE` has been guessed, but we could change it to 64/128, ...
Unfortunately those values are way too low in practice, and could render the caching pointless.
Note that according to the PDF specification, inline images are allowed to be up to 4KB in size.
I've got another patch which simply caches using the stringified inline image data, and it seems to perform well enough in practice with the PDF documents I've tested.
> We could have memory inflation in case we have a lot of unique small images: to avoid that we can limit the cache size and keep some stats on how many dups an image has, and then when we reach the cache limit we just remove an entry with no dups.
Do you consider this a blocker, or can we implement that if the need ever arises (e.g. a report about bad memory usage)?
As-is the new approach should at least prevent any rendering errors because of hash collisions.
No, this isn't a blocker, and yes, we'll wait for either a bug report or until we ourselves notice something wrong when profiling a PDF. Or until one of us has a good idea for implementing a good caching strategy ...