Use a folded multiply as finalizer. #55
base: master
Conversation
```diff
-        self.hash.rotate_left(ROTATE) as u64
+        {
+            let full = (self.hash as u128).wrapping_mul(K as u128);
```
With the nightly feature it could use `widening_mul`.
I really don't see the point since it does the same thing and we still have to support stable anyway.
just for dogfooding, but sure, not a priority.
…, r=<try> [DO NOT MERGE] perf run for rustc-hash candidate (folded multiply) See rust-lang/rustc-hash#55.
Since the high bits should already be adequately mixed, wouldn't XORing the rotated value with the old value be sufficient, while having 2-3 cycles lower latency?
@the8472 XOR'ing with an even number of rotations is not reversible, so it's already a bit suspect without strong justification. Either way it would be the same latency: this PR has a total latency of 5 cycles for the pointer hash on Intel (4 for the widening multiply, 1 for the XOR), whereas your proposed scheme would also have 5 (3 for the regular multiply, 1 for the rotate, 1 for the XOR).
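For concreteness, here is a sketch of one reading of the alternative being tallied above: a regular multiply followed by a rotate and a XOR. The constant `K` is a stand-in odd multiplier, not rustc-hash's actual constant, and the cycle counts in the comments are the ones quoted in the discussion, not measured here.

```rust
// Stand-in odd constant, purely for illustration.
const K: u64 = 0x9e3779b97f4a7c15;

// Regular 64-bit multiply (~3 cycles on recent Intel), a rotate (~1 cycle),
// and a XOR with the multiplied value (~1 cycle): 5 cycles total, the same
// as a widening multiply (~4 cycles) plus XOR-fold (~1 cycle).
fn mul_rotate_xor(h: u64) -> u64 {
    let m = h.wrapping_mul(K);
    m ^ m.rotate_left(20)
}

fn main() {
    println!("{:016x}", mul_rotate_xor(42));
}
```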
I mean no multiply for the finalize. Just

Edit: Oh, I missed the part about
…, r=<try> [DO NOT MERGE] perf run for rustc-hash candidate (folded multiply) See rust-lang/rustc-hash#55.
If this is merged, the documentation for the
…, r=<try> [DO NOT MERGE] perf run for rustc-hash candidate (folded multiply) See rust-lang/rustc-hash#55.
Welp, as expected this is worse than just the rotate finalize. It is a 4 -> 5 cycle latency regression, so 25% for integer/pointer hashing (assuming the hash function does get properly inlined and optimized). Assuming no benefits from better distribution, this means our 2% overall regression would correspond to spending ~8% of our total runtime hashing integers/pointers (a 25% slowdown on 8% of runtime gives the observed 2% overall regression).
I assume some combinations of add, xor, and rotate were already tried and found too low quality compared to the multiply?
@the8472 I did propose the following 4-cycle finalizer (same as we have now with multiply -> rotate) in the Rust community Discord:

```rust
hash.wrapping_mul(K) ^ std::arch::x86_64::_mm_crc32_u64(0, hash)
```

This takes care of both the high bits (with the multiply) and the low bits (with the CRC). But it needs SSE4.2... Other than that I don't know anything <= 4 cycles on x86-64 which also mixes entropy into the lower bits. The fastest I know is the folded multiply at 5 cycles (this PR), or a multiply followed by a xorshift, also at 5 cycles of latency:

```rust
hash = hash.wrapping_mul(K);
hash ^= hash >> 32;
```

Another alternative is to keep the current finalizer but tweak the rotation constant to be higher, say 27. This does increase the vulnerability to high-input-bit hash collisions, but means we won't get the catastrophic collapse for pointers until you put 2^27 objects into one table (two orders of magnitude more than the current limit).

A final alternative is to go upstream and/or fork
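The multiply-then-xorshift variant quoted above, written out as a complete function for clarity; `K` here is a stand-in odd constant, not rustc-hash's actual multiplier.

```rust
// Stand-in odd constant for illustration.
const K: u64 = 0x9e3779b97f4a7c15;

// Multiply (~3 cycles) then xorshift (~2 cycles: shift + XOR), folding
// the well-mixed top 32 bits down into the bottom 32 bits that
// hashbrown uses for bucket selection.
fn mul_xorshift(mut hash: u64) -> u64 {
    hash = hash.wrapping_mul(K);
    hash ^ (hash >> 32)
}

fn main() {
    println!("{:016x}", mul_xorshift(42));
}
```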
Whatever you end up with, this alternative might be worth at least discussing with upstream if there is a general case for it (i.e. not rustc-specific).
@orlp Hey, is this PR stuck due to the 5 cycle issue? Sorry to ask, but it wasn't clear to me. Also, if the discussion about solutions to the massive collision problem is happening somewhere else, would you mind sharing a link to it? I'm learning a lot from this issue and I'd love to follow the discussion to learn some more.
@lsunsi Yes, the problem is that fixing the issue for your case makes everyone else ~0.15-0.3% slower (on average; some more, some less).
@orlp Yeah, that's a bummer. What's so special about my case that it seems I'm the only one getting these collisions? Is it about size? Because we have a lot of lines of Rust, but I would think we're nowhere near the biggest project out there.
@lsunsi Yes, you hit a rather severe thresholding effect I did not foresee by being abnormally large in one particular dimension. I don't know what that dimension is, all I know is that you have more than 2^20 elements in one hash table, somewhere. |
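The thresholding effect described above can be illustrated with a toy model (purely illustrative, not rustc-hash's actual pipeline): assume the pre-finalizer hash carries entropy only in its top 20 bits, as happens when the bottom input bits are low-entropy, and finalize with a plain 20-bit rotation.

```rust
use std::collections::HashSet;

// Old-style finalizer: a plain rotation by 20 bits, no multiply.
fn finalize(h: u64) -> u64 {
    h.rotate_left(20)
}

fn main() {
    // Pre-finalizer hashes whose entropy lives only in bits 44..63.
    // After rotating left by 20, those bits become output bits 0..19,
    // so the finalized hash takes at most 2^20 distinct values.
    let num_buckets = 1u64 << 21; // a table larger than the 2^20 threshold
    let mut buckets = HashSet::new();
    for i in 0..(1u64 << 12) {
        let pre = i << 44;
        buckets.insert(finalize(pre) & (num_buckets - 1));
    }
    // Every key lands in the bottom 2^20 bucket indices; the upper half
    // of the table is never used, no matter how large it grows.
    assert!(buckets.iter().all(|&b| b < (1u64 << 20)));
    println!("distinct buckets used: {}", buckets.len());
}
```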
@orlp So another way I could try to get around this issue would be to find out why this is happening and maybe make changes to the source to avoid it? I'm being vague because, as you can probably tell, I can't fully grasp which hash table you are referring to; I don't fully understand how rustc-hash affects my compilation time exactly. Are there any resources you can point me to that might help in this regard?
@lsunsi it's about a particular hash table inside the compiler, collecting some information about generic arguments (probably mostly type arguments are relevant). I don't know the function of the table in question in more detail, either. If I had to guess, this is probably an effect of code that monomorphizes to end up using a lot of distinct and/or deeply nested types. I don't think you need to explore workarounds on your end too eagerly. Even if finding a more satisfactory fix (producing hash values with better overall resilience to issues like these, without the overhead) turns out to be too challenging, there should be enough room for straightforward alternative solutions on the compiler's side: for instance, increasing the hardcoded "20" bits to something higher seems likely unproblematic. Maybe to 28, 30, or 32? Maybe it's a good idea regardless to ship such a quick fix within the next <2 weeks; then it would at least already be on its way to 1.86 with the regular release train.
@steffahn Got it. Yeah, I think @lochetti mentioned on the other issue that we use

Does changing the 20 to a higher number have any drawbacks? What would it take for this "hot fix" to land? Of course, I'd be interested in any fix that does not leave my project stuck on 1.83 indefinitely due to triple compile times haha
For drawbacks, @orlp maybe has a better idea of whether there could be any. Letting it ship through a whole beta cycle should also help with noticing potential unintended drawbacks. It should be easy enough to decide on some solution and/or interim measure for inclusion by 1.87 at the very latest; we definitely shouldn't be having you "stuck indefinitely".
I think we could try increasing the rotation to 26 for now; that bumps up the problematic region by a factor of 64x while still leaving the bottom 6 bits of the upper half of a 64-bit integer able to influence the hashbrown bucket bits. Long term I'd prefer to move to a 5-cycle finalizer, or to investigate a solution that changes hashbrown's scheme to look at the top bits rather than the bottom bits for true robustness.
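As a quick sanity check on the 64x figure above (simple arithmetic, not a claim about hashbrown internals):

```rust
fn main() {
    // With rotation 20, the well-mixed top bits cover only the low 20
    // bucket-index bits; with rotation 26 they cover 26. The problematic
    // table size thus moves from 2^20 to 2^26 buckets.
    let old_threshold: u64 = 1 << 20;
    let new_threshold: u64 = 1 << 26;
    assert_eq!(new_threshold / old_threshold, 64); // 2^(26-20) = 64x
    println!("threshold grows by {}x", new_threshold / old_threshold);
}
```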
@orlp @steffahn For what it's worth, I compiled my project testing some values for the constant at src/lib.rs:26 in rustc-hash. Results were as follows.
So basically you guys nailed down the cause, and a move to 26 would yield the same benefits as this PR (in my codebase) without the drawbacks in cycles, at least from what I can gather.
What would be extremely helpful is a reproducer, so that it could be used to check future PRs: the rot-26 change, moving to a 5-cycle finalizer, changing hashbrown, comparing to other alternatives, and so on.
@lqd Ok, I'll try to figure out a reproducer again with the things we found out. What makes it hard is that I don't know exactly what the offending hashmap is used for, so if anyone has any tips it'd be appreciated.
@lsunsi you can use existing tracing infrastructure to fly less blindly. If you execute the compiler e.g. with
environment variable set (such
The interesting number to pay attention to is the |
@steffahn Thanks a lot for the reply. Not that we needed any validation, but the numbers in my project seem to be totally consistent with the timing experiments I did yesterday.
@orlp what's your recommended approach for reasoning about, or otherwise determining, these “cycle” counts? |
@steffahn I don't understand the question, sorry. If you're asking me to interpret the numbers generated by
@orlp sorry, perhaps the detail I meant to imply/refer to in my question was too basic or just too out of context after the latest replies. But I'm not so familiar with this level of detail for performance analysis: I'm referring to determining the number "5" for a "5-cycle finalizer" 😉 I've tried to Google and found relevant tooling like |
The current finalizer consisting of a right-rotation of 20 bits can create fairly catastrophic results if the bottom input bits are low-entropy and the hash table grows beyond 2^20 elements, see rust-lang/rust#135477 (comment).
In this PR I want to try replacing the finalizer with a folded multiply instead, to evenly spread the entropy across all bits. I've changed the order of the multiply and add in `add_to_hash` to ensure that for single-integer inputs we still only do one (widening) multiply. The first multiplication and addition multiply by/add to zero, so they should be optimized out.
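A standalone sketch of the folded-multiply finalizer described here; the constant `K` is a stand-in odd multiplier, not necessarily the one rustc-hash uses.

```rust
// Stand-in odd multiplier; rustc-hash's actual constant may differ.
const K: u64 = 0xf1357aea2e62a9c5;

// Widen to 128 bits, multiply, and fold the high half back into the low
// half with XOR, so entropy from the multiply's top bits reaches every
// output bit, including the low bits used for bucket selection.
fn folded_multiply(hash: u64) -> u64 {
    let full = (hash as u128).wrapping_mul(K as u128);
    (full as u64) ^ ((full >> 64) as u64)
}

fn main() {
    // Unlike a plain rotation, nearby inputs get thoroughly mixed low bits.
    for x in [1u64, 2, 3] {
        println!("{:016x} -> {:016x}", x, folded_multiply(x));
    }
}
```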