
Use a folded multiply as finalizer. #55

Open · wants to merge 2 commits into base: master

Conversation

@orlp (Contributor) commented Jan 26, 2025

The current finalizer, consisting of a right-rotation of 20 bits, can create fairly catastrophic results if the bottom input bits are low-entropy and the hash table grows beyond 2^20 elements; see rust-lang/rust#135477 (comment).

In this PR I want to try replacing the finalizer with a folded multiply instead, to spread the entropy evenly across all bits. I've changed the order of the multiply and add in add_to_hash to ensure that for single-integer inputs we still only do one (widening) multiply. The first multiplication and addition operate on zero, so they should be optimized out.
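For reference, a folded-multiply finalizer looks roughly like the following minimal, self-contained sketch (the constant and function name are illustrative, not necessarily the exact ones used in this PR):

    const K: u64 = 0xf1357aea2e62a9c5; // example odd constant, not necessarily the crate's K

    fn folded_multiply(hash: u64) -> u64 {
        // Widen to 128 bits, multiply, then XOR the high half into the low half
        // so the entropy is spread across all bits of the result.
        let full = (hash as u128).wrapping_mul(K as u128);
        (full as u64) ^ ((full >> 64) as u64)
    }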


Review thread on the diff (excerpt):

    self.hash.rotate_left(ROTATE) as u64
    {
        let full = (self.hash as u128).wrapping_mul(K as u128);
Member:

with the nightly feature it could use widening_mul

orlp (Contributor, Author):

I really don't see the point since it does the same thing and we still have to support stable anyway.

Member:

just for dogfooding, but sure, not a priority.

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@the8472 (Member) commented Jan 26, 2025

Since the high bits should already be adequately mixed, wouldn't XORing the rotated value with the old value be sufficient, while having 2-3 cycles lower latency?

@orlp (Contributor, Author) commented Jan 26, 2025

@the8472 XORing together an even number of rotations of a value (here the value and one rotation of it) is not reversible, so it's already a bit suspect without strong justification. Either way it would be the same latency: this PR has a total latency of 5 cycles for the pointer hash on Intel (4 for the widening multiply, 1 for the XOR), whereas your proposed scheme would also have 5 (3 for the regular multiply, 1 for the rotate, 1 for the XOR).

@the8472 (Member) commented Jan 26, 2025

> whereas your proposed scheme would also have 5 (3 for the regular multiply, 1 for the rotate, 1 for the XOR).

I mean no multiply for the finalizer. Just (self.hash ^ self.hash.rotate_left(ROTATE)) as u64.

Edit: Oh I missed the part about add_to_hash eliminating a mul for single integers.
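For illustration, the full dependency chain orlp is counting above for this alternative, when hashing a single integer, looks roughly like this sketch (simplified: the real add_to_hash also folds in the previous state, and k stands in for the crate's multiplication constant):

    fn hash_int_with_rotate_xor_finalizer(x: u64, k: u64) -> u64 {
        let h = x.wrapping_mul(k); // regular multiply in add_to_hash: ~3 cycles
        h ^ h.rotate_left(20)      // rotate (1 cycle) + XOR (1 cycle) => ~5 cycles total
    }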

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@WaffleLapkin (Member):

If this is merged, the documentation for the FxHasher struct should be fixed (it currently mentions the bit rotate)

@steffahn (Member):

README.md as well

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@orlp (Contributor, Author) commented Jan 26, 2025

https://perf.rust-lang.org/compare.html?start=15c6f7e1a3a0e51c9b18ce5b9a391e0c324b751c&end=d6944297e3c614ac689e05c9829840bd153ad632&stat=cycles%3Au

Welp, as expected this is worse than just the rotate finalizer. It is a 4 -> 5 cycle latency regression, so 25% for integer/pointer hashing (assuming the hash function does get properly inlined and optimized). Assuming no benefits from better distribution, this means our 2% overall regression would correspond to spending ~8% of our total runtime hashing integers/pointers.

@the8472 (Member) commented Jan 26, 2025

I assume some combination of add, xor, and rotate was already tried and found to be of too low quality compared to the multiply?
The only other option with better bit mixing that I can think of is the crc32 instruction, but for that we'd need SSE4.2 / x86-64-v2... which is not included in any of our baselines... except x86-64 Android, hah.

@orlp (Contributor, Author) commented Jan 26, 2025

@the8472 I did propose the following 4-cycle finalizer (the same latency as what we have now with multiply -> rotate) in the Rust community Discord:

hash.wrapping_mul(K) ^ std::arch::x86_64::_mm_crc32_u64(0, hash)

This takes care of both the high bits (with the multiply) and the low bits (with the CRC). But it needs SSE4.2...
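A self-contained sketch of that candidate, compile-time gated on SSE4.2 (the multiplication constant is taken as a parameter here, since the snippet above doesn't pin down K; this is illustrative, not the PR's code):

    #[cfg(all(target_arch = "x86_64", target_feature = "sse4.2"))]
    fn finish_mul_crc(hash: u64, k: u64) -> u64 {
        // SAFETY: the cfg above guarantees SSE4.2 is available at compile time.
        let crc = unsafe { std::arch::x86_64::_mm_crc32_u64(0, hash) };
        // The multiply mixes entropy into the high bits, the CRC into the low bits;
        // the two run in parallel, so the critical path stays around 4 cycles.
        hash.wrapping_mul(k) ^ crc
    }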

Other than that I don't know of anything at <= 4 cycles on x86-64 which also mixes entropy into the lower bits. The fastest I know of is the folded multiply at 5 cycles (this PR), or a multiply followed by a xorshift, also at 5 cycles of latency:

hash = hash.wrapping_mul(K);
hash ^= hash >> 32;
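As a self-contained sketch (again with the constant as a parameter; not the code from this PR):

    fn finish_mul_xorshift(hash: u64, k: u64) -> u64 {
        // Regular 64-bit multiply (~3 cycles), then a xorshift (shift + XOR,
        // 1 cycle each) to fold the well-mixed high bits down into the low bits,
        // giving ~5 cycles on the critical path, matching the figure above.
        let h = hash.wrapping_mul(k);
        h ^ (h >> 32)
    }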

Another alternative is to keep the current finalizer but bump the rotation constant higher, to say 27. This does increase the vulnerability to high-input-bit hash collisions, but it means we won't get the catastrophic collapse for pointers until you put 2^27 objects into one table (two orders of magnitude more than the current limit).


A final alternative is to go upstream and/or fork hashbrown to make it use the top bits of the hash so we can remove the finalizer entirely.

@ChrisDenton (Member):

> A final alternative is to go upstream and/or fork hashbrown to make it use the top bits of the hash so we can remove the finalizer entirely.

Whatever you end up with, this alternative might be worth at least discussing with upstream if there is a general case for it (i.e. not rustc specific).

@lsunsi commented Feb 1, 2025

@orlp Hey, is this PR stuck due to the 5 cycle issue? Sorry to ask but it wasn't clear to me.

Also, if the discussion about solutions to the massive collision problem is happening somewhere else, would you mind sharing a link to it?

I'm learning a lot from this issue and I'd love to follow the discussion to learn some more.

@orlp (Contributor, Author) commented Feb 1, 2025

@lsunsi Yes, the problem is that fixing the issue for your case makes everyone else ~0.15-0.3% slower (on average, some more some less).

@lsunsi commented Feb 1, 2025

@orlp Yeah, that's a bummer. What's so special about my case that I seem to be the only one getting these collisions? Is it about size? Because we have a lot of lines of Rust, but I would think we're nowhere near the biggest project out there.

@orlp (Contributor, Author) commented Feb 1, 2025

@lsunsi Yes, you hit a rather severe thresholding effect I did not foresee by being abnormally large in one particular dimension. I don't know what that dimension is, all I know is that you have more than 2^20 elements in one hash table, somewhere.

@lsunsi commented Feb 1, 2025

@orlp So another way I could try to get around this issue would be to find out why this is happening and maybe make changes to our source to avoid it? I'm being vague because, as you can probably tell, I can't fully grasp which hash table you're referring to, or how rustc-hash affects my compilation time exactly. Are there any resources you can point to that might help?

@steffahn (Member) commented Feb 1, 2025

@lsunsi it's about a particular hash table inside the compiler, collecting some information about generic arguments (probably mostly type arguments are relevant). I don't know the function of the table in question in more detail either. If I had to guess, this is probably an effect of code that monomorphizes into a lot of distinct and/or deeply nested types.

I don't think you need to explore workarounds on your end too eagerly. Even if finding a more satisfactory fix (one that produces hash values with better overall resilience to issues like these, without the overhead) turns out to be too challenging, there is enough room for straightforward alternative solutions on the compiler's side: for instance, increasing the hardcoded "20" bits to something higher seems likely unproblematic. Maybe to 28, 30, or 32?

Maybe it's a good idea regardless to ship such a quick fix within the next <2 weeks; then it would at least already be on its way to 1.86 with the regular release train.

@lsunsi commented Feb 1, 2025

@steffahn Got it. Yeah, I think @lochetti mentioned on the other issue that we use diesel heavily. Since its SQL queries are type-checked, they seem to generate really huge, deeply nested types, so you're probably on point.

Does changing the 20 to a higher number have any drawbacks? What would it take for this "hot fix" to land? Of course I'd be interested in any fix that does not leave my project stuck on 1.83 indefinitely due to triple compile times, haha.

@steffahn (Member) commented Feb 1, 2025

For drawbacks, @orlp may have a better idea of whether there could be any. Letting it ship through a whole beta cycle should also help with noticing potential unintended drawbacks.

It should be easy enough to decide on some solution and/or interim measure for inclusion by 1.87 at the very latest - we definitely shouldn't leave you "stuck indefinitely".

@orlp (Contributor, Author) commented Feb 1, 2025

I think we could try increasing the rotation to 26 for now; that bumps up the problematic region by a factor of 64 while still leaving the bottom 6 bits of the upper half of a 64-bit integer able to influence the hashbrown bucket bits.

Long term I'd prefer to move to a 5-cycle finalizer, or to investigate a solution that changes hashbrown's scheme to look at the top bits rather than the bottom bits for true robustness.
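As a sketch of how small that interim change would be (simplified to a plain u64 here; the real finish() works on the hasher's usize state, and only the constant changes):

    const ROTATE: u32 = 26; // was 20

    fn finish(hash: u64) -> u64 {
        // With a rotation of 26, the low 26 bits of the result come from the
        // top 26 bits of the accumulated state, so the collapse threshold
        // discussed above moves from 2^20 to 2^26 elements.
        hash.rotate_left(ROTATE)
    }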

@lsunsi commented Feb 1, 2025

@orlp @steffahn For what it's worth, I compiled my project testing some values for src/lib.rs:26 in rustc-hash. Results were as follows.

ROTATE    time
20        10m26s
21        5m05s
22        3m49s
23        3m49s
26        3m49s

So basically you guys nailed down the reason, and a move to 26 would yield basically the same benefits as this PR (in my codebase) without the drawbacks in cycles, at least from what I can gather.

@lsunsi commented Feb 2, 2025

@orlp @steffahn Would it be helpful if I opened a PR here and on rust itself for the hot-fix change to 26? It seems trivial and I'd love to help if it's all settled

@lqd (Member) commented Feb 2, 2025

What could be extremely helpful would be a reproducer, so that it can be used to check future PRs: the rot-26 change, moving to a 5-cycle finalizer, changing hashbrown, comparisons against other alternatives, and so on.

@lsunsi commented Feb 2, 2025

@lqd Ok, I'll try to figure out a reproducer again with the things we found out. The thing that makes it hard is that I don't know exactly what the offending hashmap is used for, so if anyone has any tips it'd be appreciated.

@steffahn (Member) commented Feb 2, 2025

@lsunsi you can use existing tracing infrastructure to fly less blindly. If you execute the compiler with e.g. the

RUSTC_LOG=rustc_interface::passes=info

environment variable set (such info-level tracing is generally available on a normal prebuilt rustc; no need to re-build it yourself), you get output per rustc invocation such as:

INFO rustc_interface::passes 0 parse sess buffered_lints
 INFO rustc_interface::passes Pre-codegen
 Ty interner             total           ty lt ct all
     Adt               :     20  8.5%,  0.0%   4.3%  0.0%  0.0%
     Array             :     17  7.3%,  0.0%   4.3%  0.9%  0.0%
     Slice             :      4  1.7%,  0.0%   0.4%  0.0%  0.0%
     RawPtr            :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Ref               :     49 20.9%,  1.7%  14.1%  0.9%  0.0%
     FnDef             :     11  4.7%,  0.0%   2.6%  0.9%  0.0%
     FnPtr             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Placeholder       :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineWitness  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Dynamic           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Closure           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Bound             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :      5  2.1%,  0.0%   0.0%  0.0%  0.0%
     Infer             :    126 53.8%, 42.7%   0.0%  0.0%  0.0%
     Alias             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total    234        44.4%  25.6%  2.6%  0.0%
 GenericArgs interner: #75
 Region interner: #545
 Const Allocation interner: #1
 Layout interner: #1
 INFO rustc_interface::passes Post-codegen
 Ty interner             total           ty lt ct all
     Adt               :     20  8.5%,  0.0%   4.3%  0.0%  0.0%
     Array             :     17  7.3%,  0.0%   4.3%  0.9%  0.0%
     Slice             :      4  1.7%,  0.0%   0.4%  0.0%  0.0%
     RawPtr            :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Ref               :     49 20.9%,  1.7%  14.1%  0.9%  0.0%
     FnDef             :     11  4.7%,  0.0%   2.6%  0.9%  0.0%
     FnPtr             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Placeholder       :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineWitness  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Dynamic           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Closure           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Bound             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :      5  2.1%,  0.0%   0.0%  0.0%  0.0%
     Infer             :    126 53.8%, 42.7%   0.0%  0.0%  0.0%
     Alias             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total    234        44.4%  25.6%  2.6%  0.0%
 GenericArgs interner: #75
 Region interner: #545
 Const Allocation interner: #1
 Layout interner: #1

The interesting number to pay attention to is the GenericArgs interner (I'd assume the Post-codegen one will be the larger number), because that's printing the size of the hash table in question. If that size is still similar to your original case's (which you can probably find out easily using the same approach), then you should be well on your way to a repro; you can probably use this information to figure out more easily which parts of the code affect the size of this table, and by how much.

@lsunsi commented Feb 2, 2025

 INFO rustc_interface::passes Pre-codegen
 Ty interner             total           ty lt ct all
     Adt               : 1244809 38.3%, 11.5%  15.2%  0.0%  0.0%
     Array             :  10583  0.3%,  0.0%   0.3%  0.1%  0.0%
     Slice             :   3809  0.1%,  0.0%   0.1%  0.0%  0.0%
     RawPtr            :  10799  0.3%,  0.0%   0.0%  0.0%  0.0%
     Ref               : 739030 22.7%,  5.2%  16.5%  0.1%  0.0%
     FnDef             : 260577  8.0%,  0.9%   3.0%  0.0%  0.0%
     FnPtr             :  99404  3.1%,  0.7%   1.4%  0.0%  0.0%
     Placeholder       :      2  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         : 104714  3.2%,  0.3%   0.6%  0.0%  0.0%
     CoroutineWitness  :  36708  1.1%,  0.0%   0.1%  0.0%  0.0%
     Dynamic           : 141581  4.4%,  0.3%   3.4%  0.0%  0.0%
     Closure           :  97314  3.0%,  0.5%   1.4%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             : 134690  4.1%,  0.2%   1.4%  0.0%  0.0%
     Bound             :     35  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :   1108  0.0%,  0.0%   0.0%  0.0%  0.0%
     Infer             :  22298  0.7%,  0.6%   0.0%  0.0%  0.0%
     Alias             : 345843 10.6%,  1.1%   4.8%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      1  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total 3253305        21.3%  48.2%  0.3%  0.0%
 GenericArgs interner: #3164754
 Region interner: #146132
 Const Allocation interner: #14709
 Layout interner: #13153
 INFO rustc_interface::passes Post-codegen
 Ty interner             total           ty lt ct all
     Adt               : 1929654 31.2%,  7.3%   8.7%  0.0%  0.0%
     Array             :  11443  0.2%,  0.0%   0.1%  0.0%  0.0%
     Slice             :   5004  0.1%,  0.0%   0.0%  0.0%  0.0%
     RawPtr            : 113513  1.8%,  0.0%   0.0%  0.0%  0.0%
     Ref               : 1320815 21.3%,  3.0%   9.0%  0.1%  0.0%
     FnDef             : 1225883 19.8%,  0.4%   1.6%  0.0%  0.0%
     FnPtr             : 138106  2.2%,  0.4%   0.7%  0.0%  0.0%
     Placeholder       :      2  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         : 113256  1.8%,  0.1%   0.3%  0.0%  0.0%
     CoroutineWitness  :  41119  0.7%,  0.0%   0.1%  0.0%  0.0%
     Dynamic           : 144509  2.3%,  0.2%   1.8%  0.0%  0.0%
     Closure           : 159642  2.6%,  0.2%   0.8%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             : 230802  3.7%,  0.4%   0.8%  0.0%  0.0%
     Bound             :     35  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :   2441  0.0%,  0.0%   0.0%  0.0%  0.0%
     Infer             :  22298  0.4%,  0.3%   0.0%  0.0%  0.0%
     Alias             : 733788 11.8%,  1.5%   2.8%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      1  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total 6192311        13.9%  26.6%  0.1%  0.0%
 GenericArgs interner: #5164480
 Region interner: #168783
 Const Allocation interner: #109611
 Layout interner: #34898

@steffahn Thanks a lot for the reply. Not that we needed any validation, but the numbers in my project seem to be totally consistent with the timing experiments I did yesterday.

@steffahn (Member) commented Feb 2, 2025

@orlp what's your recommended approach for reasoning about, or otherwise determining, these “cycle” counts?

@orlp (Contributor, Author) commented Feb 2, 2025

@steffahn I don't understand the question, sorry. If you're asking me to interpret the numbers generated by rustc_interface::passes=info, I have no idea; I've never looked at that output before, nor do I have any frame of reference for it.

@steffahn (Member) commented Feb 2, 2025

@orlp sorry, perhaps the detail I meant to refer to in my question was too basic, or just too out of context after the latest replies. I'm not so familiar with this level of detail in performance analysis: I'm referring to determining the number "5" in "5-cycle finalizer" 😉 I've tried Googling and found relevant tooling like llvm-mca mentioned, but I thought perhaps you have more experience and can recommend a particularly good or easy approach.

@orlp (Contributor, Author) commented Feb 2, 2025

@steffahn I look at the latencies listed on uops.info for the relevant instructions and compute the latency of the longest dependency chain. The body of the finalizer is so simple that this really is the only thing that matters.
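For example, applying that to the folded multiply in this PR (a rough back-of-the-envelope using the Intel latency figures quoted earlier in the thread; exact numbers vary by microarchitecture):

    widening 64x64 -> 128-bit multiply    ~4 cycles
    XOR of the two 64-bit halves           1 cycle
    longest dependency chain               4 + 1 = 5 cycles, i.e. a "5-cycle finalizer"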
