
Use a folded multiply as finalizer. #55

Open · wants to merge 2 commits into base: master

Conversation

@orlp (Contributor) commented Jan 26, 2025

The current finalizer, consisting of a right-rotation of 20 bits, can create fairly catastrophic results if the bottom input bits are low-entropy and the hash table grows beyond 2^20 elements; see rust-lang/rust#135477 (comment).

In this PR I want to try replacing the finalizer with a folded multiply instead, to spread the entropy evenly across all bits. I've changed the order of the multiply and add in add_to_hash to ensure that for single-integer inputs we still only do one (widening) multiply. The first multiplication and addition operate on zero, so they should be optimized out.
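For reference, a folded-multiply finalizer looks roughly like the following minimal, self-contained sketch (the constant and function name are illustrative, not necessarily the exact ones used in this PR):

    const K: u64 = 0xf1357aea2e62a9c5; // example odd constant, not necessarily the crate's K

    fn folded_multiply(hash: u64) -> u64 {
        // Widen to 128 bits, multiply, then XOR the high half into the low half
        // so the entropy is spread across all bits of the result.
        let full = (hash as u128).wrapping_mul(K as u128);
        (full as u64) ^ ((full >> 64) as u64)
    }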


Review thread on the diff (excerpt):

    self.hash.rotate_left(ROTATE) as u64
    {
        let full = (self.hash as u128).wrapping_mul(K as u128);
Member:

with the nightly feature it could use widening_mul

orlp (Contributor, Author):

I really don't see the point since it does the same thing and we still have to support stable anyway.

Member:

just for dogfooding, but sure, not a priority.

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@the8472 (Member) commented Jan 26, 2025

Since the high bits should already be adequately mixed, wouldn't XORing the rotated value with the old value be sufficient, while having 2-3 cycles lower latency?

@orlp (Contributor, Author) commented Jan 26, 2025

@the8472 XORing together an even number of rotations of a value (here the value and one rotation of it) is not reversible, so it's already a bit suspect without strong justification. Either way it would be the same latency: this PR has a total latency of 5 cycles for the pointer hash on Intel (4 for the widening multiply, 1 for the XOR), whereas your proposed scheme would also have 5 (3 for the regular multiply, 1 for the rotate, 1 for the XOR).

@the8472 (Member) commented Jan 26, 2025

> whereas your proposed scheme would also have 5 (3 for the regular multiply, 1 for the rotate, 1 for the XOR).

I mean no multiply for the finalizer. Just (self.hash ^ self.hash.rotate_left(ROTATE)) as u64.

Edit: Oh I missed the part about add_to_hash eliminating a mul for single integers.
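For illustration, the full dependency chain orlp is counting above for this alternative, when hashing a single integer, looks roughly like this sketch (simplified: the real add_to_hash also folds in the previous state, and k stands in for the crate's multiplication constant):

    fn hash_int_with_rotate_xor_finalizer(x: u64, k: u64) -> u64 {
        let h = x.wrapping_mul(k); // regular multiply in add_to_hash: ~3 cycles
        h ^ h.rotate_left(20)      // rotate (1 cycle) + XOR (1 cycle) => ~5 cycles total
    }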

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@WaffleLapkin (Member):

If this is merged, the documentation for the FxHasher struct should be fixed (it currently mentions the bit rotate)

@steffahn (Member):

README.md as well

bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
…, r=<try>

[DO NOT MERGE] perf run for rustc-hash candidate (folded multiply)

See rust-lang/rustc-hash#55.
@orlp (Contributor, Author) commented Jan 26, 2025

https://perf.rust-lang.org/compare.html?start=15c6f7e1a3a0e51c9b18ce5b9a391e0c324b751c&end=d6944297e3c614ac689e05c9829840bd153ad632&stat=cycles%3Au

Welp, as expected this is worse than just the rotate finalizer. It is a 4 -> 5 cycle latency regression, so 25% for integer/pointer hashing (assuming the hash function does get properly inlined and optimized). Assuming no benefits from better distribution, this means our 2% overall regression would correspond to spending ~8% of our total runtime hashing integers/pointers.

@the8472 (Member) commented Jan 26, 2025

I assume some combination of add, xor, and rotate was already tried and found to be of too low quality compared to the multiply?
The only other option with better bit mixing that I can think of is the crc32 instruction, but for that we'd need SSE4.2 / x86-64-v2... which is not included in any of our baselines... except x86-64 Android, hah.

@orlp (Contributor, Author) commented Jan 26, 2025

@the8472 I did propose the following 4-cycle finalizer (the same latency as what we have now with multiply -> rotate) in the Rust community Discord:

hash.wrapping_mul(K) ^ std::arch::x86_64::_mm_crc32_u64(0, hash)

This takes care of both the high bits (with the multiply) and the low bits (with the CRC). But it needs SSE4.2...
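A self-contained sketch of that candidate, compile-time gated on SSE4.2 (the multiplication constant is taken as a parameter here, since the snippet above doesn't pin down K; this is illustrative, not the PR's code):

    #[cfg(all(target_arch = "x86_64", target_feature = "sse4.2"))]
    fn finish_mul_crc(hash: u64, k: u64) -> u64 {
        // SAFETY: the cfg above guarantees SSE4.2 is available at compile time.
        let crc = unsafe { std::arch::x86_64::_mm_crc32_u64(0, hash) };
        // The multiply mixes entropy into the high bits, the CRC into the low bits;
        // the two run in parallel, so the critical path stays around 4 cycles.
        hash.wrapping_mul(k) ^ crc
    }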

Other than that I don't know of anything at <= 4 cycles on x86-64 which also mixes entropy into the lower bits. The fastest I know of is the folded multiply at 5 cycles (this PR), or a multiply followed by a xorshift, also at 5 cycles of latency:

hash = hash.wrapping_mul(K);
hash ^= hash >> 32;
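As a self-contained sketch (again with the constant as a parameter; not the code from this PR):

    fn finish_mul_xorshift(hash: u64, k: u64) -> u64 {
        // Regular 64-bit multiply (~3 cycles), then a xorshift (shift + XOR,
        // 1 cycle each) to fold the well-mixed high bits down into the low bits,
        // giving ~5 cycles on the critical path, matching the figure above.
        let h = hash.wrapping_mul(k);
        h ^ (h >> 32)
    }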

Another alternative is to keep the current finalizer but bump the rotation constant higher, to say 27. This does increase the vulnerability to high-input-bit hash collisions, but it means we won't get the catastrophic collapse for pointers until you put 2^27 objects into one table (two orders of magnitude more than the current limit).


A final alternative is to go upstream and/or fork hashbrown to make it use the top bits of the hash so we can remove the finalizer entirely.

@ChrisDenton (Member):

> A final alternative is to go upstream and/or fork hashbrown to make it use the top bits of the hash so we can remove the finalizer entirely.

Whatever you end up with, this alternative might be worth at least discussing with upstream if there is a general case for it (i.e. not rustc specific).

@lsunsi commented Feb 1, 2025

@orlp Hey, is this PR stuck due to the 5 cycle issue? Sorry to ask but it wasn't clear to me.

Also, if the discussion about solutions to the massive collision problem is happening somewhere else, would you mind sharing a link to it?

I'm learning a lot from this issue and I'd love to follow the discussion to learn some more.

@orlp (Contributor, Author) commented Feb 1, 2025

@lsunsi Yes, the problem is that fixing the issue for your case makes everyone else ~0.15-0.3% slower (on average, some more some less).

@lsunsi commented Feb 1, 2025

@orlp Yeah, that's a bummer. What's so special about my case that I seem to be the only one getting these collisions? Is it about size? Because we have a lot of lines of Rust, but I would think we're nowhere near the biggest project out there.

@orlp (Contributor, Author) commented Feb 1, 2025

@lsunsi Yes, you hit a rather severe thresholding effect I did not foresee by being abnormally large in one particular dimension. I don't know what that dimension is, all I know is that you have more than 2^20 elements in one hash table, somewhere.

@lsunsi commented Feb 1, 2025

@orlp So another way I could try to get around this issue would be to find out why this is happening and maybe make changes to our source to avoid it? I'm being vague because, as you can probably tell, I can't fully grasp which hash table you're referring to, or how rustc-hash affects my compilation time exactly. Are there any resources you can point to that might help?

@steffahn (Member) commented Feb 1, 2025

@lsunsi it's about a particular hash table inside the compiler, collecting some information about generic arguments (probably mostly type arguments are relevant). I don't know the function of the table in question in more detail either. If I had to guess, this is probably an effect of code that monomorphizes into a lot of distinct and/or deeply nested types.

I don't think you need to explore workarounds on your end too eagerly. Even if finding a more satisfactory fix (one that produces hash values with better overall resilience to issues like these, without the overhead) turns out to be too challenging, there is enough room for straightforward alternative solutions on the compiler's side: for instance, increasing the hardcoded "20" bits to something higher seems likely unproblematic. Maybe to 28, 30, or 32?

Maybe it's a good idea regardless to ship such a quick fix within the next <2 weeks; then it would at least already be on its way to 1.86 with the regular release train.

@lsunsi commented Feb 1, 2025

@steffahn Got it. Yeah, I think @lochetti mentioned on the other issue that we use diesel heavily. Since its SQL queries are type-checked, they seem to generate really huge, deeply nested types, so you're probably on point.

Does changing the 20 to a higher number have any drawbacks? What would it take for this "hot fix" to land? Of course I'd be interested in any fix that does not leave my project stuck on 1.83 indefinitely due to triple compile times, haha.

@steffahn (Member) commented Feb 1, 2025

For drawbacks, @orlp may have a better idea of whether there could be any. Letting it ship through a whole beta cycle should also help with noticing potential unintended drawbacks.

It should be easy enough to decide on some solution and/or interim measure for inclusion by 1.87 at the very latest - we definitely shouldn't leave you "stuck indefinitely".

@orlp (Contributor, Author) commented Feb 1, 2025

I think we could try increasing the rotation to 26 for now; that bumps up the problematic region by a factor of 64 while still leaving the bottom 6 bits of the upper half of a 64-bit integer able to influence the hashbrown bucket bits.

Long term I'd prefer to move to a 5-cycle finalizer, or to investigate a solution that changes hashbrown's scheme to look at the top bits rather than the bottom bits for true robustness.
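As a sketch of how small that interim change would be (simplified to a plain u64 here; the real finish() works on the hasher's usize state, and only the constant changes):

    const ROTATE: u32 = 26; // was 20

    fn finish(hash: u64) -> u64 {
        // With a rotation of 26, the low 26 bits of the result come from the
        // top 26 bits of the accumulated state, so the collapse threshold
        // discussed above moves from 2^20 to 2^26 elements.
        hash.rotate_left(ROTATE)
    }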

@lsunsi commented Feb 1, 2025

@orlp @steffahn For what it's worth, I compiled my project testing some values for src/lib.rs:26 in rustc-hash. Results were as follows.

ROTATE    time
20        10m26s
21        5m05s
22        3m49s
23        3m49s
26        3m49s

So basically you guys nailed down the reason, and a move to 26 would yield basically the same benefits as this PR (in my codebase) without the drawbacks in cycles, at least from what I can gather.

@lsunsi commented Feb 2, 2025

@orlp @steffahn Would it be helpful if I opened a PR here and on rust itself for the hot-fix change to 26? It seems trivial and I'd love to help if it's all settled

@lqd (Member) commented Feb 2, 2025

What could be extremely helpful would be a reproducer, so that it can be used to check future PRs: the rot-26 change, moving to a 5-cycle finalizer, changing hashbrown, comparisons against other alternatives, and so on.

@lsunsi commented Feb 2, 2025

@lqd Ok, I'll try to figure out a reproducer again with the things we found out. The thing that makes it hard is that I don't know exactly what the offending hashmap is used for, so if anyone has any tips it'd be appreciated.

@steffahn (Member) commented Feb 2, 2025

@lsunsi you can use existing tracing infrastructure to fly less blindly. If you execute the compiler with e.g. the

RUSTC_LOG=rustc_interface::passes=info

environment variable set (such info-level tracing is generally available on a normal prebuilt rustc; no need to re-build it yourself), you get output per rustc invocation such as:

INFO rustc_interface::passes 0 parse sess buffered_lints
 INFO rustc_interface::passes Pre-codegen
 Ty interner             total           ty lt ct all
     Adt               :     20  8.5%,  0.0%   4.3%  0.0%  0.0%
     Array             :     17  7.3%,  0.0%   4.3%  0.9%  0.0%
     Slice             :      4  1.7%,  0.0%   0.4%  0.0%  0.0%
     RawPtr            :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Ref               :     49 20.9%,  1.7%  14.1%  0.9%  0.0%
     FnDef             :     11  4.7%,  0.0%   2.6%  0.9%  0.0%
     FnPtr             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Placeholder       :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineWitness  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Dynamic           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Closure           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Bound             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :      5  2.1%,  0.0%   0.0%  0.0%  0.0%
     Infer             :    126 53.8%, 42.7%   0.0%  0.0%  0.0%
     Alias             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total    234        44.4%  25.6%  2.6%  0.0%
 GenericArgs interner: #75
 Region interner: #545
 Const Allocation interner: #1
 Layout interner: #1
 INFO rustc_interface::passes Post-codegen
 Ty interner             total           ty lt ct all
     Adt               :     20  8.5%,  0.0%   4.3%  0.0%  0.0%
     Array             :     17  7.3%,  0.0%   4.3%  0.9%  0.0%
     Slice             :      4  1.7%,  0.0%   0.4%  0.0%  0.0%
     RawPtr            :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Ref               :     49 20.9%,  1.7%  14.1%  0.9%  0.0%
     FnDef             :     11  4.7%,  0.0%   2.6%  0.9%  0.0%
     FnPtr             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Placeholder       :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineWitness  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Dynamic           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Closure           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             :      1  0.4%,  0.0%   0.0%  0.0%  0.0%
     Bound             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :      5  2.1%,  0.0%   0.0%  0.0%  0.0%
     Infer             :    126 53.8%, 42.7%   0.0%  0.0%  0.0%
     Alias             :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total    234        44.4%  25.6%  2.6%  0.0%
 GenericArgs interner: #75
 Region interner: #545
 Const Allocation interner: #1
 Layout interner: #1

The interesting number to pay attention to is the GenericArgs interner (I'd assume the Post-codegen one will be the larger number), because that's printing the size of the hash table in question. If that size is still similar to your original case's (which you can probably find out easily using the same approach), then you should be well on your way to a repro; you can probably use this information to figure out more easily which parts of the code affect the size of this table, and by how much.

@lsunsi commented Feb 2, 2025

 INFO rustc_interface::passes Pre-codegen
 Ty interner             total           ty lt ct all
     Adt               : 1244809 38.3%, 11.5%  15.2%  0.0%  0.0%
     Array             :  10583  0.3%,  0.0%   0.3%  0.1%  0.0%
     Slice             :   3809  0.1%,  0.0%   0.1%  0.0%  0.0%
     RawPtr            :  10799  0.3%,  0.0%   0.0%  0.0%  0.0%
     Ref               : 739030 22.7%,  5.2%  16.5%  0.1%  0.0%
     FnDef             : 260577  8.0%,  0.9%   3.0%  0.0%  0.0%
     FnPtr             :  99404  3.1%,  0.7%   1.4%  0.0%  0.0%
     Placeholder       :      2  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         : 104714  3.2%,  0.3%   0.6%  0.0%  0.0%
     CoroutineWitness  :  36708  1.1%,  0.0%   0.1%  0.0%  0.0%
     Dynamic           : 141581  4.4%,  0.3%   3.4%  0.0%  0.0%
     Closure           :  97314  3.0%,  0.5%   1.4%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             : 134690  4.1%,  0.2%   1.4%  0.0%  0.0%
     Bound             :     35  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :   1108  0.0%,  0.0%   0.0%  0.0%  0.0%
     Infer             :  22298  0.7%,  0.6%   0.0%  0.0%  0.0%
     Alias             : 345843 10.6%,  1.1%   4.8%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      1  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total 3253305        21.3%  48.2%  0.3%  0.0%
 GenericArgs interner: #3164754
 Region interner: #146132
 Const Allocation interner: #14709
 Layout interner: #13153
 INFO rustc_interface::passes Post-codegen
 Ty interner             total           ty lt ct all
     Adt               : 1929654 31.2%,  7.3%   8.7%  0.0%  0.0%
     Array             :  11443  0.2%,  0.0%   0.1%  0.0%  0.0%
     Slice             :   5004  0.1%,  0.0%   0.0%  0.0%  0.0%
     RawPtr            : 113513  1.8%,  0.0%   0.0%  0.0%  0.0%
     Ref               : 1320815 21.3%,  3.0%   9.0%  0.1%  0.0%
     FnDef             : 1225883 19.8%,  0.4%   1.6%  0.0%  0.0%
     FnPtr             : 138106  2.2%,  0.4%   0.7%  0.0%  0.0%
     Placeholder       :      2  0.0%,  0.0%   0.0%  0.0%  0.0%
     Coroutine         : 113256  1.8%,  0.1%   0.3%  0.0%  0.0%
     CoroutineWitness  :  41119  0.7%,  0.0%   0.1%  0.0%  0.0%
     Dynamic           : 144509  2.3%,  0.2%   1.8%  0.0%  0.0%
     Closure           : 159642  2.6%,  0.2%   0.8%  0.0%  0.0%
     CoroutineClosure  :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Tuple             : 230802  3.7%,  0.4%   0.8%  0.0%  0.0%
     Bound             :     35  0.0%,  0.0%   0.0%  0.0%  0.0%
     Param             :   2441  0.0%,  0.0%   0.0%  0.0%  0.0%
     Infer             :  22298  0.4%,  0.3%   0.0%  0.0%  0.0%
     Alias             : 733788 11.8%,  1.5%   2.8%  0.0%  0.0%
     Pat               :      0  0.0%,  0.0%   0.0%  0.0%  0.0%
     Foreign           :      1  0.0%,  0.0%   0.0%  0.0%  0.0%
                   total 6192311        13.9%  26.6%  0.1%  0.0%
 GenericArgs interner: #5164480
 Region interner: #168783
 Const Allocation interner: #109611
 Layout interner: #34898

@steffahn Thanks a lot for the reply. Not that we needed any validation, but the numbers in my project seem to be totally consistent with the timing experiments I did yesterday.

@steffahn (Member) commented Feb 2, 2025

@orlp what's your recommended approach for reasoning about, or otherwise determining, these “cycle” counts?

@orlp (Contributor, Author) commented Feb 2, 2025

@steffahn I don't understand the question, sorry. If you're asking me to interpret the numbers generated by rustc_interface::passes=info, I have no idea; I've never looked at that output before, nor do I have any frame of reference for it.

@steffahn (Member) commented Feb 2, 2025

@orlp sorry, perhaps the detail I meant to refer to in my question was too basic, or just too out of context after the latest replies. I'm not so familiar with this level of detail in performance analysis: I'm referring to determining the number "5" in "5-cycle finalizer" 😉 I've tried Googling and found relevant tooling like llvm-mca mentioned, but I thought perhaps you have more experience and can recommend a particularly good or easy approach.

@orlp (Contributor, Author) commented Feb 2, 2025

@steffahn I look at the latencies listed on uops.info for the relevant instructions and compute the latency of the longest dependency chain. The body of the finalizer is so simple that this really is the only thing that matters.
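For example, applying that to the folded multiply in this PR (a rough back-of-the-envelope using the Intel latency figures quoted earlier in the thread; exact numbers vary by microarchitecture):

    widening 64x64 -> 128-bit multiply    ~4 cycles
    XOR of the two 64-bit halves           1 cycle
    longest dependency chain               4 + 1 = 5 cycles, i.e. a "5-cycle finalizer"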
