Skip to content

derived PartialOrd::le is non-optimal for 2-field struct of primitive integers #106107

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
scottmcm opened this issue Dec 24, 2022 · 1 comment · Fixed by #137197
Closed

derived PartialOrd::le is non-optimal for 2-field struct of primitive integers #106107

scottmcm opened this issue Dec 24, 2022 · 1 comment · Fixed by #137197
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@scottmcm
Copy link
Member

scottmcm commented Dec 24, 2022

This codegen test should pass: comparison-operators-twofields.patch https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=4c8846efffd0574ec44e60c3cffbe842

And it turns out that the derived PartialOrd::lt does pass that test

define noundef zeroext i1 @check_lt(i16 %0, i16 %1, i16 %2, i16 %3) unnamed_addr #0 {
start:
  %_6.i.i = icmp slt i16 %0, %2
  %_9.i.i.not = icmp eq i16 %0, %2
  %_6.i3.i = icmp ult i16 %1, %3
  %4 = select i1 %_9.i.i.not, i1 %_6.i3.i, i1 %_6.i.i
  ret i1 %4
}

but the derived le, gt, and ge do not pass it.

define noundef zeroext i1 @check_gt(i16 %0, i16 %1, i16 %2, i16 %3) unnamed_addr #0 {
start:
  %_6.i.i = icmp slt i16 %0, %2
  %_9.i.i = icmp ne i16 %0, %2
  %..i.i = zext i1 %_9.i.i to i8
  %_3.0.i.i = select i1 %_6.i.i, i8 -1, i8 %..i.i
  %4 = icmp eq i8 %_3.0.i.i, 0
  %_6.i3.i = icmp ult i16 %1, %3
  %_9.i4.i = icmp ne i16 %1, %3
  %..i5.i = zext i1 %_9.i4.i to i8
  %_3.0.i6.i = select i1 %_6.i3.i, i8 -1, i8 %..i5.i
  %.0.i = select i1 %4, i8 %_3.0.i6.i, i8 %_3.0.i.i
  %5 = icmp eq i8 %.0.i, 1
  ret i1 %5
}

instead leaving in a whole bunch of unnecessary stuff.

Given the direction in #98655, we probably don't want to fix this by changing the derive.

So potential fixes could be changing our Ord::cmp implementation for integers (cc #105840), changing the default PartialOrd::le implementations (cc #106065), improved optimizations in LLVM (cc llvm/llvm-project#59666), or more.

@scottmcm scottmcm added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. I-heavy Issue: Problems and improvements with respect to binary size of generated code. labels Dec 24, 2022
@scottmcm
Copy link
Member Author

scottmcm commented Feb 17, 2023

I tried switching the cmp implementation for integers to use the same order as clang (see llvm/llvm-project#60012), which managed to fix le, but didn't fix gt or ge, so it's not obvious that it's worth doing as it's more x86 asm for a single cmp.

If you want to play with this yourself, there's a codegen test in https://github.com/rust-lang/rust/pull/108156/files#diff-9ee4982898ebe1fa1f721acc54dc662f4a51cb2e017c71f9e24286493358e020

But we might just need to wait for LLVM 17 and llvm/llvm-project#59666 (comment).

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
github-merge-queue bot pushed a commit to bevyengine/bevy that referenced this issue Nov 18, 2023
# Objective

- Follow up on #10519, diving
deeper into optimising `Entity` due to the `derive`d `PartialOrd`
`partial_cmp` not being optimal with codegen:
rust-lang/rust#106107
- Fixes #2346.

## Solution

Given the previous PR's solution and the other existing LLVM codegen
bug, there seemed to be a potential further optimisation possible with
`Entity`. In exploring providing manual `PartialOrd` impl, it turned out
initially that the resulting codegen was not immediately better than the
derived version. However, once `Entity` was given `#[repr(align(8)]`,
the codegen improved remarkably, even more once the fields in `Entity`
were rearranged to correspond to a `u64` layout (Rust doesn't
automatically reorder fields correctly it seems). The field order and
`align(8)` additions also improved `to_bits` codegen to be a single
`mov` op. In turn, this led me to replace the previous
"non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits`
comparison.

The result was remarkably better codegen across the board, even for
hastable lookups.

The current baseline codegen is as follows:
https://godbolt.org/z/zTW1h8PnY

Assuming the following example struct that mirrors with the existing
`Entity` definition:

```rust
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct FakeU64 {
    high: u32,
    low: u32,
}
```

the output for `to_bits` is as follows:

```
example::FakeU64::to_bits:
        shl     rdi, 32
        mov     eax, esi
        or      rax, rdi
        ret
```

Changing the struct to:
```rust
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}
```
and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`,
`to_bits` now optimises to:
```
example::FakeU64::to_bits:
        mov     rax, rdi
        ret
```
The full codegen example for this PR is here for reference:
https://godbolt.org/z/n4Mjx165a

To highlight, `gt` comparison goes from
```
example::greater_than:
        cmp     edi, edx
        jae     .LBB3_2
        xor     eax, eax
        ret
.LBB3_2:
        setne   dl
        cmp     esi, ecx
        seta    al
        or      al, dl
        ret
```
to
```
example::greater_than:
        cmp     rdi, rsi
        seta    al
        ret
```

As explained on Discord by @scottmcm :

>The root issue here, as far as I understand it, is that LLVM's
middle-end is inexplicably unwilling to merge loads if that would make
them under-aligned. It leaves that entirely up to its target-specific
back-end, and thus a bunch of the things that you'd expect it to do that
would fix this just don't happen.

## Benchmarks

Before discussing benchmarks, everything was tested on the following
specs:

AMD Ryzen 7950X 16C/32T CPU
64GB 5200 RAM
AMD RX7900XT 20GB Gfx card
Manjaro KDE on Wayland

I made use of the new entity hashing benchmarks to see how this PR would
improve things there. With the changes in place, I first did an
implementation keeping the existing "non shortcircuit" `PartialEq`
implementation in place, but with the alignment and field ordering
changes, which in the benchmark is the `ord_shortcircuit` column. The
`to_bits` `PartialEq` implementation is the `ord_to_bits` column. The
main_ord column is the current existing baseline from `main` branch.


![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261)

My machine is not super set-up for benchmarking, so some results are
within noise, but there's not just a clear improvement between the
non-shortcircuiting implementation, but even further optimisation taking
place with the `to_bits` implementation.

On my machine, a fair number of the stress tests were not showing any
difference (indicating other bottlenecks), but I was able to get a clear
difference with `many_foxes` with a fox count of 10,000:

Test with `cargo run --example many_foxes --features
bevy/trace_tracy,wayland --release -- --count 10000`:


![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3)

On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This
trace" represents the current PR's perf, while "External trace"
represents the `main` branch baseline.

## Changelog

Changed: micro-optimized Entity align and field ordering as well as
providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further.

## Migration Guide

Any `unsafe` code relying on field ordering of `Entity` or sufficiently
cursed shenanigans should change to reflect the different internal
representation and alignment requirements of `Entity`.

Co-authored-by: james7132 <contact@jamessliu.com>
Co-authored-by: NathanW <nathansward@comcast.net>
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this issue Jan 9, 2024
…gine#10558)

# Objective

- Follow up on bevyengine#10519, diving
deeper into optimising `Entity` due to the `derive`d `PartialOrd`
`partial_cmp` not being optimal with codegen:
rust-lang/rust#106107
- Fixes bevyengine#2346.

## Solution

Given the previous PR's solution and the other existing LLVM codegen
bug, there seemed to be a potential further optimisation possible with
`Entity`. In exploring providing manual `PartialOrd` impl, it turned out
initially that the resulting codegen was not immediately better than the
derived version. However, once `Entity` was given `#[repr(align(8)]`,
the codegen improved remarkably, even more once the fields in `Entity`
were rearranged to correspond to a `u64` layout (Rust doesn't
automatically reorder fields correctly it seems). The field order and
`align(8)` additions also improved `to_bits` codegen to be a single
`mov` op. In turn, this led me to replace the previous
"non-shortcircuiting" impl of `PartialEq::eq` to use direct `to_bits`
comparison.

The result was remarkably better codegen across the board, even for
hastable lookups.

The current baseline codegen is as follows:
https://godbolt.org/z/zTW1h8PnY

Assuming the following example struct that mirrors with the existing
`Entity` definition:

```rust
#[derive(Clone, Copy, Eq, PartialEq, PartialOrd, Ord)]
pub struct FakeU64 {
    high: u32,
    low: u32,
}
```

the output for `to_bits` is as follows:

```
example::FakeU64::to_bits:
        shl     rdi, 32
        mov     eax, esi
        or      rax, rdi
        ret
```

Changing the struct to:
```rust
#[derive(Clone, Copy, Eq)]
#[repr(align(8))]
pub struct FakeU64 {
    low: u32,
    high: u32,
}
```
and providing manual implementations for `PartialEq`/`PartialOrd`/`Ord`,
`to_bits` now optimises to:
```
example::FakeU64::to_bits:
        mov     rax, rdi
        ret
```
The full codegen example for this PR is here for reference:
https://godbolt.org/z/n4Mjx165a

To highlight, `gt` comparison goes from
```
example::greater_than:
        cmp     edi, edx
        jae     .LBB3_2
        xor     eax, eax
        ret
.LBB3_2:
        setne   dl
        cmp     esi, ecx
        seta    al
        or      al, dl
        ret
```
to
```
example::greater_than:
        cmp     rdi, rsi
        seta    al
        ret
```

As explained on Discord by @scottmcm :

>The root issue here, as far as I understand it, is that LLVM's
middle-end is inexplicably unwilling to merge loads if that would make
them under-aligned. It leaves that entirely up to its target-specific
back-end, and thus a bunch of the things that you'd expect it to do that
would fix this just don't happen.

## Benchmarks

Before discussing benchmarks, everything was tested on the following
specs:

AMD Ryzen 7950X 16C/32T CPU
64GB 5200 RAM
AMD RX7900XT 20GB Gfx card
Manjaro KDE on Wayland

I made use of the new entity hashing benchmarks to see how this PR would
improve things there. With the changes in place, I first did an
implementation keeping the existing "non shortcircuit" `PartialEq`
implementation in place, but with the alignment and field ordering
changes, which in the benchmark is the `ord_shortcircuit` column. The
`to_bits` `PartialEq` implementation is the `ord_to_bits` column. The
main_ord column is the current existing baseline from `main` branch.


![Screenshot_20231114_132908](https://github.com/bevyengine/bevy/assets/3116268/cb9090c9-ff74-4cc5-abae-8e4561332261)

My machine is not super set-up for benchmarking, so some results are
within noise, but there's not just a clear improvement between the
non-shortcircuiting implementation, but even further optimisation taking
place with the `to_bits` implementation.

On my machine, a fair number of the stress tests were not showing any
difference (indicating other bottlenecks), but I was able to get a clear
difference with `many_foxes` with a fox count of 10,000:

Test with `cargo run --example many_foxes --features
bevy/trace_tracy,wayland --release -- --count 10000`:


![Screenshot_20231114_144217](https://github.com/bevyengine/bevy/assets/3116268/89bdc21c-7209-43c8-85ae-efbf908bfed3)

On avg, a framerate of about 28-29FPS was improved to 30-32FPS. "This
trace" represents the current PR's perf, while "External trace"
represents the `main` branch baseline.

## Changelog

Changed: micro-optimized Entity align and field ordering as well as
providing manual `PartialOrd`/`Ord` impls to help LLVM optimise further.

## Migration Guide

Any `unsafe` code relying on field ordering of `Entity` or sufficiently
cursed shenanigans should change to reflect the different internal
representation and alignment requirements of `Entity`.

Co-authored-by: james7132 <contact@jamessliu.com>
Co-authored-by: NathanW <nathansward@comcast.net>
@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Feb 14, 2025
jieyouxu added a commit to jieyouxu/rust that referenced this issue Feb 28, 2025
Update some comparison codegen tests now that they pass in LLVM20

Fixes rust-lang#106107

Needed one tweak to the default `PartialOrd::le` to get the test to pass.  Everything but the derived 2-field `le` test passes even without the change to the defaults in the trait.
jieyouxu added a commit to jieyouxu/rust that referenced this issue Feb 28, 2025
Update some comparison codegen tests now that they pass in LLVM20

Fixes rust-lang#106107

Needed one tweak to the default `PartialOrd::le` to get the test to pass.  Everything but the derived 2-field `le` test passes even without the change to the defaults in the trait.
@bors bors closed this as completed in 87cac9f Mar 1, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Mar 1, 2025
Rollup merge of rust-lang#137197 - scottmcm:cmp-20, r=ibraheemdev

Update some comparison codegen tests now that they pass in LLVM20

Fixes rust-lang#106107

Needed one tweak to the default `PartialOrd::le` to get the test to pass.  Everything but the derived 2-field `le` test passes even without the change to the defaults in the trait.
github-actions bot pushed a commit to tautschnig/verify-rust-std that referenced this issue Mar 11, 2025
Update some comparison codegen tests now that they pass in LLVM20

Fixes rust-lang#106107

Needed one tweak to the default `PartialOrd::le` to get the test to pass.  Everything but the derived 2-field `le` test passes even without the change to the defaults in the trait.
github-actions bot pushed a commit to tautschnig/verify-rust-std that referenced this issue Mar 11, 2025
Update some comparison codegen tests now that they pass in LLVM20

Fixes rust-lang#106107

Needed one tweak to the default `PartialOrd::le` to get the test to pass.  Everything but the derived 2-field `le` test passes even without the change to the defaults in the trait.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants