-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Simple cases of std::iter::Iterator::any should yield asm equivalent to for loops, but are not #43517
Comments
Documenting my thinking in the hopes that it will be useful, or at least that someone will correct me via Campbell's Law. We know that the ASM outputs are different, so let's go back a few steps until we can find where in the toolchain we want to add our improvement. Step 1: the code
I compiled this on 1.19 with Step 2: LLVM bytecode?Full LLVM bytecode here, but the important thing to note is that while iter_any is quite bloated, it looks like the for loop has already been optimized:
So Step 3: MIR?MIR output is here. Notice that the tables have turned! But that answer doesn't satisfy me either, because neither the LLVM bytecode nor the ASM we generated has any calls to So... one of these things is happening:
It's possible that both of these hypotheses are correct; maybe this is unoptimized MIR, because the MIR optimizations are doing absolutely nothing to this file! I'm still cloning rustc, but once I do I'll comment out the MIR optimizations and see where that gets us. |
Well, I can oblige that in one aspect :) You put far too much hope into MIR optimizations. They do very little right now—inlining, for example, isn't even enabled by default (only with In any case, the more interesting question is why LLVM can't optimize the iterator version as well as the alternative. Whatever goes awry in optimizing this, it happens in one of LLVM's many passes. |
The LLVM IR you posted is very weird. It has multiple branches on constants, and presumably dead code consequently. There's also several branches on After running it through define zeroext i1 @slow_iters::tests::iter_any::h9958878595c8e553() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%::sroa.0.i.i.i.i = alloca i8, align 1
%::sroa.3.i.i.i.i = alloca i8, align 1
call void @llvm.lifetime.start.p0i8(i64 1, i8* nonnull %::sroa.0.i.i.i.i)
call void @llvm.lifetime.start.p0i8(i64 1, i8* nonnull %::sroa.3.i.i.i.i)
store i8 1, i8* %::sroa.0.i.i.i.i, align 1
store i8 0, i8* %::sroa.3.i.i.i.i, align 1
%::sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.0._0.sroa.0.i.i.0._0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i.i.i.i = load i8, i8* %::sroa.0.i.i.i.i, align 1
%::sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.0._0.sroa.3.i.i.0._0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i.i.i.i = load i8, i8* %::sroa.3.i.i.i.i, align 1
call void @llvm.lifetime.end.p0i8(i64 1, i8* nonnull %::sroa.0.i.i.i.i)
call void @llvm.lifetime.end.p0i8(i64 1, i8* nonnull %::sroa.3.i.i.i.i)
%cond.i.i.i = icmp eq i8 %::sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.0._0.sroa.0.i.i.0._0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i.i.i.i, 0
%0 = and i8 %::sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.0._0.sroa.3.i.i.0._0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i.i.i.i, 1
%phitmp8 = icmp eq i8 %0, 0
%::0.i.i.i = select i1 %cond.i.i.i, i1 false, i1 %phitmp8
ret i1 %::0.i.i.i
} That's... less code, but still very stupid. Why are the adjacent loads and stores not folded away?! A third pass (just |
Neat discovery. Should we move this discussion to LLVM's bug tracker and close this task, then? |
Not necessarily, no. In the past people have tweaked the definitions of iterator methods to help the optimizer. rustc might also be able to help by changing the pass pipeline or by generating better/different IR in the first place. |
This did optimize out in LLVM 3.9, and was broken by LLVM 4.0. |
In more detail, LLVM 3.9 generates the following small code for
While LLVM 4.0 manages to screw up and generate this monstrosity:
I think the |
Going closer to the root: LLVM 3.9 optimizes the closure in
While LLVM 4.0 for some reason generates this:
|
The initial nonequivalence in the filing of the bug report is due to the slice iterator's explicit manual unrolling by 4 in its implementation of any. That might seem bad but as soon as llvm learns to loop optimize a loop with two exits (either the element was found or the iterator reached the end), the explicit unrolling can be removed. |
rustc: don't use union layouts for tagged union enums. Fixes #46897, fixes #43517 (AFAICT from the testcases). This PR doesn't add any testcases, we should try to at least get perf ones (cc @Mark-Simulacrum). I couldn't find an example in those issues where the choice of LLVM array vs struct (with N identical fields) for padding filler types is still needed, *on top of* this change, to prevent excessive LLVM sinking. r? @arielb1
@eddyb is the PR that fixed this intended to be a permanent solution, or is it just a workaround? IOW, should we open a new bug to track reevaluating this fix when LLVM 30188 is resolved? |
That LLVM bug is now tracked at llvm/llvm-project#29546 and appears to have had no motion on it in the past 5 years. Assembly in the initial comparison cases remains identical per https://rust.godbolt.org/z/rWYesEaxq |
This was nominally resolved on our end, but was left open in the possibility that the LLVM issue was going to be resolved, so as to remove the "temporary" workaround. However, if we wish to track LLVM-side bugs that are not currently an impact on us, but may allow upgrades later, then I believe we may wish to consider tracking them in another form than an issue on the GH tracker, as they are "not an issue", so to speak. Alternatively, if we want to track them on this tracker, we may wish to track them in a more organized fashion (e.g. a metabug). I am going to close this issue. |
The LLVM issue has long since been fixed, so it's possible that the workaround may be removed. |
I think any workarounds here are long gone, since |
I was looking at the generated asm for
to see what happens to the array, and was surprised to see a large function with 5(!) call sites for
memcmp
. The generated code doesn't seem to vary with the number of elements in the array.That got me to compare with the functionally equivalent code:
which doesn't yield the same assembly at all. The first difference being that the latter doesn't even store the array (only the strs) while the former does.
That got me even further in comparing
Iterator::any
with equivalentfor
loops, up to the extreme (and stupid):vs.
The latter generates the simplest code possible:
The former not so much:
The text was updated successfully, but these errors were encountered: