Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Simple cases of std::iter::Iterator::any should yield asm equivalent to for loops, but are not #43517

Closed
glandium opened this issue Jul 28, 2017 · 16 comments · Fixed by #47007
Closed
Assignees
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@glandium
Copy link
Contributor

I was looking at the generated asm for

pub fn foo(s: &str) -> bool {
    ["foo", "bar", "baz"].iter().any(|&x| s.starts_with(x))
}

to see what happens to the array, and was surprised to see a large function with 5(!) call sites for memcmp. The generated code doesn't seem to vary with the number of elements in the array.

That got me to compare with the functionally equivalent code:

pub fn foo(s: &str) -> bool {
    for x in ["foo", "bar", "baz"].iter() {
        if s.starts_with(x) { return true; }
    }
    return false;
}

which doesn't yield the same assembly at all. The first difference being that the latter doesn't even store the array (only the strs) while the former does.

That got me even further in comparing Iterator::any with equivalent for loops, up to the extreme (and stupid):

pub fn foo() -> bool {
    [1].iter().any(|&x| x == 1)
}

vs.

pub fn foo() -> bool {
    for &x in [1].into_iter() {
        if x == 1 { return true; }
    }
    return false;
}

The latter generates the simplest code possible:

_ZN10playground3foo17h1a4864584facf3ddE:
	.cfi_startproc
	movb	$1, %al
	retq

The former not so much:

_ZN10playground3foo17h488d39c1724807f0E:
	.cfi_startproc
	subq	$2, %rsp
.Lcfi0:
	.cfi_def_cfa_offset 10
	xorl	%eax, %eax
	leaq	ref.5(%rip), %rcx
	movq	%rsp, %rdx
	leaq	1(%rsp), %rsi
	.p2align	4, 0x90
.LBB0_1:
	cmpq	$4, %rax
	je	.LBB0_2
	cmpl	$1, (%rax,%rcx)
	movq	%rdx, %rdi
	jne	.LBB0_5
	movb	$1, (%rsp)
	movq	%rsi, %rdi
.LBB0_5:
	movb	$0, (%rdi)
	addq	$4, %rax
	cmpb	$0, (%rsp)
	je	.LBB0_1
	testb	$1, 1(%rsp)
	sete	%al
	jmp	.LBB0_7
.LBB0_2:
	xorl	%eax, %eax
.LBB0_7:
	addq	$2, %rsp
	retq
.Lfunc_end0:
	.size	_ZN10playground3foo17h488d39c1724807f0E, .Lfunc_end0-_ZN10playground3foo17h488d39c1724807f0E
	.cfi_endproc

	.type	ref.5,@object
	.section	.rodata.cst4,"aM",@progbits,4
	.p2align	2
ref.5:
	.long	1
	.size	ref.5, 4
@steveklabnik steveklabnik added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jul 28, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 28, 2017
@jthemphill
Copy link

Documenting my thinking in the hopes that it will be useful, or at least that someone will correct me via Campbell's Law.

We know that the ASM outputs are different, so let's go back a few steps until we can find where in the toolchain we want to add our improvement.

Step 1: the code

pub mod slow_iters {

    pub fn iter_any() -> bool {
        [1].iter().any(|&x| x == 1)
    }

    pub fn for_loop() -> bool {
        for &x in [1].into_iter() {
            if x == 1 { return true; }
        }
        return false;
    }

}

I compiled this on 1.19 with rustc --crate-type lib --emit=asm,llvm-bc,mir -C opt-level=3

Step 2: LLVM bytecode?

Full LLVM bytecode here, but the important thing to note is that while iter_any is quite bloated, it looks like the for loop has already been optimized:

; Function Attrs: nounwind readonly uwtable
define zeroext i1 @_ZN10slow_iters5tests8for_loop17hf41fef8312c32c27E() unnamed_addr #0 {
bb9:
  ret i1 true
}

So for_loop is just as trivial in LLVM bytecode as it is in assembler - whatever optimizations we needed to do have already happened. We need to go back farther.

Step 3: MIR?

MIR output is here. Notice that the tables have turned! iter_any is more concise than for_loop, so the code we need to add should change the MIR output. Note that iter_any calls into std::iter::Iterator::any on line 187. So the problem here seems to be that we aren't inlining std::iter::Iterator::any before we generate LLVM bytecode.

But that answer doesn't satisfy me either, because neither the LLVM bytecode nor the ASM we generated has any calls to std::iter::Iterator, and in fact std::iter::Iterator::any is marked #[inline].

So... one of these things is happening:

  1. We're failing to inline std::iter::Iterator::any, and we haven't noticed because LLVM's been inlining this code for us. LLVM doesn't have as much knowledge as we do, so it can't do as good a job as we theoretically could.
  2. I've been looking at unoptimized MIR this whole time. This hypothesis seems to be borne out by the fact that when I rerun rustc without optimizations, I get exactly the same MIR output. But doesn't MIR optimization occur in phase 3, and don't we emit MIR after that phase?

It's possible that both of these hypotheses are correct; maybe this is unoptimized MIR, because the MIR optimizations are doing absolutely nothing to this file! I'm still cloning rustc, but once I do I'll comment out the MIR optimizations and see where that gets us.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Jul 29, 2017

or at least that someone will correct me

Well, I can oblige that in one aspect :) You put far too much hope into MIR optimizations. They do very little right now—inlining, for example, isn't even enabled by default (only with -C mir-opt-level=3). Even in the future, when all the MIR optimizations we've ever wanted are implemented, they will only cover a small subset of what LLVM does (there's no point in duplicating all the work that went into LLVM, and that would be impractical anyway). This code looks sufficiently messy that I don't expect MIR optimizations to ever fully optimize it away. For example, that would probably require loop unrolling, which hasn't ever been proposed AFAIK (and if it was, I'd be skeptical of it).

In any case, the more interesting question is why LLVM can't optimize the iterator version as well as the alternative. Whatever goes awry in optimizing this, it happens in one of LLVM's many passes.

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Jul 29, 2017

The LLVM IR you posted is very weird. It has multiple branches on constants, and presumably dead code consequently. There's also several branches on undef values?!? These should all have been cleaned up.

After running it through opt -O3 again, I get this IR (names are all messed up cause I did it on gcc.godbolt.org, which insists on applying C++ name demangling):

define zeroext i1 @slow_iters::tests::iter_any::h9958878595c8e553() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %::sroa.0.i.i.i.i = alloca i8, align 1
  %::sroa.3.i.i.i.i = alloca i8, align 1
  call void @llvm.lifetime.start.p0i8(i64 1, i8* nonnull %::sroa.0.i.i.i.i)
  call void @llvm.lifetime.start.p0i8(i64 1, i8* nonnull %::sroa.3.i.i.i.i)
  store i8 1, i8* %::sroa.0.i.i.i.i, align 1
  store i8 0, i8* %::sroa.3.i.i.i.i, align 1
  %::sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.0._0.sroa.0.i.i.0._0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i.i.i.i = load i8, i8* %::sroa.0.i.i.i.i, align 1
  %::sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.0._0.sroa.3.i.i.0._0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i.i.i.i = load i8, i8* %::sroa.3.i.i.i.i, align 1
  call void @llvm.lifetime.end.p0i8(i64 1, i8* nonnull %::sroa.0.i.i.i.i)
  call void @llvm.lifetime.end.p0i8(i64 1, i8* nonnull %::sroa.3.i.i.i.i)
  %cond.i.i.i = icmp eq i8 %::sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.i.0._0.sroa.0.i.i.i.0._0.sroa.0.i.i.0._0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i.i.i.i, 0
  %0 = and i8 %::sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.i.0._0.sroa.3.i.i.i.0._0.sroa.3.i.i.0._0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i.i.i.i, 1
  %phitmp8 = icmp eq i8 %0, 0
  %::0.i.i.i = select i1 %cond.i.i.i, i1 false, i1 %phitmp8
  ret i1 %::0.i.i.i
}

That's... less code, but still very stupid. Why are the adjacent loads and stores not folded away?!

A third pass (just -O1 at this point) gets rid of them, constant folds the rest, and winds up with ret i1 true. Looks like a nasty phase ordering issue. But what about this IR confuses LLVM so much?

@jthemphill
Copy link

Neat discovery. Should we move this discussion to LLVM's bug tracker and close this task, then?

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Jul 30, 2017

Not necessarily, no. In the past people have tweaked the definitions of iterator methods to help the optimizer. rustc might also be able to help by changing the pass pipeline or by generating better/different IR in the first place.

@arielb1
Copy link
Contributor

arielb1 commented Jul 30, 2017

This did optimize out in LLVM 3.9, and was broken by LLVM 4.0.

@arielb1
Copy link
Contributor

arielb1 commented Jul 30, 2017

In more detail, LLVM 3.9 generates the following small code for search_while (noalias metadata stripped - it doesn't matter, with this function substituted in both 3.9 and 4.0 manage to optimize everything out):


define zeroext i1 @"_ZN49_$LT$core..slice..Iter$LT$$u27$a$C$$u20$T$GT$$GT$12search_while_good17hc287fe675e01e0b8E"(%"core::slice::Iter<i32>"* nocapture dereferenceable(16), i1 zeroext) unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %2 = getelementptr inbounds %"core::slice::Iter<i32>", %"core::slice::Iter<i32>"* %0, i64 0, i32 0
  %3 = getelementptr inbounds %"core::slice::Iter<i32>", %"core::slice::Iter<i32>"* %0, i64 0, i32 2
  %4 = bitcast i32** %3 to i64*
  %.pre = load i32*, i32** %2, align 8
  %.pre147 = load i64, i64* %4, align 8
  %5 = inttoptr i64 %.pre147 to i32*
  br label %bb6

bb6:                                              ; preds = %bb56, %start
  %6 = phi i32* [ %17, %bb56 ], [ %.pre, %start ]
  %7 = ptrtoint i32* %6 to i64
  %8 = sub i64 %.pre147, %7
  %9 = sdiv i64 %8, 4
  %10 = icmp ugt i64 %9, 3
  br i1 %10, label %bb15, label %bb61.preheader

bb61.preheader:                                   ; preds = %bb6
  br label %bb61

bb15:                                             ; preds = %bb6
  %11 = getelementptr inbounds i32, i32* %6, i64 1
  store i32* %11, i32** %2, align 8
  %12 = load i32, i32* %6, align 4
  %not..i140 = icmp eq i32 %12, 1
  br i1 %not..i140, label %bb19.loopexit152, label %bb32

bb19.loopexit:                                    ; preds = %bb71, %bb61
  %_0.0.ph = phi i1 [ %1, %bb61 ], [ false, %bb71 ]
  br label %bb19

bb19.loopexit152:                                 ; preds = %bb15, %bb32, %bb44, %bb56
  br label %bb19

bb19:                                             ; preds = %bb19.loopexit152, %bb19.loopexit
  %_0.0 = phi i1 [ %_0.0.ph, %bb19.loopexit ], [ false, %bb19.loopexit152 ]
  ret i1 %_0.0

bb32:                                             ; preds = %bb15
  %13 = getelementptr inbounds i32, i32* %6, i64 2
  store i32* %13, i32** %2, align 8
  %14 = load i32, i32* %11, align 4
  %not..i138 = icmp eq i32 %14, 1
  br i1 %not..i138, label %bb19.loopexit152, label %bb44

bb44:                                             ; preds = %bb32
  %15 = getelementptr inbounds i32, i32* %6, i64 3
  store i32* %15, i32** %2, align 8
  %16 = load i32, i32* %13, align 4
  %not..i136 = icmp eq i32 %16, 1
  br i1 %not..i136, label %bb19.loopexit152, label %bb56

bb56:                                             ; preds = %bb44
  %17 = getelementptr inbounds i32, i32* %6, i64 4
  store i32* %17, i32** %2, align 8
  %18 = load i32, i32* %15, align 4
  %not..i134 = icmp eq i32 %18, 1
  br i1 %not..i134, label %bb19.loopexit152, label %bb6

bb61:                                             ; preds = %bb61.preheader, %bb71
  %19 = phi i32* [ %21, %bb71 ], [ %6, %bb61.preheader ]
  %20 = icmp eq i32* %19, %5
  br i1 %20, label %bb19.loopexit, label %bb71

bb71:                                             ; preds = %bb61
  %21 = getelementptr inbounds i32, i32* %19, i64 1
  store i32* %21, i32** %2, align 8
  %22 = load i32, i32* %19, align 4
  %not..i = icmp eq i32 %22, 1
  br i1 %not..i, label %bb19.loopexit, label %bb61
}

While LLVM 4.0 manages to screw up and generate this monstrosity:

define zeroext i1 @"_ZN49_$LT$core..slice..Iter$LT$$u27$a$C$$u20$T$GT$$GT$12search_while_bad17hc287fe675e01e0b8E"(%"core::slice::Iter<i32>"* nocapture dereferenceable(16), i1 zeroext) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %_0.sroa.0.i164 = alloca i8, align 1
  %_0.sroa.3.i165 = alloca i8, align 1
  %_0.sroa.0.i153 = alloca i8, align 1
  %_0.sroa.3.i154 = alloca i8, align 1
  %_0.sroa.0.i142 = alloca i8, align 1
  %_0.sroa.3.i143 = alloca i8, align 1
  %_0.sroa.0.i131 = alloca i8, align 1
  %_0.sroa.3.i132 = alloca i8, align 1
  %_0.sroa.0.i = alloca i8, align 1
  %_0.sroa.3.i = alloca i8, align 1
  %2 = getelementptr inbounds %"core::slice::Iter<i32>", %"core::slice::Iter<i32>"* %0, i64 0, i32 0
  %3 = getelementptr inbounds %"core::slice::Iter<i32>", %"core::slice::Iter<i32>"* %0, i64 0, i32 2
  %4 = bitcast i32** %3 to i64*
  %.pre = load i32*, i32** %2, align 8
  %.pre180 = load i64, i64* %4, align 8
  %5 = inttoptr i64 %.pre180 to i32*
  br label %bb6

bb6:                                              ; preds = %start, %bb56
  %6 = phi i32* [ %.pre, %start ], [ %22, %bb56 ]
  %7 = ptrtoint i32* %6 to i64
  %8 = sub i64 %.pre180, %7
  %9 = sdiv i64 %8, 4
  %10 = icmp ugt i64 %9, 3
  br i1 %10, label %bb14, label %bb61.preheader

bb61.preheader:                                   ; preds = %bb6
  br label %bb61

bb14:                                             ; preds = %bb6
  %11 = getelementptr inbounds i32, i32* %6, i64 1
  store i32* %11, i32** %2, align 8
  %12 = icmp ne i32* %6, null
  tail call void @llvm.assume(i1 %12)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.0.i164)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.3.i165)
  %13 = load i32, i32* %6, align 4
  %14 = icmp eq i32 %13, 1
  br i1 %14, label %bb3.i166, label %bb15

bb3.i166:                                         ; preds = %bb14
  store i8 1, i8* %_0.sroa.0.i164, align 1
  br label %bb15

bb15:                                             ; preds = %bb3.i166, %bb14
  %.sink.i167 = phi i8* [ %_0.sroa.3.i165, %bb3.i166 ], [ %_0.sroa.0.i164, %bb14 ]
  store i8 0, i8* %.sink.i167, align 1
  %_0.sroa.0.i164.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i168 = load i8, i8* %_0.sroa.0.i164, align 1
  %_0.sroa.3.i165.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i169 = load i8, i8* %_0.sroa.3.i165, align 1
  %_0.sroa.3.0.insert.ext.i170 = zext i8 %_0.sroa.3.i165.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i169 to i16
  %_0.sroa.3.0.insert.shift.i171 = shl nuw i16 %_0.sroa.3.0.insert.ext.i170, 8
  %_0.sroa.0.0.insert.ext.i172 = zext i8 %_0.sroa.0.i164.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i168 to i16
  %_0.sroa.0.0.insert.insert.i173 = or i16 %_0.sroa.3.0.insert.shift.i171, %_0.sroa.0.0.insert.ext.i172
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.0.i164)
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.3.i165)
  %cond2 = icmp eq i8 %_0.sroa.0.i164.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i168, 0
  br i1 %cond2, label %bb31, label %bb22.loopexit185

bb19:                                             ; preds = %bb62, %bb22
  %_0.0 = phi i8 [ %_0.1, %bb22 ], [ %27, %bb62 ]
  %15 = icmp ne i8 %_0.0, 0
  ret i1 %15

bb22.loopexit:                                    ; preds = %bb71
  %_0.sroa.3.0.insert.ext.i.le = zext i8 %_0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i to i16
  %_0.sroa.3.0.insert.shift.i.le = shl nuw i16 %_0.sroa.3.0.insert.ext.i.le, 8
  %_0.sroa.0.0.insert.ext.i.le = zext i8 %_0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i to i16
  %_0.sroa.0.0.insert.insert.i.le = or i16 %_0.sroa.3.0.insert.shift.i.le, %_0.sroa.0.0.insert.ext.i.le
  br label %bb22

bb22.loopexit185:                                 ; preds = %bb15, %bb32, %bb44, %bb56
  %_0.1.in.in.in.ph = phi i16 [ %_0.sroa.0.0.insert.insert.i140, %bb56 ], [ %_0.sroa.0.0.insert.insert.i151, %bb44 ], [ %_0.sroa.0.0.insert.insert.i162, %bb32 ], [ %_0.sroa.0.0.insert.insert.i173, %bb15 ]
  br label %bb22

bb22:                                             ; preds = %bb22.loopexit185, %bb22.loopexit
  %_0.1.in.in.in = phi i16 [ %_0.sroa.0.0.insert.insert.i.le, %bb22.loopexit ], [ %_0.1.in.in.in.ph, %bb22.loopexit185 ]
  %_0.1.in.in = lshr i16 %_0.1.in.in.in, 8
  %_0.1.in = trunc i16 %_0.1.in.in to i8
  %_0.1 = and i8 %_0.1.in, 1
  br label %bb19

bb31:                                             ; preds = %bb15
  %16 = getelementptr inbounds i32, i32* %6, i64 2
  store i32* %16, i32** %2, align 8
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.0.i153)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.3.i154)
  %17 = load i32, i32* %11, align 4
  %18 = icmp eq i32 %17, 1
  br i1 %18, label %bb3.i155, label %bb32

bb3.i155:                                         ; preds = %bb31
  store i8 1, i8* %_0.sroa.0.i153, align 1
  br label %bb32

bb32:                                             ; preds = %bb3.i155, %bb31
  %.sink.i156 = phi i8* [ %_0.sroa.3.i154, %bb3.i155 ], [ %_0.sroa.0.i153, %bb31 ]
  store i8 0, i8* %.sink.i156, align 1
  %_0.sroa.0.i153.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i157 = load i8, i8* %_0.sroa.0.i153, align 1
  %_0.sroa.3.i154.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i158 = load i8, i8* %_0.sroa.3.i154, align 1
  %_0.sroa.3.0.insert.ext.i159 = zext i8 %_0.sroa.3.i154.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i158 to i16
  %_0.sroa.3.0.insert.shift.i160 = shl nuw i16 %_0.sroa.3.0.insert.ext.i159, 8
  %_0.sroa.0.0.insert.ext.i161 = zext i8 %_0.sroa.0.i153.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i157 to i16
  %_0.sroa.0.0.insert.insert.i162 = or i16 %_0.sroa.3.0.insert.shift.i160, %_0.sroa.0.0.insert.ext.i161
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.0.i153)
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.3.i154)
  %cond4 = icmp eq i8 %_0.sroa.0.i153.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i157, 0
  br i1 %cond4, label %bb43, label %bb22.loopexit185

bb43:                                             ; preds = %bb32
  %19 = getelementptr inbounds i32, i32* %6, i64 3
  store i32* %19, i32** %2, align 8
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.0.i142)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.3.i143)
  %20 = load i32, i32* %16, align 4
  %21 = icmp eq i32 %20, 1
  br i1 %21, label %bb3.i144, label %bb44

bb3.i144:                                         ; preds = %bb43
  store i8 1, i8* %_0.sroa.0.i142, align 1
  br label %bb44

bb44:                                             ; preds = %bb3.i144, %bb43
  %.sink.i145 = phi i8* [ %_0.sroa.3.i143, %bb3.i144 ], [ %_0.sroa.0.i142, %bb43 ]
  store i8 0, i8* %.sink.i145, align 1
  %_0.sroa.0.i142.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i146 = load i8, i8* %_0.sroa.0.i142, align 1
  %_0.sroa.3.i143.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i147 = load i8, i8* %_0.sroa.3.i143, align 1
  %_0.sroa.3.0.insert.ext.i148 = zext i8 %_0.sroa.3.i143.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i147 to i16
  %_0.sroa.3.0.insert.shift.i149 = shl nuw i16 %_0.sroa.3.0.insert.ext.i148, 8
  %_0.sroa.0.0.insert.ext.i150 = zext i8 %_0.sroa.0.i142.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i146 to i16
  %_0.sroa.0.0.insert.insert.i151 = or i16 %_0.sroa.3.0.insert.shift.i149, %_0.sroa.0.0.insert.ext.i150
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.0.i142)
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.3.i143)
  %cond6 = icmp eq i8 %_0.sroa.0.i142.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i146, 0
  br i1 %cond6, label %bb55, label %bb22.loopexit185

bb55:                                             ; preds = %bb44
  %22 = getelementptr inbounds i32, i32* %6, i64 4
  store i32* %22, i32** %2, align 8
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.0.i131)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.3.i132)
  %23 = load i32, i32* %19, align 4
  %24 = icmp eq i32 %23, 1
  br i1 %24, label %bb3.i133, label %bb56

bb3.i133:                                         ; preds = %bb55
  store i8 1, i8* %_0.sroa.0.i131, align 1
  br label %bb56

bb56:                                             ; preds = %bb3.i133, %bb55
  %.sink.i134 = phi i8* [ %_0.sroa.3.i132, %bb3.i133 ], [ %_0.sroa.0.i131, %bb55 ]
  store i8 0, i8* %.sink.i134, align 1
  %_0.sroa.0.i131.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i135 = load i8, i8* %_0.sroa.0.i131, align 1
  %_0.sroa.3.i132.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i136 = load i8, i8* %_0.sroa.3.i132, align 1
  %_0.sroa.3.0.insert.ext.i137 = zext i8 %_0.sroa.3.i132.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i136 to i16
  %_0.sroa.3.0.insert.shift.i138 = shl nuw i16 %_0.sroa.3.0.insert.ext.i137, 8
  %_0.sroa.0.0.insert.ext.i139 = zext i8 %_0.sroa.0.i131.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i135 to i16
  %_0.sroa.0.0.insert.insert.i140 = or i16 %_0.sroa.3.0.insert.shift.i138, %_0.sroa.0.0.insert.ext.i139
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.0.i131)
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.3.i132)
  %cond8 = icmp eq i8 %_0.sroa.0.i131.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i135, 0
  br i1 %cond8, label %bb6, label %bb22.loopexit185

bb61:                                             ; preds = %bb61.preheader, %bb71
  %25 = phi i32* [ %28, %bb71 ], [ %6, %bb61.preheader ]
  %26 = icmp eq i32* %25, %5
  br i1 %26, label %bb62, label %bb70

bb62:                                             ; preds = %bb61
  %27 = zext i1 %1 to i8
  br label %bb19

bb70:                                             ; preds = %bb61
  %28 = getelementptr inbounds i32, i32* %25, i64 1
  store i32* %28, i32** %2, align 8
  %29 = icmp ne i32* %25, null
  tail call void @llvm.assume(i1 %29)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.0.i)
  call void @llvm.lifetime.start(i64 1, i8* nonnull %_0.sroa.3.i)
  %30 = load i32, i32* %25, align 4
  %31 = icmp eq i32 %30, 1
  br i1 %31, label %bb3.i, label %bb71

bb3.i:                                            ; preds = %bb70
  store i8 1, i8* %_0.sroa.0.i, align 1
  br label %bb71

bb71:                                             ; preds = %bb3.i, %bb70
  %.sink.i = phi i8* [ %_0.sroa.3.i, %bb3.i ], [ %_0.sroa.0.i, %bb70 ]
  store i8 0, i8* %.sink.i, align 1
  %_0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i = load i8, i8* %_0.sroa.0.i, align 1
  %_0.sroa.3.i.0._0.sroa.3.0._0.sroa.3.0._0.sroa.3.0..i = load i8, i8* %_0.sroa.3.i, align 1
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.0.i)
  call void @llvm.lifetime.end(i64 1, i8* nonnull %_0.sroa.3.i)
  %cond = icmp eq i8 %_0.sroa.0.i.0._0.sroa.0.0._0.sroa.0.0._0.sroa.0.0..i, 0
  br i1 %cond, label %bb61, label %bb22.loopexit
}

I think the %.sink.i145 somehow prevents SROA.

@arielb1
Copy link
Contributor

arielb1 commented Jul 30, 2017

Going closer to the root:

LLVM 3.9 optimizes the closure in all to this:


define i16 @"_ZN91_$LT$core..slice..Iter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$3all28_$u7b$$u7b$closure$u7d$$u7d$17h5b298b7604579d38E"(%closure* nocapture readnone, i32* noalias nocapture readonly dereferenceable(4)) unnamed_addr #0 {
start:
  %2 = load i32, i32* %1, align 4
  %not. = icmp eq i32 %2, 1
  %. = zext i1 %not. to i16
  ret i16 %.
}

While LLVM 4.0 for some reason generates this:

; Function Attrs: norecurse nounwind readonly uwtable
define i16 @"_ZN91_$LT$core..slice..Iter$LT$$u27$a$C$$u20$T$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$3all_28_$u7b$$u7b$closure$u7d$$u7d$17h5b298b7604579d38E"(%closure* nocapture readnone, i32* noalias nocapture readonly dereferenceable(4)) unnamed_addr #0 {
start:
  %_0.sroa.0 = alloca i8, align 1
  %_0.sroa.3 = alloca i8, align 1
  %2 = load i32, i32* %1, align 4
  %3 = icmp eq i32 %2, 1
  br i1 %3, label %bb3, label %bb4

bb3:                                              ; preds = %start
  store i8 1, i8* %_0.sroa.0, align 1
  br label %bb4

bb4:                                              ; preds = %start, %bb3
  %.sink = phi i8* [ %_0.sroa.3, %bb3 ], [ %_0.sroa.0, %start ]
  store i8 0, i8* %.sink, align 1
  %_0.sroa.0.0._0.sroa.0.0._0.sroa.0.0. = load i8, i8* %_0.sroa.0, align 1
  %_0.sroa.3.0._0.sroa.3.0._0.sroa.3.0. = load i8, i8* %_0.sroa.3, align 1
  %_0.sroa.3.0.insert.ext = zext i8 %_0.sroa.3.0._0.sroa.3.0._0.sroa.3.0. to i16
  %_0.sroa.3.0.insert.shift = shl nuw i16 %_0.sroa.3.0.insert.ext, 8
  %_0.sroa.0.0.insert.ext = zext i8 %_0.sroa.0.0._0.sroa.0.0._0.sroa.0.0. to i16
  %_0.sroa.0.0.insert.insert = or i16 %_0.sroa.3.0.insert.shift, %_0.sroa.0.0.insert.ext
  ret i16 %_0.sroa.0.0.insert.insert
}

@arielb1 arielb1 self-assigned this Jul 30, 2017
@arielb1
Copy link
Contributor

arielb1 commented Jul 30, 2017

caused by https://bugs.llvm.org/show_bug.cgi?id=30188

@bluss
Copy link
Member

bluss commented Sep 24, 2017

The initial nonequivalence in the filing of the bug report is due to the slice iterator's explicit manual unrolling by 4 in its implementation of any. That might seem bad but as soon as llvm learns to loop optimize a loop with two exits (either the element was found or the iterator reached the end), the explicit unrolling can be removed.

bors added a commit that referenced this issue Dec 27, 2017
rustc: don't use union layouts for tagged union enums.

Fixes #46897, fixes #43517 (AFAICT from the testcases).
This PR doesn't add any testcases, we should try to at least get perf ones (cc @Mark-Simulacrum).
I couldn't find an example in those issues where the choice of LLVM array vs struct (with N identical fields) for padding filler types is still needed, *on top of* this change, to prevent excessive LLVM sinking.

r? @arielb1
@bstrie
Copy link
Contributor

bstrie commented Jan 3, 2018

@eddyb is the PR that fixed this intended to be a permanent solution, or is it just a workaround? IOW, should we open a new bug to track reevaluating this fix when LLVM 30188 is resolved?

@eddyb
Copy link
Member

eddyb commented Jan 3, 2018

@bstrie Hmmm, @arielb1 mentioned this bug, but you're right, this is older so me "fixing it" is probably more coincidental than anything else.

@eddyb eddyb reopened this Jan 3, 2018
@workingjubilee
Copy link
Member

That LLVM bug is now tracked at llvm/llvm-project#29546 and appears to have had no motion on it in the past 5 years.

Assembly in the initial comparison cases remains identical per https://rust.godbolt.org/z/rWYesEaxq
No other changes, as far as I can tell.

@workingjubilee
Copy link
Member

This was nominally resolved on our end, but was left open in the possibility that the LLVM issue was going to be resolved, so as to remove the "temporary" workaround. However, if we wish to track LLVM-side bugs that are not currently an impact on us, but may allow upgrades later, then I believe we may wish to consider tracking them in another form than an issue on the GH tracker, as they are "not an issue", so to speak. Alternatively, if we want to track them on this tracker, we may wish to track them in a more organized fashion (e.g. a metabug).

I am going to close this issue.
Do feel free to reopen this if it regresses again, or it becomes otherwise currently actionable, or if there is a particular reason we must have an open issue on our end to track this LLVM issue for potential improvements as opposed to any of the other 18308 currently-open LLVM issues we could be potentially waiting on in order to make an improvement.

@nikic
Copy link
Contributor

nikic commented Aug 3, 2022

The LLVM issue has long since been fixed, so it's possible that the workaround may be removed.

@scottmcm
Copy link
Member

scottmcm commented Aug 3, 2022

I think any workarounds here are long gone, since any moved to internal iteration way back in https://github.com/rust-lang/rust/pull/45595/files#diff-0979f647240ba80c7d777de5ccaf94c7cacfe91985a5e2ef7a020dca053c5a8eR1573

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.