StepBy<_, Range<_>> optimises poorly #31155

huonw · 2016-01-24T01:01:04Z

There's a lot going on inside StepBy<_, Range<_>>'s Iterator implementation and LLVM does a reasonable job of cutting things down, but doesn't get all the way (definition inlined for context if it changes in future, and to allow easy experimentation):

#![crate_type = "lib"]

use std::mem;

struct R {
    start: i32,
    end: i32,
    step_by: i32,
}

impl Iterator for R
{
    type Item = i32;

    fn next(&mut self) -> Option<i32> {
        let rev = self.step_by < 0;
        if (rev && self.start > self.end) ||
           (!rev && self.start < self.end)
        {
            match self.start.checked_add(self.step_by) { // inlined & stable version of <i32 as std::iter::Step>::step
                Some(mut n) => {
                    mem::swap(&mut self.start, &mut n);
                    Some(n)
                },
                None => {
                    let mut n = self.end.clone();
                    mem::swap(&mut self.start, &mut n);
                    Some(n)
                }
            }
        } else {
            None
        }
    }
}


pub fn iterator() -> usize {
    R { start: 0, end: 100000, step_by: 2 }.count()
}

pub fn while_() -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < 100000 {
        count += 1;
        i += 2;
    }
    count
}

Optimised asm (it'd be great for the first to be like the second):

_ZN8iterator20h37c73f52cfd206d4LbaE:
    .cfi_startproc
    xorl    %eax, %eax
    movl    $100000, %ecx
    xorl    %edx, %edx
    .align  16, 0x90
.LBB1_1:
    addl    $2, %edx
    cmovol  %ecx, %edx
    incq    %rax
    cmpl    $100000, %edx
    jl  .LBB1_1
    retq

_ZN6while_20h2b6d290fcc3d36d0UbaE:
    .cfi_startproc
    movl    $50000, %eax
    retq

https://play.rust-lang.org/?gist=a926869a4cf59d6683c4
#24660 previously had a somewhat similar problem, although this one is compounded by using checked_add implemented in terms of LLVM's overflow intrinsics, which the LLVM performance tips explicitly recommend against:

Avoid using arithmetic intrinsics unless you are required by your source language specification to emit a particular code sequence. The optimizer is quite good at reasoning about general control flow and arithmetic, it is not anywhere near as strong at reasoning about the various intrinsics. If profitable for code generation purposes, the optimizer will likely form the intrinsics itself late in the optimization pipeline. It is very rarely profitable to emit these directly in the language frontend. This item explicitly includes the use of the overflow intrinsics.

The text was updated successfully, but these errors were encountered:

bluss · 2016-01-24T13:21:47Z

Here's my ideal step by: https://play.rust-lang.org/?gist=f967691b33eef2f2de70&version=stable

For example, the boundary checks in the loop are eliminated. The idea is simply to do exactly what a for loop in C would do.

dirk · 2016-01-24T20:32:00Z

I'm guessing this is very likely a case of the @llvm.sadd.with.overflow.i32 intrinsic preventing the LLVM optimizer from being able to do proper constant folding, see this playground for an example. Check the LLVM IR in release mode, look for the function @_ZN8iterator... near the bottom when commenting/uncommenting lines 23 and 28. You'll see that with an unchecked addition it's able to fully constant fold the entire expression, but when the checked addition intrinsic is used it's unable to constant fold the expression.

ranma42 · 2016-01-24T21:40:49Z

It looks like it might be possible to get rid of checked_add by computing an appropriate range end. This would also be needed to fix a bug causing infinite repetition in StepBy<_, RangeFrom<_>>.
Although the example states that (0u8..).step_by(2) only iterates on all even u8 values, it will actually do so an infinite number of times: http://is.gd/EaJ6ur

bluss · 2016-01-24T22:37:49Z

I think we decided the user has to ensure that they don't take too many elements from a 0.. iterator. It overflows with debug assertions.

bluss · 2016-01-24T22:38:37Z

i.e. behavior past the overflow point is not part of the interface.

huonw · 2016-01-25T04:05:31Z

@bluss hm, while it's nice to have something to compare against for the simple cases, your ideal version is unfortunately somewhat useless: we obviously want to do what a C for loop does (that's essentially what zero cost abstractions means for things like this), but the code as written it doesn't handle wrap-around properly (e.g. (250u8..255).step_by(10)), nor does it handle negative steps.

bluss · 2016-01-25T11:58:01Z

Maybe useless to replace current step_by, yes.

It's a tall order, isn't it? Do what the for loop does, Zero overhead, Check for overflow. Computing the end up front, it sounds doable though.

For the RangeFrom iterator we evidently chose to not care for overflow (overflow is user error), that could be the model here too.

huonw · 2016-01-25T12:44:31Z

I think that making (250_u8..255).step_by(10) have infinite length would be a massive footgun, almost to the point of maliciousness on the part of us library designers. At least with RangeFrom, step_by is just mirroring the behaviour of the underlying iterator.

It seems to me that it makes most sense for some_iter.step_by(n) to be equivalent to something like

let mut count = 0;
some_iter.filter_map(|x| { if count == 0 { count = n - 1; Some(x) } else { count -= 1; None })

(So, yes, some scheme that computes the appropriate upper bound seems like it may be the best plan of attack.)

bluss · 2016-01-25T12:57:13Z

I was against having 0.. not check for overflow, but it is a solution that leads to zero overhead. We already have the 0u8.. footgun now.

But I don't agree that it has infinite length. It has a debug assertion for overflow.

huonw · 2016-01-25T13:11:20Z

Yes, the RangeFrom footgun is somewhat unfortunate, but it seems much less severe to me. If someone is using a range with an upper bound, it seems bad to explicitly disobey it, and, it seems worse to have different behaviour between two similar variations of a type.

Debug assertions makes the "real" behaviour a type a little bit of a grey area, but I think one certainly couldn't say that an overflowing version of that iterator is finite: no matter what build configuration you have, you'll never get a None back from next. I believe size_hint would have to return None as its second value.

bluss · 2016-01-25T13:23:59Z

Very fair point. The nice behavior is certainly preferable. When people pull out a while loop on the forum and discover "this has better performance than step_by", it's because they completely disregard the overflow (wraparound) case, though.

RangeFrom is neither finite nor infinite, but it ends with a bang 😉

dirk · 2016-01-25T13:52:34Z

Introducing new footguns definitely doesn't sound like a desirable thing. It seems to me that the current behavior (ie. overflow checking) is the most desirable behavior for the general use of the StepBy on a Range given that it's the safest (compared to a while loop).

That (250_u8..255).step_by(10) is both safe and optimizes pretty well (even without constant folding) is a good thing in my book.

I think it's worth asking whether we can—and whether it's worth—creating an optimized path for StepBy that can be taken when we can be sure of the safety preconditions you guise have mentioned:

The start of the range is known
The end of the range is known
The step is known to not overflow at any point along the series from the start to the end

alexcrichton · 2017-08-25T16:43:28Z

Today's reproduction still doesn't optimize

Specialize StepBy<Range(Inclusive)> Part of #51557, related to #43064, #31155 As discussed in the above issues, `step_by` optimizes very badly on ranges which is related to 1. the special casing of the first `StepBy::next()` call 2. the need to do 2 additions of `n - 1` and `1` inside the range's `next()` This PR eliminates both by overriding `next()` to always produce the current element and also step ahead by `n` elements in one go. The generated code is much better, even identical in the case of a `Range` with constant `start` and `end` where `start+step` can't overflow. Without constant bounds it's a bit longer than the manual loop. `RangeInclusive` doesn't optimize as nicely but is still much better than the original asm. Unsigned integers optimize better than signed ones for some reason. See the following two links for a comparison. [godbolt: specialization for ..](https://godbolt.org/g/haHLJr) [godbolt: specialization for ..=](https://godbolt.org/g/ewyMu6) `RangeFrom`, the only other range with an `Iterator` implementation can't be specialized like this without changing behaviour due to overflow. There is no way to save "finished-ness". The approach can not be used in general, because it would produce side effects of the underlying iterator too early. May obsolete #51435, haven't checked.

steveklabnik · 2019-09-22T14:40:58Z

Today, the feature is no longer in nightly, and we get fully optimized results for both:


playground::iterator: # @playground::iterator
# %bb.0:
	movl	$50000, %eax            # imm = 0xC350
	retq
                                        # -- End function

playground::while_: # @playground::while_
# %bb.0:
	movl	$50000, %eax            # imm = 0xC350
	retq
                                        # -- End function

closing!

huonw added A-libs I-slow Issue: Problems and improvements with respect to performance of generated code. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. labels Jan 24, 2016

steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017

Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2017

Emerentius mentioned this issue Jun 16, 2018

Specialize StepBy<Range(Inclusive)> #51601

Merged

Emerentius mentioned this issue Jul 5, 2018

Allow Pre-Stepping in Step-By? #52065

Closed

steveklabnik closed this as completed Sep 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StepBy<_, Range<_>> optimises poorly #31155

StepBy<_, Range<_>> optimises poorly #31155

huonw commented Jan 24, 2016

bluss commented Jan 24, 2016

dirk commented Jan 24, 2016

ranma42 commented Jan 24, 2016

bluss commented Jan 24, 2016

bluss commented Jan 24, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

dirk commented Jan 25, 2016

alexcrichton commented Aug 25, 2017

steveklabnik commented Sep 22, 2019

StepBy<_, Range<_>> optimises poorly #31155

StepBy<_, Range<_>> optimises poorly #31155

Comments

huonw commented Jan 24, 2016

bluss commented Jan 24, 2016

dirk commented Jan 24, 2016

ranma42 commented Jan 24, 2016

bluss commented Jan 24, 2016

bluss commented Jan 24, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

huonw commented Jan 25, 2016

bluss commented Jan 25, 2016

dirk commented Jan 25, 2016

alexcrichton commented Aug 25, 2017

steveklabnik commented Sep 22, 2019