Vec::retain() is significantly slower than into_iter().filter().collect() #91497

Closed
kageru opened this issue Dec 3, 2021 · 8 comments · Fixed by #91527
Labels: A-codegen, A-collections (`std::collections`), A-iterators, C-bug, I-slow, T-compiler, T-libs

Comments

kageru commented Dec 3, 2021

I noticed today that using Vec::retain is much slower than filtering into a newly allocated Vec.
I realize retain is probably more memory-efficient, but I still found it surprising that it would be that much slower (or slower at all).

When testing with this code:

#![feature(test)]
extern crate test;

fn main() {
    let xs: Vec<i32> = (0..1000).collect();
    assert_eq!(even_with_retain(xs.clone()), even_with_filter(xs.clone()));
}

pub fn even_with_retain(mut xs: Vec<i32>) -> Vec<i32> {
    xs.retain(|x| x & 1 == 0);
    xs
}

pub fn even_with_filter(xs: Vec<i32>) -> Vec<i32> {
    xs.into_iter().filter(|x| x & 1 == 0).collect()
}

#[bench]
fn bench_retain(b: &mut test::Bencher) {
    let xs: Vec<i32> = (0..1000).collect();
    b.iter(|| assert_eq!(even_with_retain(test::black_box(xs.clone())).len(), 500));
}

#[bench]
fn bench_filter_collect(b: &mut test::Bencher) {
    let xs: Vec<i32> = (0..1000).collect();
    b.iter(|| assert_eq!(even_with_filter(test::black_box(xs.clone())).len(), 500));
}

on rustc 1.59.0-nightly (48a5999 2021-12-01), I get these benchmark results:

test bench_filter_collect ... bench:         383 ns/iter (+/- 4)
test bench_retain         ... bench:       1,891 ns/iter (+/- 17)

on a Ryzen 5900X running Linux. Testing on a different machine (Xeon E3-1271 v3), I get similar numbers:

test bench_filter_collect ... bench:         498 ns/iter (+/- 29)
test bench_retain         ... bench:       1,800 ns/iter (+/- 44)

Vec::retain seemed like the obvious choice to me, so it being slower is either a bug or something that should be documented somewhere.

Godbolt

BluBb-mADe commented Dec 3, 2021

i7-8700K, Windows 10, default configuration:

test bench_filter_collect ... bench:       1,038 ns/iter (+/- 48)
test bench_retain         ... bench:       1,316 ns/iter (+/- 19)

With `lto = true`:

test bench_filter_collect ... bench:         360 ns/iter (+/- 14)
test bench_retain         ... bench:       1,335 ns/iter (+/- 46)

rustc 1.59.0-nightly (acbe444)

kageru (author) commented Dec 3, 2021

Setting lto = true makes the disparity even bigger for me:

test bench_filter_collect ... bench:         199 ns/iter (+/- 6)
test bench_retain         ... bench:       1,878 ns/iter (+/- 26)

@inquisitivecrystal added the A-codegen, A-collections, A-iterators, C-bug, I-slow, T-compiler, and T-libs labels on Dec 3, 2021
the8472 (member) commented Dec 3, 2021

It looks like retain is optimized (#81126) for small or large probabilities of keeping elements, i.e. long runs of elements that will be either kept or deleted. into_iter().filter().collect() on the other hand just reads elements one by one, filters them and writes out those that passed, i.e. it does not care about runs.

Since the test-case here drops every second element it's the worst possible case for retain, possibly even worse than random selection (since random clusters may have runs).

Edit: Hrm, the benchmarks look like they target high/low retain probabilities, but the code doesn't actually exploit runs; it still moves one element at a time once previous elements have been dropped.
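The run-sensitivity described above can be seen in a simplified model of the backshift loop (a hypothetical sketch, not the actual std implementation; `retain_model` and its move counter are invented here for illustration): once any element has been dropped, every subsequent kept element is copied left into the hole, so an alternating keep/drop pattern pays a copy for almost every survivor.

```rust
// Simplified model of a backshifting retain: kept elements are moved
// left one slot at a time once any element has been deleted. With an
// alternating keep/drop pattern there are no runs to skip, so nearly
// every kept element incurs a copy.
fn retain_model<T: Copy, F: Fn(&T) -> bool>(v: &mut Vec<T>, f: F) -> usize {
    let mut write = 0; // next free slot (start of the hole)
    let mut moves = 0; // number of element copies performed
    for read in 0..v.len() {
        let x = v[read]; // T: Copy, so read out first
        if f(&x) {
            if read != write {
                v[write] = x;
                moves += 1;
            }
            write += 1;
        }
    }
    v.truncate(write);
    moves
}

fn main() {
    let mut xs: Vec<i32> = (0..1000).collect();
    let moves = retain_model(&mut xs, |x| x & 1 == 0);
    assert_eq!(xs.len(), 500);
    // Every kept element after the first deletion is moved once.
    assert_eq!(moves, 499);
}
```

With the benchmark's `x & 1 == 0` predicate, 500 of 1000 elements survive and 499 of them are moved; a predicate that drops only a contiguous suffix would move none.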

hkratz (contributor) commented Dec 3, 2021

There is something weird going on, because if you put a copy of the current retain() implementation in the source file it is much faster:

test bench_filter_collect ... bench:         307 ns/iter (+/- 12)
test bench_my_retain      ... bench:         431 ns/iter (+/- 3)
test bench_retain         ... bench:       1,999 ns/iter (+/- 7)

Godbolt

Edit: Doh, but at least now we know that #88060 caused some major slowdown.

the8472 (member) commented Dec 3, 2021

You're not copying the most recent version of `retain`. `process_one` gained a `const bool` parameter in #88060 and thus gets instantiated in two flavors.

#[inline(always)]
fn process_one<F, T, A: Allocator, const DELETED: bool>(
    f: &mut F,
    g: &mut BackshiftOnDrop<'_, T, A>,
) -> bool
where
    F: FnMut(&mut T) -> bool,
{
    // SAFETY: Unchecked element must be valid.
    let cur = unsafe { &mut *g.v.as_mut_ptr().add(g.processed_len) };
    if !f(cur) {
        // Advance early to avoid double drop if `drop_in_place` panicked.
        g.processed_len += 1;
        g.deleted_cnt += 1;
        // SAFETY: We never touch this element again after dropped.
        unsafe { ptr::drop_in_place(cur) };
        // We already advanced the counter.
        return false;
    }
    if DELETED {
        // SAFETY: `deleted_cnt` > 0, so the hole slot must not overlap with current element.
        // We use copy for move, and never touch this element again.
        unsafe {
            let hole_slot = g.v.as_mut_ptr().add(g.processed_len - g.deleted_cnt);
            ptr::copy_nonoverlapping(cur, hole_slot, 1);
        }
    }
    g.processed_len += 1;
    return true;
}

// Stage 1: Nothing was deleted.
while g.processed_len != original_len {
    if !process_one::<F, T, A, false>(&mut f, &mut g) {
        break;
    }
}

// Stage 2: Some elements were deleted.
while g.processed_len != original_len {
    process_one::<F, T, A, true>(&mut f, &mut g);
}
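The two-stage split can be modeled in safe code (a hypothetical standalone sketch; `retain_two_stage` is invented here and ignores the panic safety, drops, and allocator handling the real `process_one` deals with): stage 1 performs no copies while nothing has been deleted, and stage 2 backshifts every kept element once the first hole appears.

```rust
// Safe-code model of the two-stage retain: the DELETED=false flavor
// (stage 1) leaves kept elements in place; the DELETED=true flavor
// (stage 2) copies each kept element into the hole left by deletions.
// Returns (final length, number of copies performed).
fn retain_two_stage<T: Copy, F: FnMut(&T) -> bool>(v: &mut Vec<T>, mut f: F) -> (usize, usize) {
    let len = v.len();
    let mut i = 0;
    // Stage 1: nothing deleted yet, kept elements stay where they are.
    while i < len && f(&v[i]) {
        i += 1;
    }
    let mut write = i; // index of the first hole, if any
    let mut copies = 0;
    if i < len {
        i += 1; // skip the element the predicate just rejected
    }
    // Stage 2: at least one element was deleted; survivors backshift.
    while i < len {
        let x = v[i]; // T: Copy
        if f(&x) {
            v[write] = x;
            write += 1;
            copies += 1;
        }
        i += 1;
    }
    v.truncate(write);
    (write, copies)
}

fn main() {
    // A kept prefix never enters stage 2, so no copies happen.
    let mut a: Vec<i32> = (0..10).collect();
    assert_eq!(retain_two_stage(&mut a, |x| *x < 5), (5, 0));
    // An alternating pattern copies every survivor after the first hole.
    let mut b: Vec<i32> = (0..10).collect();
    assert_eq!(retain_two_stage(&mut b, |x| x & 1 == 0), (5, 4));
}
```

This shows why the split pays off for long runs (a kept prefix costs nothing) but not for the alternating predicate in this issue's benchmark, which enters stage 2 almost immediately.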

BluBb-mADe commented Dec 3, 2021

I tried a few scenarios that should heavily favor retain, but the effect remains.

`|x| *x == 1`

test bench_filter_collect ... bench:         284 ns/iter (+/- 18)
test bench_retain         ... bench:       1,338 ns/iter (+/- 55)

`|x| *x != 1`

test bench_filter_collect ... bench:         413 ns/iter (+/- 130)
test bench_retain         ... bench:       1,369 ns/iter (+/- 28)

`|x| *x == 500`

test bench_filter_collect ... bench:         289 ns/iter (+/- 15)
test bench_retain         ... bench:       1,332 ns/iter (+/- 54)

`|x| *x >= 500`

test bench_filter_collect ... bench:         439 ns/iter (+/- 80)
test bench_retain         ... bench:       1,379 ns/iter (+/- 185)

i7-8700K, Windows 10, rustc 1.59.0-nightly (acbe444)

the8472 (member) commented Dec 4, 2021

I'll take a stab at this. @rustbot claim

the8472 (member) commented Dec 4, 2021

Submitted #91527, if anyone wants to confirm my benchmark results.

5 participants