Missed optimization/perf oddity with allocations #128854
Comments
I know this is a "minimal reduction" but is there an example where this impacts actual programs?
For reference, the actual cause, as far as I can tell, is this line:

```rust
// Make sure we don't accidentally allow omitting the allocator shim in
// stable code until it is actually stabilized.
core::ptr::read_volatile(&__rust_no_alloc_shim_is_unstable);
```

It, of course, can't be optimized out because it's volatile. That line doesn't appear in `alloc_zeroed`.
Just noticed a helpful rundown of the cause, why it happens, and the perf potential if fixed (@Kobzol's perf run) in this Zulip conversation.
Would you count this as fixed if we make the codegen match by regressing the good case? #130497
…=<try> read_volatile `__rust_no_alloc_shim_is_unstable` in `alloc_zeroed` (rust-lang#128854 (comment)) r? `@ghost`
…=bjorn3 read_volatile `__rust_no_alloc_shim_is_unstable` in `alloc_zeroed`: It was pointed out in rust-lang#128854 (comment) that the magic volatile read was probably missing from `alloc_zeroed`. I can't find any mention of `alloc_zeroed` in rust-lang#86844, so it looks like this was just missed initially.
Bold move @saethlin 😆. I've updated the bug description/repro. Y'all do whatever you want with this; it's just an observation I made a while back while looking into Mojo's claims of being faster than Rust.
I am prototyping a design for DataFusion scalar function vectorization without boilerplate (apache/datafusion#12635). Benchmarking the code below shows that the second function is 6x slower than the first on my machine.

```rust
// Simple Result type. The Result wrappers get optimized away nicely.
type Result<T> = std::result::Result<T, String>;

fn simple_sum(a: i32, b: i32, c: i32, d: i32) -> Result<i32> {
    Ok(a + b + c + d)
}

fn curried_sum(a: i32, b: i32, c: i32, d: i32) -> Result<i32> {
    // the arithmetic gets inlined nicely
    Ok(fn_fn_fn_fn(a)?(b)?(c)?(d)?)
}

fn fn_fn_fn_fn(a: i32) -> Result<Box<dyn Fn(i32) -> Result<Box<dyn Fn(i32) -> Result<Box<dyn Fn(i32) -> Result<i32>>>>>>> {
    Ok(Box::new(move |b| Ok(Box::new(move |c| Ok(Box::new(move |d| Ok(a + b + c + d)))))))
}
```

These functional-style functions are going to be used via template functions. The odd syntax is really important to make the templating work; i.e., it's easy to support no lazy eval (no
I definitely see a different instruction sequence that is fixed by removing the volatile read, but on x86_64 these microbenchmark to the same throughput. So this is a nice use case, and I personally hate the
@saethlin thank you for your response! Details in https://gist.github.com/findepi/89497d13a3a249a1d2d1b6d7c2f8b927
Consider the following minimized example:
Expected generated output (Rust 1.81.0):
Actual output (Rust Nightly):
Godbolt: https://www.godbolt.org/z/x458Pv8P5