🚀 float NaN handling #21
Conversation
CodSpeed Performance Report: Merging #21
Summary
Benchmarks breakdown
Thanks so much for tackling this, @jvdd! Any smaller tasks I can help with?
Gave it a shot at filling in the API values for NEON + Arm64. You can fish the exact commit out here - 728e310. Feel free to cherry-pick this into your PR if you judge it useful!
Thx @varon, I appreciate your help! I'll first try to merge PR #23 (which does some major refactoring in terms of traits & structs) - it should make the codebase a lot more flexible (it separates floats from the other datatypes without any real code overhead). I'll document this tomorrow. (I fear that this merge will result in quite a lot of merge conflicts with your PR - my apologies for this :/) Once PR #23 is merged, I do not plan to change anything in the traits / structs (& implementation) of ints & uints, so implementing the ARM/AArch64 SIMD for those dtypes should then be quite safe :)
♻️ major refactoring
What I tried in commit 07a5e66 does not work - you can't pass … Related issue: rust-lang/rust#52393
… varon-neon-nan-v3
♻️ change nan default handling behavior to SkipNa
Any action I can help with?
Hey @varon, after reviewing my own code today, I believe this PR is finally ready for merging! 🎉 If you'd like to help out, there are a couple of things you could do:
Thanks in advance for your help!
# rstest = { version = "0.16", default-features = false }
# rstest_reuse = "0.5"
This is something I experimented with and will use in a future PR (parameterizing the tests).
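For illustration, a hypothetical sketch of what parameterizing the tests with rstest could look like (the test below is illustrative, not code from this PR, and assumes the crate's ArgMinMax trait is in scope and usable on a Vec as in its README):

// Hypothetical parameterized test using rstest (illustrative only).
use argminmax::ArgMinMax;
use rstest::rstest;

#[rstest]
#[case(vec![1.0, 2.0, 3.0], (0, 2))]
#[case(vec![3.0, 2.0, 1.0], (2, 0))]
fn test_argminmax_cases(#[case] data: Vec<f32>, #[case] expected: (usize, usize)) {
    // argminmax returns (index of the minimum, index of the maximum)
    assert_eq!(data.argminmax(), expected);
}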
# TODO: support this
# [[bench]]
# name = "bench_f16_ignore_nan"
# harness = false
# required-features = ["half"]
This is currently not supported, as we use the ord_transform to provide SIMD support for the non-hardware-supported f16 datatype (see #1).
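For illustration, a minimal sketch of the ordinal-transform idea: the f16 bit pattern is mapped to an i16 whose integer ordering matches the float ordering, so integer SIMD instructions can be used. This assumes the usual bit-twiddling trick for sign-magnitude floats; the exact transform and naming in the crate may differ.

// Sketch: map an f16 bit pattern (reinterpreted as i16) to an ordinal i16.
fn f16_bits_to_ordinal(v: i16) -> i16 {
    // v >> 15 is 0x0000 for non-negative bit patterns and 0xFFFF for negative ones;
    // logically shifting that mask right by one yields 0x0000 or 0x7FFF.
    let mask = ((v >> 15) as u16 >> 1) as i16;
    // XOR flips the magnitude bits of negative floats, so plain i16 comparison
    // orders the original f16 values correctly.
    v ^ mask
}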
data
}

// TODO: rename _random_long_ to _nanargminmax_ |
Will do this in a separate PR (cleaning up the benchmarks; renaming + removing unused benchmarks)
// TODO: split this up
// pub trait NaNArgMinMax {
//     fn nanargminmax(&self) -> (usize, usize);
// }
This is for a future pull request.
src/lib.rs
// Scalar is faster for 64-bit numbers
// TODO: double check this (observed different things for new float implementation)
The /benches/results indicate that for most CPUs this is indeed faster! => will look into this in a separate PR
);
(minmax_tuple.0, minmax_tuple.2)
}
// TODO: previously we had dedicated non x86_64 code for f16 (see below)
Will revisit this in a separate PR
Will review shortly!
LGTM 🙃
Excellent timing @varon! I just finished my review, so this is the perfect moment to jump in.
I added some general comments - overall it seems really solid.
As a singular point, I would try to make the relations between the modules and types really clear in the code. For instance, in each of the data-type implementations, refer to the trait they implement, then in that trait explain how it fits into the overall packaging/system, and how it's generated, etc.
That will help users navigate the codebase much more easily, because there are useful breadcrumbs explaining where to look to go up/down the abstraction hierarchy.
The only other structural suggestion is to migrate the test code out of the implementation files. It makes them seem considerably more intimidating than they actually are. Ultimately, where tests are placed is a matter of opinion / Rust best practices (which I can't claim to be familiar with), but it was quite a surprise to come across them there.
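One possible (hypothetical) way to do this in Rust without restructuring the crate is to pull the unit tests in from a sibling file that is only compiled under test; the file names below are assumptions for illustration:

// In a hypothetical src/simd/simd_f32.rs: keep the implementation here and
// include the unit tests from a separate file, compiled only when testing.
#[cfg(test)]
#[path = "simd_f32_tests.rs"]
mod tests;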
Lastly, as a total aside, have you looked at doing a CUDA/ROCm implementation here? I suspect all of the copying of the results back and forth would probably be slower than this, but maybe with the right subdivision algorithm and keeping the data GPU-side, it could be possible to copy it over only once.
However, in the case of Metal, especially on the new Apple chips, they're using unified memory - this would mean there's no requirement to copy data over, and it would likely be dramatically faster than the NEON-based instructions.
for _ in 0..arr.len() / LANE_SIZE - 1 {
    // Increment the index
    new_index = Self::_mm_add(new_index, Self::INDEX_INCREMENT);
Performance question:
How does this look if we iterate over each of these arrays in separate for loops? Would that not increase our cache hits? We can't make assumptions about the locality of the data, but iterating each array separately means we're operating on data that likely has better locality.
This is for consideration, but if you do know the answer/have tried it, maybe throw in a comment explaining why that approach isn't faster than this.
Very relevant question!
To make sure we are on the same page - you are suggesting that, instead of iterating over the entire array in a single loop, it might be more efficient to perform some sort of "chunked" iteration (e.g. 4*LANE_SIZE elements per chunk) in an inner loop?
I can see how something like this could potentially decrease cache misses (smaller "chunks" that fully fit in cache can be reused in the inner loop).
Although I did not explore exactly what you described here, I did try loop unrolling - to no avail.
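For reference, a scalar stand-in sketching that chunked-iteration idea (hypothetical code, not from this PR; the real kernels operate on SIMD registers and would also need to handle the remainder):

// Assumed lane count for this sketch only.
const LANE_SIZE: usize = 8;

fn chunked_min_max(arr: &[f32]) -> (f32, f32) {
    let (mut min, mut max) = (f32::INFINITY, f32::NEG_INFINITY);
    // Outer loop walks the array in chunks of 4 * LANE_SIZE elements.
    for chunk in arr.chunks_exact(4 * LANE_SIZE) {
        // Two separate passes over the same small chunk: the second pass
        // reuses data that should still be in cache after the first one.
        for &v in chunk {
            if v < min { min = v; }
        }
        for &v in chunk {
            if v > max { max = v; }
        }
    }
    // NOTE: the remainder (arr.len() % (4 * LANE_SIZE) elements) is ignored here.
    (min, max)
}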
Ah, let me clarify.
In this loop, we're reading from 3 different arrays at once for every step, with one for loop.
What I am suggesting may be faster is to iterate over each array separately (i.e. use 2x separate for-loops), as this would maximise our chances of getting cache & prefetch hits.
for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index high
    ...
}
for _ in 0..arr.len() / LANE_SIZE - 1 {
    // do stuff on index low
    ...
}
Oh I see!
I guess assessing the potential performance gain of this can be done rather quickly - I'll analyze the cache misses some time next week :)
I thought about this further and quickly ran some experiments (benchmarks + some basic perf analysis). I did not observe any performance gains (only consistent performance degradations).
My 5 cents regarding this:
- the code iterates over just one (and the same) array
- all the other variables are SIMD registers (so they should not add any memory traffic)
=> splitting the code up into 2 for loops would thus read the same data twice (and reading the data was already the bottleneck of this code), plus there is some additional overhead, as the index increment is now performed twice (instead of once in the single-loop implementation).
Also - I would love it if you could drop me an email, so I can get in touch outside of GitHub and hopefully get some closer collaboration/easier contact in the future. You can reach me at varon-github@outlook.com.
Thank you @varon for your feedback on the pull request! Here are answers addressing your comments:
Once again, thank you for your feedback and suggestions! It was quite fun implementing this with your support / feedback :) I'll send you an email shortly.
Handle NaNs, closes #16
✔️ No behavior changes in this PR (except for f16).

Previous (and also current) behavior for floats:
- .argminmax: ignores NaNs (while being even faster for floats)
  => Only "downside": if the data contains ONLY NaNs and/or +inf/-inf, this will return 0 (I believe we can accept this unexpected behavior for now - it seems like a very uncommon use case)
- .nanargminmax (new function 🎉): returns the index of the first NaN value (instead of ignoring it)
  To realize this functionality, we use the transformation as detailed in 💪 handle NaNs #16 & explored in 🚧 POC - support NaNs for SSE & AVX2 f32 #18

❗ For f16 we do not have an IgnoreNaN implementation yet (previously .argminmax for f16 corresponded to the ReturnNan case, as we use the ord_transform to efficiently handle the non-hardware-supported f16).
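A hedged usage sketch of the two entry points described above (function names are taken from this PR; exact trait implementations and edge-case results may differ in the released crate):

use argminmax::ArgMinMax;

fn main() {
    let data: Vec<f32> = vec![1.0, f32::NAN, 5.0, -3.0];

    // .argminmax ignores NaNs -> expected (3, 2): indices of -3.0 and 5.0.
    let (min_idx, max_idx) = data.argminmax();
    println!("argminmax: ({min_idx}, {max_idx})");

    // .nanargminmax returns the index of the first NaN (here index 1) for
    // both results, as described in this PR.
    let (nan_min_idx, nan_max_idx) = data.nanargminmax();
    println!("nanargminmax: ({nan_min_idx}, {nan_max_idx})");
}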
Changing the "architecture":
- SIMDCore traits: an IgnoreNaN and a ReturnNan variant for floats
- the SIMDInstructionSet structs: IgnoreNaN / ReturnNaN
Changing the default behavior:
- switch from IgnoreNan to ReturnNan for argminmax => we use IgnoreNan as default - see ♻️ change nan default handling behavior to SkipNa #28
- simd_f16.rs should implement SIMDInstructionSet and not the IgnoreNan structs (FloatIgnoreNaN)
  -> first assure that we currently have no regressions (is still the case!) - if so, change the FloatIgnoreNan benches to FloatReturnNan (will most likely result in some regressions) and move the FloatIgnoreNan benches to a dedicated bench file
- ArgMinMax trait
- f16
Overview of the new architecture
SIMDInstructionSet
struct (e.g.,AVX2
) its argminmax is "return NaN" (e.g.simd_f32.rs
)AVX2FloatIgnoreNaN
)