-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Optimized the performance of float object #218
base: develop
Are you sure you want to change the base?
Conversation
Hello and thank you for the contribution! These changes to argument passing are quite subtle and sensitive. I'll have to double check things on my end and compare the generated assembly, etc. As a result, it will take me some time to review things. I anticipate that I'll have time to look into this in early July. I'll get back to you then. In the meantime, I just wanted to give some general context. Your analysis makes a lot of sense, but there's a few things at play that are worth considering. Float32 arithmetic on ARM uses NEON SIMD registers. This allows us to pass vector/quat/mask values by value in register and return them by register as well. For aggregate types (e.g. qvv, matrix), things are a bit more complicated. For clang, a few aggregate types (depending on size/internals) can be passed by value in register BUT aggregate values are not returned by register (unlike with In contrast, float64 uses scalar math on ARM (for the time being, it is on my roadmap to use SIMD registers for XY and ZW in pairs like we do with SSE). Using scalar math causes the generated assembly to be much larger as many more instructions are required. This has an adverse effect on inlining as large functions don't inline as well. However, despite the large number of instructions, most of them can execute independently as SIMD lanes are often independent. This means that with float64, there are far fewer bubbles in the execution stream and there is far more work to execute. As a result, with modern out-of-order CPUs, they can be kept well fed with few to no stalls in execution. And so, even if each instruction is more expensive, the gap in execution cost between float32 and float64 might not be as large as one might expect in practice. Note that using XY and ZW in pairs will help reduce the assembly size, improving inlining and performance but because both pairs are often independent, the rest of the analysis remains consistent. In the end, whether a function inlines or not is often the biggest performance impact at play and matrix math often uses many registers and many instructions, hindering inlining. Crucially, whether a function inlines or not is also determined by where it is called and so the measurements depend heavily on the sort of code that you have. Are you at liberty to share what the calling code looks like and which RTM functions are involved in your measurements or did you do broad measurements over a large and complex piece? Cheers, |
Hi Nicholas,
In the _simd_xxxx functions within the TMatrix4 class, all matrix operations are implemented internally within our project. Some more specific functions are initially implemented as follows:
Here, we encounter the issue of parameter copy passing and the return value problem you mentioned earlier. We have since made changes to such function calls:
The main change was to modify the parameter passing of the matrix to be by reference. The performance after these modifications has already shown significant improvement. |
One more thing is that our project's C++ version is quite high. This PR did not handle compatibility with C++11 well, so I need to make some adjustments. |
788f574
to
a995282
Compare
Thank you for the clarification. I will see if I can add a benchmark test based on your sample and see if I can reproduce locally. What kind of processors/android device are you seeing this on? I'll take a look at this in the next 2 weeks. |
The processor is snapdragon-xr2-gen2 |
I encountered an issue with unit tests. The configurations |
Yes, those failures are probably due to a known compiler/toolchain issue, see this PR for details: #212 I wouldn't worry about it for now. I'm waiting for github to update the image with a newer VS version that has a fixed clang version. Sadly, for reasons unknown, RTM ends up triggering a LOT of compiler bugs in various toolchains. Over the years, I've found dozens of bugs (and reported many) in msvc, gcc, and clang. Thankfully, it has gotten better over the years. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of good stuff in here:
- I like the idea of moving the vector mix details into its own header
- I like the idea of using template specialization, I've ran into a lot of codegen issues with the existing function due to relying on
constexpr
branches - I like the idea of using
std::enable_if
to validate and branch variants
Just needs a bit of cleaning up and minor tweaks to bring back the missing AVX/NEON specializations for vector mix, see notes.
I'll profile the matrix argument passing change in the coming days and get back to you.
@@ -0,0 +1,193 @@ | |||
#pragma once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to move this into its own header.
As a convention, all similar headers have the suffix .impl.h
The header is also missing the MIT license information, see other headers as example.
#include "rtm/types.h" | ||
#include "rtm/impl/compiler_utils.h" | ||
#include "rtm/scalarf.h" | ||
#include "rtm/scalard.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this header required?
|
||
#if defined(RTM_SSE2_INTRINSICS) || defined(RTM_AVX_INTRINSICS) | ||
|
||
#define SHUFFLE_MASK(a0,a1,b2,b3) ( (a0) | ((a1)<<2) | ((b2)<<4) | ((b3)<<6) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a standardized macro for this, see _MM_SHUFFLE
in Intel intrinsic documentation.
// Float swizzle | ||
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////// | ||
template<int index0, int index1, int index2, int index3> | ||
RTM_FORCE_INLINE vector4f vector_swizzle_impl(const vector4f& vec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an implementation detail, we should move all of this into the rtm_impl
namespace
{ | ||
return _mm_shuffle_ps(vec, vec, SHUFFLE_MASK(index0, index1, index2, index3)); | ||
} | ||
template<> RTM_FORCE_INLINE vector4f vector_swizzle_impl<0, 1, 2, 3>(const vector4f& vec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using template specialization for this
////////////////////////////////////////////////////////////////////////// | ||
// Writes a vector4 to aligned memory. | ||
////////////////////////////////////////////////////////////////////////// | ||
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE void RTM_SIMD_CALL vector_store_aligned(vector4f_arg0 input, float* output) RTM_NO_EXCEPT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem, vector4d version is missing along with unit test
output[0] = vector_get_x(input); | ||
output[1] = vector_get_y(input); | ||
#elif defined(RTM_NEON_INTRINSICS) | ||
vst1_f32(output, *(float32x2_t*)&input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reinterpret_cast isn't safe, would be better to getvget_low_f32
output[0] = vector_get_x(input); | ||
output[1] = vector_get_y(input); | ||
output[2] = vector_get_z(input); | ||
#elif defined(RTM_NEON_INTRINSICS) | ||
vst1_f32(output, *(float32x2_t*)&input); | ||
vst1q_lane_f32(((float32_t*)output) + 2, input, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem about vget_low_f32
and here the cast for output
isn't necessary
////////////////////////////////////////////////////////////////////////// | ||
// 3D cross product: lhs x rhs | ||
////////////////////////////////////////////////////////////////////////// | ||
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE vector4f RTM_SIMD_CALL vector_cross3(vector4f_arg0 lhs, vector4f_arg1 rhs) RTM_NO_EXCEPT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why move this?
////////////////////////////////////////////////////////////////////////// | ||
template <int index0, int index1, int index2, int index3, | ||
typename std::enable_if<(index0 < 4 && index1 < 4 && index2 >= 4 && index3 >= 4), int>::type = 0> | ||
vector4f vector_swizzle_with_index(vector4f_arg0 input0, vector4f_arg1 input1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot rename a function without retaining the old one and deprecating it. If this is meant to replace the old one, why give it a new name?
Also, here we are missing the other parts of the function signature (e.g. RTM_SIMD_CALL etc), see original function signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it an implementation detail? In that case, it belongs in the rtm_impl
namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of this new function wasn't my intention. In my project, I use if constexpr
to perform compile-time branch evaluation. However, RTM needs to ensure normal operation under C++11, so I added the with_index
function to handle the evaluation of template parameters that have been converted to int. Are you suggesting that instead of adding the with_index
function, I should modify vector_mix
itself? Or should I move the with_index
function to the rmt_impl
namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the confusion, if this is a new function that is an implementation details (and not meant for end users) then it would belong in the rtm_impl
namespace like the others. Functions in that namespace can be easily changed in any release as I don't maintain ABI compatibility for implementation details.
The lack of if constexpr
is a pain, I agree but sadly many tool chains still commonly used don't fully implement later C++ versions. However, RTM provides rtm_impl::static_condition
that allows similar usage (e.g. see old vector_mix
impl). I'll let you decide which pattern to use for this, but I'd ask that if you decide to keep the with_index function, please move it to the rtm_impl namespace.
I added a benchmark to profile argument passing for matrix3x3f here: #219 The results are as follow for me:
This is in line with my expectations and confirms why I choose to pass as many aggregates by register as possible:
I will see with my Pixel 7 android phone if I can replicate the results when I get the chance this week. I suspect that the results will be consistent. It may be worthwhile digging further into your benchmark and how you measured the difference. Did you only measure in a micro-benchmark or did you also observe an improvement in a non-synthetic use case (e.g. actual application)? Micro-benchmarks only offer a narrow view and can sometimes not capture the actual cost/benefit of an approach vs another. How did the assembly look before/after the change? It may also be worthwhile trying to run my benchmark on your device to see if you can reproduce my findings. From there, perhaps you may be able to tweak it to showcase the results you've seen on your end. |
The CI also ran my benchmark on x64 SSE2 with clang 14 and we can see there that the calling convention not returning aggregates by register indeed causes performance issues:
I'll have to see what the generated assembly looks like there. Later this/next week I'll give that a try on my Zen2 desktop. |
Our test results were also analyzed based on benchmark reports under Android, but the code in the benchmark is slightly more complex. The conclusion of the comparison was initially surprising: SIMD performance for double was actually better than for float, which was counterintuitive. We eventually found that there was a difference in parameter passing between the two, so we made some modifications to the parameter passing. The improved float performance indeed saw a significant boost. Regarding the code in the PR, there are many non-standard parts, and I will modify them one by one. Next week, I will also run your benchmark code on my device to see how it differs from my benchmark. |
Here are some more notes profiling argument passing on my Zen2 desktop. With VS2022 SSE2 and
This is because I originally opted to not pass the second argument by value. This may appear sub-optimal in this synthetic benchmark but in practice, it depends a lot on the function signature. With VS2022 SSE2 without
Here, surprisingly, we can see that passing by value is slower than by reference. It is slower because with the default calling convention, vectors passed by value are written on the stack and thus passed by reference underneath the hood. Current is also slower. Here, current ends up returning the matrix by value on the stack while arguments are passed by reference, and it must be copied to the actual variable upon return. This is why it is slower than by reference where the return address is provided by an argument. With VS2022 SSE2 and Clang 17, the results are as follow:
The numbers here are slightly different but consistent with the SSE2 non-vectorcall ones. The assembly is slightly different but the end result is the same for all 3. With my Pixel 7, the results are as follow:
Here as well, the numbers are consistent with my M1 laptop: passing and returning by value is faster than by reference. Overall, it's tricky. What is optimal for NEON and vectorcall isn't optimal elsewhere. |
Thank you very much for sharing the data. It seems that maintaining the original method of passing parameters by value would better meet the requirements of the rtm library. I have also extracted the business-related content from my local project and conducted benchmark tests specifically on passing parameters by value versus passing parameters by reference . The results show that the performance of both methods is almost identical, and there is no significant advantage of passing parameters by reference over passing parameters by value. I apologize for the premature and incorrect conclusion I made earlier. Once again, thank you for your professional response, which has been very beneficial to me. I will spend some more time analyzing the actual cause of the issue in my project. |
Thank you for taking the time to dig deeper :) Writing synthetics benchmarks is as much art as it is science. It is not trivial, especially for simple low level functions with few instructions like this. It is very easy to end up measuring side effects that you did not intend to measure or accounted for. I've made many mistakes in the past when writing them and in the end, sometimes, it isn't possible to capture the true impact that would be seen in real world usage. I've seen many cases where synthetic benchmarks show a win for something vs another which turns out to be the opposite in real code due to inlining and scheduling (of such small low level things). As an example, I spent at least 3-6 months figuring out how to properly benchmark animation decompression: https://github.com/nfrechette/acl/blob/ac1ea98938eef4d4bd4c9742a059cb886cad19d5/tools/acl_decompressor/sources/benchmark.cpp#L50 In the end, sometimes it isn't possible to write a function that is optimal on every architecture or every usage scenarios. RTM aims to provide sane defaults where possible, but it is expected that if you need specialized versions for your needs (due to the unique circumstances of your code) you'll write them outside RTM. For example, sometimes you need a matrix function inlined in a tight very hot loop even though in general you might not want to always inline it due to code size bloat. Another is with my animation compression library where I need stable versions of quaternion functions that won't change as I update RTM to ensure determinism over time. That being said, if you think something is more widely useful and should belong within RTM, feel free to submit a PR as you did and we can discuss and consider it :) |
Out of curiosity, I also added the same benchmark for matrix3x3d to see.
Doubles are slower but as you've found, when passing by reference, they are almost as fast even though they use scalar arithmetic instead of using SIMD pairs. With SIMD pairs, perhaps double could get faster under this synthetic benchmark. However, passing by value (which is currently the default for the first matrix3x3d argument) is quite a bit slower. It appears that with doubles, the aggregate structures are not passed by register as arguments nor are they returned by register as a return value. This means that we have to roundtrip to the stack every time :( I'll add a note to double check this with the NEON documentation as it appears that this could be improved. That might also change down the road once I optimize doubles to use SIMD pairs. Thanks to your input, we now have benchmarks to track this :) |
Hello,
Thank you for taking the time to review my pull request. Below is a brief overview of the changes and enhancements I've made. Please let me know if there are any questions or further clarifications needed.
Two PRs will be submitted in total; this is the first one.
PR1
This PR mainly focuses on optimizations for float.
In the initial tests, a peculiar result was observed: when testing Matrix on Android, the execution time for float was longer than that for double, which is counterintuitive. Therefore, an analysis was conducted on this part. The first step was to compare the instruction counts of the two test programs, revealing that the instruction count for the float test program was 712,287,424, while that for the double test program was 664,675,474. RTM uses pure C methods to implement double, yet surprisingly, the instruction count for float NEON was even higher than that for double, leading to further analysis.
By decompiling the instruction code of the double test code, it was found that the compiler, after optimization, inserted a large number of NEON intrinsics, significantly accelerating performance. The reasons for this optimization include:
As a result, double performed much better than expected, but this only indicates that the compiler's optimization for double code is more aggressive, not that double is inherently faster than float. There must be areas in the float implementation that are more performance-costly, hence the following two optimizations were made:
1. Changing matrix parameter passing from value to reference
From the decompiled code of double, it can be seen that the compiler eventually inlines the function, expanding most of the code into a single function call. This disrupts the expected function call stack distribution, rendering RTM's designed argument transmission strategy ineffective. For most function parameters, the value copy method (argx) is used for passing, which inadvertently increases many copy operations. Under the Android ARM64 architecture, the definition of types is as follows:
The settings for passing values by reference differ between float and double., which is one of the reasons for the slower speed of float.
By changing the matrix type parameter to reference passing, the test speed under float showed a significant improvement.
2. Modification of the vector_mix function
Compared to the conventional shuffle() implementation, RTM's vector_mix() is relatively special, allowing selection at any element position between two vectors, while conventional shuffle() implementations usually have the first two elements from the first vector and the last two from the second vector. This makes RTM's vector_mix() difficult to implement with simple instructions. However, we eventually made some optimizations based on compile-time behavior. The float version of vector_mix() can use __builtin_shufflevector() when compiled with the clang compiler, achieving maximum performance. For other platforms, we try to rely on compile-time behavior for acceleration.