
Optimized the performance of float object #218

Open
wants to merge 3 commits into
base: develop
from

Conversation

daliziql

Hello,
Thank you for taking the time to review my pull request. Below is a brief overview of the changes and enhancements I've made. Please let me know if there are any questions or further clarifications needed.

Two PRs will be submitted in total; this is the first one.

PR1

This PR mainly focuses on optimizations for float.

In the initial tests, a peculiar result was observed: when testing Matrix on Android, the execution time for float was longer than that for double, which is counterintuitive. To understand why, I first compared the instruction counts of the two test programs: the count for the float test program was 712,287,424, while that for the double test program was 664,675,474. RTM implements double in plain scalar C++, yet the instruction count for the NEON float path was even higher than that for double, which prompted further analysis.
Disassembling the double test code showed that the optimizing compiler had emitted a large number of NEON instructions, significantly accelerating performance. The reasons for this optimization include:

  1. Using 16-byte alignment
  2. Extensive use of RTM_FORCE_INLINE for forced inline expansion

As a result, double performed much better than expected, but this only indicates that the compiler's optimization for double code is more aggressive, not that double is inherently faster than float. There must be areas in the float implementation that are more performance-costly, hence the following two optimizations were made:

1. Changing matrix parameter passing from value to reference
From the disassembly of the double code, it can be seen that the compiler eventually inlines the functions, expanding most of the code into a single function. This disrupts the expected distribution of function calls and renders RTM's argument passing strategy ineffective. Most function parameters are passed by value (the argx aliases), which inadvertently adds many copy operations. On Android ARM64, the types are defined as follows:

using matrix3x3f_arg0 = const matrix3x3f;
using matrix3x3f_arg1 = const matrix3x3f;
using matrix3x3f_argn = const matrix3x3f&;

using matrix3x3d_arg0 = const matrix3x3d;
using matrix3x3d_arg1 = const matrix3x3d&;
using matrix3x3d_argn = const matrix3x3d&;

using matrix3x4f_arg0 = const matrix3x4f;
using matrix3x4f_arg1 = const matrix3x4f;
using matrix3x4f_argn = const matrix3x4f&;

using matrix3x4d_arg0 = const matrix3x4d;
using matrix3x4d_arg1 = const matrix3x4d&;
using matrix3x4d_argn = const matrix3x4d&;

using matrix4x4f_arg0 = const matrix4x4f;
using matrix4x4f_arg1 = const matrix4x4f;
using matrix4x4f_argn = const matrix4x4f&;

using matrix4x4d_arg0 = const matrix4x4d;
using matrix4x4d_arg1 = const matrix4x4d&;
using matrix4x4d_argn = const matrix4x4d&;

The by-value versus by-reference settings differ between float and double, which is one of the reasons float is slower.
By changing the matrix type parameter to reference passing, the test speed under float showed a significant improvement.
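
A minimal sketch of the alias pattern described above, assuming simplified stand-in types (`vec4f`, `mat3f`, and so on are hypothetical, not RTM's real definitions). It only shows how the same function body can receive its arguments by value or by reference depending on the typedefs:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for RTM's vector and matrix types (not the real definitions).
struct vec4f { float x, y, z, w; };
struct vec4d { double x, y, z, w; };
struct mat3f { vec4f x_axis, y_axis, z_axis; };
struct mat3d { vec4d x_axis, y_axis, z_axis; };

// Same shape as the ARM64 aliases quoted above: float matrices go by value in
// the first two argument slots, double matrices switch to a reference earlier.
using mat3f_arg0 = const mat3f;
using mat3f_arg1 = const mat3f;
using mat3f_argn = const mat3f&;
using mat3d_arg0 = const mat3d;
using mat3d_arg1 = const mat3d&;
using mat3d_argn = const mat3d&;

static_assert(!std::is_reference<mat3f_arg1>::value, "float arg1 is a by-value alias");
static_assert(std::is_reference<mat3d_arg1>::value, "double arg1 is a by-reference alias");

// Written once against the aliases; how each argument travels is decided by the typedefs.
inline float diagonal_sum(mat3f_arg0 lhs, mat3f_arg1 rhs)
{
    return lhs.x_axis.x + lhs.y_axis.y + lhs.z_axis.z
         + rhs.x_axis.x + rhs.y_axis.y + rhs.z_axis.z;
}

inline double diagonal_sum(mat3d_arg0 lhs, mat3d_arg1 rhs)
{
    return lhs.x_axis.x + lhs.y_axis.y + lhs.z_axis.z
         + rhs.x_axis.x + rhs.y_axis.y + rhs.z_axis.z;
}

inline void example_usage()
{
    const mat3f identity_f{ {1.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f, 0.0f} };
    const mat3d identity_d{ {1.0, 0.0, 0.0, 0.0}, {0.0, 1.0, 0.0, 0.0}, {0.0, 0.0, 1.0, 0.0} };
    assert(diagonal_sum(identity_f, identity_f) == 6.0f);
    assert(diagonal_sum(identity_d, identity_d) == 6.0);
}
```

Changing a single alias switches every function written against it, which is why the choice is sensitive: it affects all call sites at once.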

2. Modification of the vector_mix function
Compared to the conventional shuffle() implementation, RTM's vector_mix() is relatively special, allowing selection at any element position between two vectors, while conventional shuffle() implementations usually have the first two elements from the first vector and the last two from the second vector. This makes RTM's vector_mix() difficult to implement with simple instructions. However, we eventually made some optimizations based on compile-time behavior. The float version of vector_mix() can use __builtin_shufflevector() when compiled with the clang compiler, achieving maximum performance. For other platforms, we try to rely on compile-time behavior for acceleration.

template<mix4 comp0, mix4 comp1, mix4 comp2, mix4 comp3>
vector4f RTM_SIMD_CALL vector_mix(vector4f_arg0 input0, vector4f_arg1 input1) RTM_NO_EXCEPT
{
        constexpr int index0 = (int)comp0;
        constexpr int index1 = (int)comp1;
        constexpr int index2 = (int)comp2;
        constexpr int index3 = (int)comp3;
#if defined(__clang__)
        return __builtin_shufflevector(input0, input1, index0, index1, index2, index3);
#else
        if constexpr (index0 < 4 && index1 < 4 && index2 >= 4 && index3 >= 4) {
                return vector_shuffle(input0, input1, index0, index1, index2 - 4, index3 - 4);
        }
        else if constexpr(index0 < 4 && index1 < 4 && index2 < 4 && index3 < 4) {
                //no input1 use here
                return vector_swizzle(input0, index0, index1, index2, index3);
        }
        else if constexpr(index0 >= 4 && index1 >= 4 && index2 >= 4 && index3 >= 4) {
                //no input0 use here
                return vector_swizzle(input1, index0 - 4, index1 - 4, index2 - 4, index3 - 4);
        }
        else {

                float combine_arr[8];
                vector_store(input0, combine_arr);
                vector_store(input1, combine_arr + 4);
                return vector_set(combine_arr[index0], combine_arr[index1], combine_arr[index2], combine_arr[index3]);
        }
#endif
}
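
To make the selection semantics concrete, here is a portable scalar sketch, assuming a plain `vec4` struct (hypothetical, not RTM's `vector4f`). Indices 0 to 3 pick a lane from the first input and 4 to 7 pick a lane from the second, in any output position:

```cpp
#include <cassert>

// Hypothetical four-lane vector used only for this illustration.
struct vec4 { float v[4]; };

// Picks one lane out of the eight lanes formed by a followed by b.
template<int index>
inline float pick_lane(const vec4& a, const vec4& b)
{
    static_assert(index >= 0 && index < 8, "lane index must be in [0, 8)");
    // The mask keeps the array index within [0, 4) for both inputs.
    return index < 4 ? a.v[index & 3] : b.v[index & 3];
}

// Any output lane may come from either input, which is what makes a mix
// more general than a two-low, two-high shuffle.
template<int i0, int i1, int i2, int i3>
inline vec4 mix(const vec4& a, const vec4& b)
{
    return vec4{{ pick_lane<i0>(a, b), pick_lane<i1>(a, b),
                  pick_lane<i2>(a, b), pick_lane<i3>(a, b) }};
}

inline void example_usage()
{
    const vec4 a{{0.0f, 1.0f, 2.0f, 3.0f}};
    const vec4 b{{10.0f, 11.0f, 12.0f, 13.0f}};
    // Alternate lanes between the two inputs: a.x, b.y, a.z, b.w.
    const vec4 r = mix<0, 5, 2, 7>(a, b);
    assert(r.v[0] == 0.0f && r.v[1] == 11.0f && r.v[2] == 2.0f && r.v[3] == 13.0f);
}
```

This scalar form says nothing about performance; it only pins down what the SIMD variants have to compute.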

@CLAassistant

CLAassistant commented Jun 21, 2024

CLA assistant check
All committers have signed the CLA.

@nfrechette
Owner

Hello and thank you for the contribution!
I apologize for the late reply, I am just coming back from a trip abroad.

These changes to argument passing are quite subtle and sensitive. I'll have to double check things on my end and compare the generated assembly, etc. As a result, it will take me some time to review things. I anticipate that I'll have time to look into this in early July. I'll get back to you then.

In the meantime, I just wanted to give some general context. Your analysis makes a lot of sense, but there are a few things at play that are worth considering.

Float32 arithmetic on ARM uses NEON SIMD registers. This allows us to pass vector/quat/mask values by value in register and return them by register as well. For aggregate types (e.g. qvv, matrix), things are a bit more complicated. For clang, a few aggregate types (depending on size/internals) can be passed by value in register BUT aggregate values are not returned by register (unlike with __vectorcall with MSVC). When functions inline, this distinction doesn't really matter but when they don't it comes into play as it forces round-trips to stack memory (also called a load-hit-store). Typically, modern processors handle this case quite well through store-forwarding but a few extra cycles on the load are still incurred. As a result, code that uses float32 ends up being quite dense, with many instructions dependent on one another, which can introduce bubbles in the execution, and the extra cycles from store-forwarding add further latency.

In contrast, float64 uses scalar math on ARM (for the time being, it is on my roadmap to use SIMD registers for XY and ZW in pairs like we do with SSE). Using scalar math causes the generated assembly to be much larger as many more instructions are required. This has an adverse effect on inlining as large functions don't inline as well. However, despite the large number of instructions, most of them can execute independently as SIMD lanes are often independent. This means that with float64, there are far fewer bubbles in the execution stream and there is far more work to execute. As a result, with modern out-of-order CPUs, they can be kept well fed with few to no stalls in execution. And so, even if each instruction is more expensive, the gap in execution cost between float32 and float64 might not be as large as one might expect in practice. Note that using XY and ZW in pairs will help reduce the assembly size, improving inlining and performance but because both pairs are often independent, the rest of the analysis remains consistent.
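
The dependency-chain argument can be illustrated with a small sketch that is unrelated to RTM's code: the same additions arranged as one long dependent chain versus four independent chains. The function names are hypothetical, and an optimizing compiler may transform either loop, so treat it as an illustration of the idea rather than a measurement:

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: every addition has to wait for the previous one.
inline float sum_chained(const float* data, std::size_t count)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        acc += data[i];
    return acc;
}

// Four accumulators: four independent chains that an out-of-order core
// can overlap, even though the total number of additions is not lower.
inline float sum_interleaved(const float* data, std::size_t count)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        acc0 += data[i + 0];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }
    for (; i < count; ++i)
        acc0 += data[i];    // leftover elements
    return (acc0 + acc1) + (acc2 + acc3);
}

inline void example_usage()
{
    const float data[10] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f };
    // Small integers are exactly representable, so both orders agree exactly here.
    assert(sum_chained(data, 10) == 55.0f);
    assert(sum_interleaved(data, 10) == 55.0f);
}
```

With general floating-point data the two orderings can round differently, which is one reason compilers do not make this transformation on their own without relaxed floating-point settings.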

In the end, whether a function inlines or not is often the biggest performance impact at play and matrix math often uses many registers and many instructions, hindering inlining. Crucially, whether a function inlines or not is also determined by where it is called and so the measurements depend heavily on the sort of code that you have. Are you at liberty to share what the calling code looks like and which RTM functions are involved in your measurements or did you do broad measurements over a large and complex piece?

Cheers,
Nicholas

@daliziql
Author

Hi Nicholas,
Thank you for your reply.
The main content of the test involves matrix composition, transformation, and inversion operations. Below is the general framework of the test code:

//----------------------------------------------------------------------------------------
// matrix compose and transform
//----------------------------------------------------------------------------------------
template<typename FloatType, typename CalcPolicy>
static void DoMatrixComposeImpl(benchmark::State& state) {
    using Vector4Array = std::vector<TSimdVector<FloatType>>;
    using QuaternionArray = std::vector<TQuaternion<FloatType>>;
    
    Vector4Array   translationArray;
    Vector4Array   scaleArray;
    QuaternionArray    quatArray;
    Vector4Array   orignalArray;
    Vector4Array    resultArray;

    ...

    for (int i = 0; i < kMathCalcCount; i++) {
       translationArray[i] = TSimdVector<FloatType>(MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), MathTool::rangeRandom(0.0, 1000.0), 1.0f);
       scaleArray[i] = TSimdVector<FloatType>(1.0f, 1.0f, 1.0f, 1.0f);
       quatArray[i] = TQuaternion<FloatType>::fromAxisAngle(TVector3<FloatType>::YAxisVector, ScalarTool::degreesToRadians(MathTool::rangeRandom(0.0, 90.0)));
       orignalArray[i] = TSimdVector<FloatType>(0.0, 0.0, 0.0, 1.0);
    }

    for (auto&& _ : state) {
       for (int i = 0; i < kMathCalcCount; i++) {
           TMatrix4<FloatType> tMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(translationArray[i], TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> sMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), scaleArray[i], TQuaternion<FloatType>::Identity);
           TMatrix4<FloatType> rMat = TMatrix4<FloatType>::template _simd_makeTransform<CalcPolicy>(TSimdVector<FloatType>(0, 0, 0, 1), TSimdVector<FloatType>(1.0, 1.0, 1.0, 1.0), quatArray[i]);

           TMatrix4<FloatType> matrix = tMat.template _simd_multiOther<CalcPolicy>(rMat).template _simd_multiOther<CalcPolicy>(sMat);

           resultArray[i] = matrix.template _simd_transformVector4<CalcPolicy>(orignalArray[i]);
       }
    }
}

In the _simd_xxxx functions within the TMatrix4 class, all matrix operations are implemented internally within our project. Some more specific functions are initially implemented as follows:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline matrix3x3f RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3f_arg0 lhs, matrix3x3f_arg1 rhs) RTM_NO_EXCEPT {
    // Non-const local result: the argN aliases are const and cannot be assigned to.
    matrix3x3f out_m{};
    
    vector4f tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    out_m.z_axis = tmp;
    
    // The result is returned by value.
    return out_m;
}
auto r = simd::matrix_mul_fill_mode(rhs, lhs);

Here, we encounter the issue of parameter copy passing and the return value problem you mentioned earlier. We have since made changes to such function calls:

RTM_DISABLE_SECURITY_COOKIE_CHECK inline void RTM_SIMD_CALL matrix_mul_fill_mode(
    matrix3x3d_argn lhs, matrix3x3d_argn rhs, matrix3x3d &out_m) RTM_NO_EXCEPT {
    vector4d tmp = vector_mul(vector_dup_x(lhs.x_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.x_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.x_axis), rhs.z_axis, tmp);
    out_m.x_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.y_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.y_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.y_axis), rhs.z_axis, tmp);
    out_m.y_axis = tmp;

    tmp = vector_mul(vector_dup_x(lhs.z_axis), rhs.x_axis);
    tmp = vector_mul_add(vector_dup_y(lhs.z_axis), rhs.y_axis, tmp);
    tmp = vector_mul_add(vector_dup_z(lhs.z_axis), rhs.z_axis, tmp);
    out_m.z_axis = tmp;
}

TMatrix4<T> r{};
simd::matrix_mul_fill_mode(rhs.simdRef(), lhs.simdRef(), r.simdRef());

The main change was to modify the parameter passing of the matrix to be by reference. The performance after these modifications has already shown significant improvement.
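
For comparison, the two signatures can be written side by side in a scalar sketch, assuming hypothetical `vec4` and `mat3` structs in place of RTM's types. Both variants compute the same product; they differ only in how the inputs and the result travel:

```cpp
#include <cassert>

// Hypothetical stand-ins for the RTM vector and matrix types.
struct vec4 { float x, y, z, w; };
struct mat3 { vec4 x_axis, y_axis, z_axis; };

// row.x * m.x_axis + row.y * m.y_axis + row.z * m.z_axis, as in the code above.
inline vec4 mul_row(const vec4& row, const mat3& m)
{
    return vec4{
        row.x * m.x_axis.x + row.y * m.y_axis.x + row.z * m.z_axis.x,
        row.x * m.x_axis.y + row.y * m.y_axis.y + row.z * m.z_axis.y,
        row.x * m.x_axis.z + row.y * m.y_axis.z + row.z * m.z_axis.z,
        row.x * m.x_axis.w + row.y * m.y_axis.w + row.z * m.z_axis.w };
}

// Variant 1: inputs and result are passed by value.
inline mat3 mat_mul_value(const mat3 lhs, const mat3 rhs)
{
    return mat3{ mul_row(lhs.x_axis, rhs), mul_row(lhs.y_axis, rhs), mul_row(lhs.z_axis, rhs) };
}

// Variant 2: inputs by reference, result through an out parameter.
// 'out' must not alias 'lhs' or 'rhs': rows are written before all reads are done.
inline void mat_mul_ref(const mat3& lhs, const mat3& rhs, mat3& out)
{
    out.x_axis = mul_row(lhs.x_axis, rhs);
    out.y_axis = mul_row(lhs.y_axis, rhs);
    out.z_axis = mul_row(lhs.z_axis, rhs);
}

inline bool same(const vec4& a, const vec4& b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}

inline bool same(const mat3& a, const mat3& b)
{
    return same(a.x_axis, b.x_axis) && same(a.y_axis, b.y_axis) && same(a.z_axis, b.z_axis);
}

inline void example_usage()
{
    const mat3 identity{ {1.0f, 0.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 1.0f, 0.0f} };
    const mat3 m{ {1.0f, 2.0f, 3.0f, 0.0f}, {4.0f, 5.0f, 6.0f, 0.0f}, {7.0f, 8.0f, 9.0f, 0.0f} };
    const mat3 by_value = mat_mul_value(identity, m);
    mat3 by_ref{};
    mat_mul_ref(identity, m, by_ref);
    assert(same(by_value, m));
    assert(same(by_ref, m));
}
```

The out-parameter form adds an aliasing constraint that the by-value form does not have, which is worth weighing alongside any speed difference.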
Additionally, I would like to mention that the performance issues discussed here are occurring on ARM64 Android devices. The performance on Windows and Mac aligns with your expectations.

@daliziql
Author

One more thing: our project targets a fairly recent C++ standard, and this PR does not yet handle C++11 compatibility well, so I need to make some adjustments.

@daliziql daliziql force-pushed the only_float branch 4 times, most recently from 788f574 to a995282 Compare June 25, 2024 05:22
@nfrechette
Owner

Thank you for the clarification. I will see if I can add a benchmark test based on your sample and see if I can reproduce locally.

What kind of processors/android device are you seeing this on?

I'll take a look at this in the next 2 weeks.

@daliziql
Author

daliziql commented Jul 1, 2024

The processor is a Snapdragon XR2 Gen 2.

@daliziql
Author

daliziql commented Jul 1, 2024

I encountered an issue with unit tests. The configurations build pull request / vs2022 (vs2022-clang, release, x64, -simd) and build pull request / vs2022 (vs2022-clang, release, x64, -avx) are indicating that some unit tests are failing. However, when I compile locally with the same CMake options, all tests pass. Do you have any additional information you can provide?

@nfrechette
Owner

Yes, those failures are probably due to a known compiler/toolchain issue, see this PR for details: #212

I wouldn't worry about it for now. I'm waiting for github to update the image with a newer VS version that has a fixed clang version. Sadly, for reasons unknown, RTM ends up triggering a LOT of compiler bugs in various toolchains. Over the years, I've found dozens of bugs (and reported many) in msvc, gcc, and clang. Thankfully, it has gotten better over the years.

Owner

@nfrechette left a comment

Lots of good stuff in here:

  • I like the idea of moving the vector mix details into its own header
  • I like the idea of using template specialization, I've run into a lot of codegen issues with the existing function due to relying on constexpr branches
  • I like the idea of using std::enable_if to validate and branch variants

Just needs a bit of cleaning up and minor tweaks to bring back the missing AVX/NEON specializations for vector mix, see notes.

I'll profile the matrix argument passing change in the coming days and get back to you.

@@ -0,0 +1,193 @@
#pragma once
Owner

Good idea to move this into its own header.
As a convention, all similar headers have the suffix .impl.h
The header is also missing the MIT license information, see other headers as example.

#include "rtm/types.h"
#include "rtm/impl/compiler_utils.h"
#include "rtm/scalarf.h"
#include "rtm/scalard.h"
Owner

Is this header required?


#if defined(RTM_SSE2_INTRINSICS) || defined(RTM_AVX_INTRINSICS)

#define SHUFFLE_MASK(a0,a1,b2,b3) ( (a0) | ((a1)<<2) | ((b2)<<4) | ((b3)<<6) )
Owner

There is already a standardized macro for this, see _MM_SHUFFLE in Intel intrinsic documentation.
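
As a quick check of how the two encodings relate, the sketch below re-implements both bit layouts as `constexpr` functions so that it builds without the SSE headers. The function names are hypothetical; the second one mirrors the layout of `_MM_SHUFFLE(z, y, x, w)`, which takes its arguments from the highest lane to the lowest:

```cpp
#include <cassert>

// Same bit layout as the SHUFFLE_MASK(a0, a1, b2, b3) macro in this diff:
// arguments are listed from the lowest output lane to the highest.
constexpr int shuffle_mask(int a0, int a1, int b2, int b3)
{
    return a0 | (a1 << 2) | (b2 << 4) | (b3 << 6);
}

// Same bit layout as _MM_SHUFFLE(z, y, x, w): arguments are listed from the
// highest output lane to the lowest. Rewritten here to avoid <xmmintrin.h>.
constexpr int mm_shuffle(int z, int y, int x, int w)
{
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// The two encodings agree once the argument order is reversed.
static_assert(shuffle_mask(0, 1, 2, 3) == mm_shuffle(3, 2, 1, 0), "same mask, reversed argument order");
static_assert(shuffle_mask(0, 1, 2, 3) == 0xE4, "identity shuffle encodes as 0xE4");

inline void example_usage()
{
    assert(shuffle_mask(2, 3, 0, 1) == mm_shuffle(1, 0, 3, 2));
}
```

The reversed argument order is the main thing to watch when swapping one macro for the other.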

// Float swizzle
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
template<int index0, int index1, int index2, int index3>
RTM_FORCE_INLINE vector4f vector_swizzle_impl(const vector4f& vec)
Owner

Since this is an implementation detail, we should move all of this into the rtm_impl namespace

{
return _mm_shuffle_ps(vec, vec, SHUFFLE_MASK(index0, index1, index2, index3));
}
template<> RTM_FORCE_INLINE vector4f vector_swizzle_impl<0, 1, 2, 3>(const vector4f& vec)
Owner

I like the idea of using template specialization for this
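
A scalar sketch of the specialization idea, assuming a hypothetical `vec4` struct rather than `vector4f`: the primary template handles any index combination, and a full specialization short-circuits the identity case without relying on the optimizer to notice it:

```cpp
#include <cassert>

// Hypothetical four-lane vector used only for this illustration.
struct vec4 { float v[4]; };

// Primary template: general lane permutation.
template<int i0, int i1, int i2, int i3>
inline vec4 swizzle(const vec4& in)
{
    static_assert(i0 >= 0 && i0 < 4 && i1 >= 0 && i1 < 4 &&
                  i2 >= 0 && i2 < 4 && i3 >= 0 && i3 < 4,
                  "swizzle indices must be in [0, 4)");
    return vec4{{ in.v[i0], in.v[i1], in.v[i2], in.v[i3] }};
}

// Full specialization: the identity permutation returns its input untouched.
template<>
inline vec4 swizzle<0, 1, 2, 3>(const vec4& in)
{
    return in;
}

inline void example_usage()
{
    const vec4 a{{1.0f, 2.0f, 3.0f, 4.0f}};
    const vec4 reversed = swizzle<3, 2, 1, 0>(a);
    assert(reversed.v[0] == 4.0f && reversed.v[3] == 1.0f);
    const vec4 same = swizzle<0, 1, 2, 3>(a);
    assert(same.v[0] == 1.0f && same.v[3] == 4.0f);
}
```

In a SIMD implementation the specializations would map to dedicated instructions; here they only demonstrate the dispatch mechanism.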

//////////////////////////////////////////////////////////////////////////
// Writes a vector4 to aligned memory.
//////////////////////////////////////////////////////////////////////////
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE void RTM_SIMD_CALL vector_store_aligned(vector4f_arg0 input, float* output) RTM_NO_EXCEPT
Owner

Idem, vector4d version is missing along with unit test

output[0] = vector_get_x(input);
output[1] = vector_get_y(input);
#elif defined(RTM_NEON_INTRINSICS)
vst1_f32(output, *(float32x2_t*)&input);
Owner

This reinterpret_cast isn't safe; it would be better to use vget_low_f32
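
A sketch of the suggested form, assuming hypothetical helper names (`load4`, `store2`, `store3`). The NEON path uses `vget_low_f32` to extract the two low lanes as a `float32x2_t` value instead of casting a pointer; the fallback path exists only so that the example also builds on targets without NEON:

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

using vec4 = float32x4_t;

inline vec4 load4(const float* input) { return vld1q_f32(input); }

// Stores lanes 0 and 1: vget_low_f32 yields them as a value, no pointer cast.
inline void store2(vec4 input, float* output)
{
    vst1_f32(output, vget_low_f32(input));
}

// Stores lanes 0, 1 and 2: output is already a float*, so no cast is needed.
inline void store3(vec4 input, float* output)
{
    vst1_f32(output, vget_low_f32(input));
    vst1q_lane_f32(output + 2, input, 2);
}
#else
// Scalar fallback so the sketch compiles on non-NEON targets.
struct vec4 { float v[4]; };

inline vec4 load4(const float* input)
{
    return vec4{{ input[0], input[1], input[2], input[3] }};
}

inline void store2(const vec4& input, float* output)
{
    output[0] = input.v[0];
    output[1] = input.v[1];
}

inline void store3(const vec4& input, float* output)
{
    output[0] = input.v[0];
    output[1] = input.v[1];
    output[2] = input.v[2];
}
#endif

inline void example_usage()
{
    const float source[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float result[3] = { 0.0f, 0.0f, 0.0f };
    store3(load4(source), result);
    assert(result[0] == 1.0f && result[1] == 2.0f && result[2] == 3.0f);
}
```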

output[0] = vector_get_x(input);
output[1] = vector_get_y(input);
output[2] = vector_get_z(input);
#elif defined(RTM_NEON_INTRINSICS)
vst1_f32(output, *(float32x2_t*)&input);
vst1q_lane_f32(((float32_t*)output) + 2, input, 2);
Owner

Idem about vget_low_f32 and here the cast for output isn't necessary

//////////////////////////////////////////////////////////////////////////
// 3D cross product: lhs x rhs
//////////////////////////////////////////////////////////////////////////
RTM_DISABLE_SECURITY_COOKIE_CHECK RTM_FORCE_INLINE vector4f RTM_SIMD_CALL vector_cross3(vector4f_arg0 lhs, vector4f_arg1 rhs) RTM_NO_EXCEPT
Owner

Why move this?

//////////////////////////////////////////////////////////////////////////
template <int index0, int index1, int index2, int index3,
typename std::enable_if<(index0 < 4 && index1 < 4 && index2 >= 4 && index3 >= 4), int>::type = 0>
vector4f vector_swizzle_with_index(vector4f_arg0 input0, vector4f_arg1 input1)
Owner

We cannot rename a function without retaining the old one and deprecating it. If this is meant to replace the old one, why give it a new name?
Also, here we are missing the other parts of the function signature (e.g. RTM_SIMD_CALL etc), see original function signature.

Owner

Is it an implementation detail? In that case, it belongs in the rtm_impl namespace

Author

The addition of this new function wasn't my intention. In my project, I use if constexpr to perform compile-time branch evaluation. However, RTM needs to ensure normal operation under C++11, so I added the with_index function to handle the evaluation of template parameters that have been converted to int. Are you suggesting that instead of adding the with_index function, I should modify vector_mix itself? Or should I move the with_index function to the rtm_impl namespace?

Owner

Apologies for the confusion, if this is a new function that is an implementation detail (and not meant for end users) then it would belong in the rtm_impl namespace like the others. Functions in that namespace can be easily changed in any release as I don't maintain ABI compatibility for implementation details.

The lack of if constexpr is a pain, I agree but sadly many tool chains still commonly used don't fully implement later C++ versions. However, RTM provides rtm_impl::static_condition that allows similar usage (e.g. see old vector_mix impl). I'll let you decide which pattern to use for this, but I'd ask that if you decide to keep the with_index function, please move it to the rtm_impl namespace.
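
For reference, a C++11-compatible sketch of `std::enable_if` overload selection on the mix indices. It does not use `rtm_impl::static_condition`; every name in it (`mix_path`, `select_mix_path`, and the predicate helpers) is hypothetical, and it only reports which branch would be taken rather than performing a shuffle:

```cpp
#include <cassert>
#include <type_traits>

enum class mix_path { shuffle, swizzle_first, swizzle_second, generic };

// Mutually exclusive predicates over the four lane indices (0-3: first input, 4-7: second).
constexpr bool is_shuffle(int i0, int i1, int i2, int i3) { return i0 < 4 && i1 < 4 && i2 >= 4 && i3 >= 4; }
constexpr bool is_first_only(int i0, int i1, int i2, int i3) { return i0 < 4 && i1 < 4 && i2 < 4 && i3 < 4; }
constexpr bool is_second_only(int i0, int i1, int i2, int i3) { return i0 >= 4 && i1 >= 4 && i2 >= 4 && i3 >= 4; }
constexpr bool is_generic(int i0, int i1, int i2, int i3)
{
    return !is_shuffle(i0, i1, i2, i3) && !is_first_only(i0, i1, i2, i3) && !is_second_only(i0, i1, i2, i3);
}

// Exactly one overload survives substitution for any given set of indices.
template<int i0, int i1, int i2, int i3,
         typename std::enable_if<is_shuffle(i0, i1, i2, i3), int>::type = 0>
constexpr mix_path select_mix_path() { return mix_path::shuffle; }

template<int i0, int i1, int i2, int i3,
         typename std::enable_if<is_first_only(i0, i1, i2, i3), int>::type = 0>
constexpr mix_path select_mix_path() { return mix_path::swizzle_first; }

template<int i0, int i1, int i2, int i3,
         typename std::enable_if<is_second_only(i0, i1, i2, i3), int>::type = 0>
constexpr mix_path select_mix_path() { return mix_path::swizzle_second; }

template<int i0, int i1, int i2, int i3,
         typename std::enable_if<is_generic(i0, i1, i2, i3), int>::type = 0>
constexpr mix_path select_mix_path() { return mix_path::generic; }

static_assert(select_mix_path<0, 1, 4, 5>() == mix_path::shuffle, "two low lanes, then two high lanes");
static_assert(select_mix_path<0, 4, 1, 5>() == mix_path::generic, "interleaved lanes need the fallback");

inline void example_usage()
{
    // Extra parentheses keep the template commas from splitting the assert macro arguments.
    assert((select_mix_path<2, 2, 2, 2>() == mix_path::swizzle_first));
    assert((select_mix_path<7, 6, 5, 4>() == mix_path::swizzle_second));
}
```

Keeping the predicates mutually exclusive is what prevents ambiguous overloads; a missing or overlapping case shows up as a compile error at the call site.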

@nfrechette
Owner

I added a benchmark to profile argument passing for matrix3x3f here: #219
On my M1 laptop, passing by value is a clear winner and the generated assembly by apple clang makes sense to me. The details are in the cpp files.

The results are as follows for me:

Benchmark                            Time             CPU   Iterations
bm_matrix3x3_arg_passing_ref         24.0 ns         24.0 ns     28645439
bm_matrix3x3_arg_passing_value       16.1 ns         16.1 ns     43122832

This is in line with my expectations and confirms why I choose to pass as many aggregates by register as possible:

  • Passing by value means that functions use fewer instructions: no need to load/store, only maybe mov. This will improve their chances of getting inlined, speeding things up even further.
  • It leaves the job of loading/storing to the caller which can better schedule the instructions to hide some of their latency (assuming we hit L1 cache).
  • Intermediary aggregates can sometimes be held entirely in register (e.g. matrix_mul(a, matrix_mul(b, c)))
  • Because we don't know how the input arguments were generated, when we load from memory, we may hit store-forwarding which is reasonably fast but not free. By passing by value, the caller has to pay the price of the load and any potential stalls when profiling.

I will see with my Pixel 7 android phone if I can replicate the results when I get the chance this week. I suspect that the results will be consistent.

It may be worthwhile digging further into your benchmark and how you measured the difference. Did you only measure in a micro-benchmark or did you also observe an improvement in a non-synthetic use case (e.g. actual application)? Micro-benchmarks only offer a narrow view and can sometimes not capture the actual cost/benefit of an approach vs another. How did the assembly look before/after the change?

It may also be worthwhile trying to run my benchmark on your device to see if you can reproduce my findings. From there, perhaps you may be able to tweak it to showcase the results you've seen on your end.

@nfrechette
Owner

The CI also ran my benchmark on x64 SSE2 with clang 14 and we can see there that the calling convention not returning aggregates by register indeed causes performance issues:

Benchmark                                   Time             CPU   Iterations
bm_matrix3x3_arg_passing_ref                27.8 ns         27.8 ns     25140949
bm_matrix3x3_arg_passing_value              51.8 ns         51.8 ns     13516762

I'll have to see what the generated assembly looks like there. Later this/next week I'll give that a try on my Zen2 desktop.

@daliziql
Author

daliziql commented Jul 6, 2024

Our test results were also analyzed based on benchmark reports under Android, but the code in the benchmark is slightly more complex. The conclusion of the comparison was initially surprising: SIMD performance for double was actually better than for float, which was counterintuitive. We eventually found that there was a difference in parameter passing between the two, so we made some modifications to the parameter passing. The improved float performance indeed saw a significant boost. Regarding the code in the PR, there are many non-standard parts, and I will modify them one by one. Next week, I will also run your benchmark code on my device to see how it differs from my benchmark.

@nfrechette
Owner

Here are some more notes profiling argument passing on my Zen2 desktop.

With VS2022 SSE2 and __vectorcall, the results are as follows:

Benchmark                              Time             CPU   Iterations
bm_matrix3x3_arg_passing_current       18.8 ns         18.4 ns     37333333
bm_matrix3x3_arg_passing_ref           21.9 ns         21.5 ns     32000000
bm_matrix3x3_arg_passing_value         14.5 ns         14.4 ns     49777778

This is because I originally opted to not pass the second argument by value. This may appear sub-optimal in this synthetic benchmark but in practice, it depends a lot on the function signature. __vectorcall assigns registers in slots where a slot can be used by an int/float/vec. Even though an int might not use an XMM register, MSVC ignores this and that XMM register slot will not be assigned. As such, RTM allows some slack for aggregate types and won't use all registers. I'm not fully sold on this, it needs to be measured in a more complex benchmark that isn't synthetic. Here again, passing by value beats passing by reference.

With VS2022 SSE2 without __vectorcall, the results are as follows:

Benchmark                              Time             CPU   Iterations
bm_matrix3x3_arg_passing_current       30.1 ns         28.6 ns     22400000
bm_matrix3x3_arg_passing_ref           21.8 ns         22.2 ns     34461538
bm_matrix3x3_arg_passing_value         33.1 ns         33.0 ns     21333333

Here, surprisingly, we can see that passing by value is slower than by reference. It is slower because with the default calling convention, vectors passed by value are written on the stack and thus passed by reference underneath the hood. Current is also slower. Here, current ends up returning the matrix by value on the stack while arguments are passed by reference, and it must be copied to the actual variable upon return. This is why it is slower than by reference where the return address is provided by an argument. __vectorcall is a clear winner here as it can avoid a lot of extra work.

With VS2022 SSE2 and Clang 17, the results are as follows:

Benchmark                              Time             CPU   Iterations
bm_matrix3x3_arg_passing_current       30.0 ns         30.1 ns     24888889
bm_matrix3x3_arg_passing_ref           22.1 ns         21.5 ns     32000000
bm_matrix3x3_arg_passing_value         31.1 ns         31.8 ns     23578947

The numbers here are slightly different but consistent with the SSE2 non-vectorcall ones. The assembly is slightly different but the end result is the same for all 3.

With my Pixel 7, the results are as follows:

Benchmark                              Time             CPU   Iterations
bm_matrix3x3_arg_passing_current       18.5 ns         18.5 ns     37515808
bm_matrix3x3_arg_passing_ref           25.9 ns         25.8 ns     27024174
bm_matrix3x3_arg_passing_value         18.5 ns         18.4 ns     37885260

Here as well, the numbers are consistent with my M1 laptop: passing and returning by value is faster than by reference.

Overall, it's tricky. What is optimal for NEON and vectorcall isn't optimal elsewhere.

@daliziql
Author

daliziql commented Jul 8, 2024

Thank you very much for sharing the data. It seems that maintaining the original method of passing parameters by value would better meet the requirements of the rtm library. I have also extracted the business-related content from my local project and conducted benchmark tests specifically on passing parameters by value versus passing parameters by reference. The results show that the performance of both methods is almost identical, and there is no significant advantage of passing parameters by reference over passing parameters by value. I apologize for the premature and incorrect conclusion I made earlier. Once again, thank you for your professional response, which has been very beneficial to me. I will spend some more time analyzing the actual cause of the issue in my project.

@nfrechette
Owner

Thank you for taking the time to dig deeper :)

Writing synthetic benchmarks is as much art as it is science. It is not trivial, especially for simple low-level functions with few instructions like this. It is very easy to end up measuring side effects that you did not intend to measure or account for. I've made many mistakes in the past when writing them and in the end, sometimes, it isn't possible to capture the true impact that would be seen in real world usage. I've seen many cases where synthetic benchmarks show a win for something vs another which turns out to be the opposite in real code due to inlining and scheduling (of such small low-level things). As an example, I spent at least 3-6 months figuring out how to properly benchmark animation decompression: https://github.com/nfrechette/acl/blob/ac1ea98938eef4d4bd4c9742a059cb886cad19d5/tools/acl_decompressor/sources/benchmark.cpp#L50
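
One of the simpler pitfalls, the compiler deleting the work being measured, can be sketched with the standard library alone. This is a hypothetical stand-in for a real harness such as the one used in this thread: `time_calls` and `timing` are made-up names, and the sketch does not control for inlining, frequency scaling, or cache state:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

struct timing
{
    double ns_per_call;   // average wall-clock time per call
    float last_result;    // last value produced, read back from the sink
};

// Calls fn(i) 'iterations' times and reports the average time per call.
template<typename Fn>
inline timing time_calls(Fn&& fn, std::size_t iterations)
{
    assert(iterations > 0);
    // Each result is written to a volatile so the calls cannot be optimized away.
    volatile float sink = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        sink = fn(static_cast<float>(i));
    const auto stop = std::chrono::steady_clock::now();
    const double total_ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return timing{ total_ns / static_cast<double>(iterations), sink };
}

inline void example_usage()
{
    const timing t = time_calls([](float x) { return x * 2.0f; }, 1000);
    assert(t.last_result == 1998.0f);   // the last call was fn(999)
    assert(t.ns_per_call >= 0.0);
}
```

The volatile store is itself a side effect that ends up in the measurement, which is a small instance of the problem described above.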

In the end, sometimes it isn't possible to write a function that is optimal on every architecture or in every usage scenario. RTM aims to provide sane defaults where possible, but it is expected that if you need specialized versions for your needs (due to the unique circumstances of your code) you'll write them outside RTM. For example, sometimes you need a matrix function inlined in a very hot, tight loop even though in general you might not want to always inline it due to code size bloat. Another is with my animation compression library where I need stable versions of quaternion functions that won't change as I update RTM to ensure determinism over time. That being said, if you think something is more widely useful and should belong within RTM, feel free to submit a PR as you did and we can discuss and consider it :)

@nfrechette
Owner

nfrechette commented Jul 9, 2024

Out of curiosity, I also added the same benchmark for matrix3x3d to see.

Benchmark                                     Time             CPU   Iterations
bm_matrix3x3d_arg_passing_current             34.6 ns         34.6 ns     20240340
bm_matrix3x3d_arg_passing_ref                 25.6 ns         25.5 ns     27587077
bm_matrix3x3d_arg_passing_value               34.3 ns         34.3 ns     20299917
bm_matrix3x3f_arg_passing_current             16.2 ns         16.2 ns     43312275
bm_matrix3x3f_arg_passing_ref                 24.0 ns         24.0 ns     29134134
bm_matrix3x3f_arg_passing_value               16.2 ns         16.2 ns     43387982

Doubles are slower but as you've found, when passing by reference, they are almost as fast even though they use scalar arithmetic instead of using SIMD pairs. With SIMD pairs, perhaps double could get faster under this synthetic benchmark. However, passing by value (which is currently the default for the first matrix3x3d argument) is quite a bit slower. It appears that with doubles, the aggregate structures are not passed by register as arguments nor are they returned by register as a return value. This means that we have to roundtrip to the stack every time :( I'll add a note to double check this with the NEON documentation as it appears that this could be improved. That might also change down the road once I optimize doubles to use SIMD pairs. Thanks to your input, we now have benchmarks to track this :)
