Description
This code:
#include <immintrin.h>

float foo(__m512 x) {
    return _mm512_reduce_add_ps(x);
}
compiles to the following LLVM IR:

define dso_local noundef float @foo(float vector[16])(<16 x float> noundef %x) local_unnamed_addr #0 {
entry:
  %0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
  ret float %0
}
According to the LangRef, the reassoc here means that the additions may be performed in any order, which is not what Intel documents -- their pseudocode specifies a particular, "tree-like" reduction order.
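For reference, here is that order as I read the pseudocode in Intel's Intrinsics Guide, written out as a scalar C sketch (tree_reduce_add is my own illustrative helper, not anything Clang emits):

/* Scalar sketch of the documented "tree-like" order: add the upper half
 * onto the lower half, then repeat on the lower half until one element
 * is left. Modifies v in place. */
static float tree_reduce_add(float v[16]) {
    for (int len = 8; len > 0; len /= 2)      /* 16 -> 8 -> 4 -> 2 -> 1 */
        for (int i = 0; i < len; i++)
            v[i] = v[i] + v[i + len];         /* fixed pairing, fixed order */
    return v[0];
}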
Even worse, we can chain two of these operations:
#include <immintrin.h>

float foo(__m512 x) {
    float xr = _mm512_reduce_add_ps(x);
    __m512 y = _mm512_set_ps(
        xr, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
        0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
    );
    return _mm512_reduce_add_ps(y);
}
Now the second addition may be arbitrarily re-associated with the first one. As far as I understand, there is nothing about reassoc that constrains the re-association to only happen "inside" a single operation (and indeed, as a fast-math flag it is explicitly intended to apply when multiple subsequent operations all carry reassoc).
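Because float addition is not associative, re-association across the two reductions is observable in the result. A minimal standalone illustration (my own example, not derived from the code above):

#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("(a + b) + c = %g\n", (a + b) + c);  /* 1 */
    printf("a + (b + c) = %g\n", a + (b + c));  /* 0 on typical IEEE-754 targets:
                                                   1.0f is absorbed by -1.0e8f */
    return 0;
}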
Clang should probably either lower _mm512_reduce_add_ps to a vendor-specific intrinsic, or LLVM IR needs a version of vector.reduce.fadd that explicitly specifies the "tree-like" reduction order documented by Intel.
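For comparison, the documented pairing can be written out today with explicit intrinsics, which leaves no room for re-association under default FP semantics. This is only a sketch (reduce_add_tree is my own helper name, and it assumes AVX512DQ for _mm512_extractf32x8_ps), not a proposed implementation:

#include <immintrin.h>

/* Same 16 -> 8 -> 4 -> 2 -> 1 pairing as Intel's pseudocode, spelled out
 * with explicit extracts, shuffles and adds. */
static inline float reduce_add_tree(__m512 v) {
    __m256 lo8 = _mm512_castps512_ps256(v);
    __m256 hi8 = _mm512_extractf32x8_ps(v, 1);
    __m256 s8  = _mm256_add_ps(lo8, hi8);                       /* 16 -> 8 */

    __m128 lo4 = _mm256_castps256_ps128(s8);
    __m128 hi4 = _mm256_extractf128_ps(s8, 1);
    __m128 s4  = _mm_add_ps(lo4, hi4);                          /* 8 -> 4 */

    __m128 s2  = _mm_add_ps(s4, _mm_movehl_ps(s4, s4));         /* 4 -> 2 */
    __m128 s1  = _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, 0x55));  /* 2 -> 1 */
    return _mm_cvtss_f32(s1);
}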