-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Implement Vector.AddSaturate/SubtractSaturate #107193
Conversation
Note regarding the
|
1 similar comment
Note regarding the
|
Tagging subscribers to this area: @dotnet/area-system-numerics |
Could the existing internal runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Line 3904 in 4c10eff
|
I removed the existing methods that I could find in this PR, but perhaps there are additional methods I have missed. |
if (AdvSimd.IsSupported) | ||
{ | ||
if (typeof(T) == typeof(byte)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>(); | ||
} | ||
if (typeof(T) == typeof(sbyte)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>(); | ||
} | ||
if (typeof(T) == typeof(short)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>(); | ||
} | ||
if (typeof(T) == typeof(ushort)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>(); | ||
} | ||
if (typeof(T) == typeof(int)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsInt32(), right.AsInt32()).As<int, T>(); | ||
} | ||
if (typeof(T) == typeof(uint)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsUInt32(), right.AsUInt32()).As<uint, T>(); | ||
} | ||
if (typeof(T) == typeof(long)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsInt64(), right.AsInt64()).As<long, T>(); | ||
} | ||
if (typeof(T) == typeof(ulong)) | ||
{ | ||
return AdvSimd.AddSaturate(left.AsUInt64(), right.AsUInt64()).As<ulong, T>(); | ||
} | ||
} | ||
|
||
if (Sse2.IsSupported) | ||
{ | ||
if (typeof(T) == typeof(byte)) | ||
{ | ||
return Sse2.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>(); | ||
} | ||
if (typeof(T) == typeof(sbyte)) | ||
{ | ||
return Sse2.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>(); | ||
} | ||
if (typeof(T) == typeof(short)) | ||
{ | ||
return Sse2.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>(); | ||
} | ||
if (typeof(T) == typeof(ushort)) | ||
{ | ||
return Sse2.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>(); | ||
} | ||
} | ||
|
||
if (PackedSimd.IsSupported) | ||
{ | ||
if (typeof(T) == typeof(byte)) | ||
{ | ||
return PackedSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>(); | ||
} | ||
if (typeof(T) == typeof(sbyte)) | ||
{ | ||
return PackedSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>(); | ||
} | ||
if (typeof(T) == typeof(short)) | ||
{ | ||
return PackedSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>(); | ||
} | ||
if (typeof(T) == typeof(ushort)) | ||
{ | ||
return PackedSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>(); | ||
} | ||
} | ||
|
||
if (IsHardwareAccelerated) | ||
{ | ||
return VectorMath.AddSaturate<Vector128<T>, T>(left, right); | ||
} | ||
|
||
return Create( | ||
Vector64.AddSaturate(left._lower, right._lower), | ||
Vector64.AddSaturate(left._upper, right._upper) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".
Rather instead we want them to be implemented in the JIT so that they don't eat away at the inlining budget or run into other issues.
Doing this requires adding an AddSaturate
entry to https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistarm64.h, for the relevant vector sizes (and mostly mirroring the entry for op_Additition
)
You'd then add handling for that in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L1387-L1402 and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicarm64.cpp#L700-L710, mostly following op_Addition
again; but since we don't have a general GT_*
kind, you'd instead use gtNewSimdHWIntrinsicNode(retType, op1, op2, intrinsic, simdBaseJitType, simdSize)
where intrinsic
is NI_ISA_Name
, such as NI_SSE2_AddSaturate
For int
, uint
, long
, and ulong
on x86/x64, you'd need to implement handling as well. Unsigned
is simple as its effectively just the following, as x + y
will always be greater than or equal to either input, unless it overflows:
var tmp = x + y;
return Vector.ConditionalSelect(
Vector.LessThan(tmp, x),
MaxValue,
tmp
);
Signed is a bit trickier, but it basically boils down to (there may be a more efficient way, but this is the basics):
var z = x + y;
return Vector.ConditionalSelect(
(((x ^ y) ^ SignMask) & (x ^ z)) >> (sizeof(T) * 8 - 1),
SignMask ^ (z >> (sizeof(T) * 8 - 1))),
z
);
This works because x + y
for differing signs cannot overflow; while for same signs it can. In general, given two bool
you can detect equality via x ^ y ^ 1
and inequality via x ^ y
. Given that we want (signX == signY) && (signX != signZ)
that gives us the (x ^ y ^ 1) & (x ^ z)
given above to determine if overflow occurred. We then arithmetic right shift to propagate the bit so we get AllBitsSet
(overflow occurred) or Zero
(no overflow) per-element.
If overflow did occur, then we know that a negative result means it should be MaxValue
while a positive result means it should be MinValue
. Artihmetic shifting z
gives us AllBitsSet
(negative) or Zero
(positive) on a per-element basis, we just need to xor with the sign mask. This gives us 0xFFFF_FFFF ^ 0x8000_0000
or 0x0000_0000 ^ 0x8000_0000
, thus negative results become 0x7FFF_FFFF
(MaxValue
) and positive results become 0x8000_0000
(MinValue
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".
Understood. Thanks for the clear instructions on how to implement this in JIT instead 👍 .
For
int
,uint
,long
, andulong
on x86/x64, you'd need to implement handling as well.
There is a "fallback" algorithm in the PR already in VectorMath
class.
Should the substitution be done in JIT as well for x86/x64 case too, or does it suffice to leave as it is for those cases? If handled in JIT, should the fallback in VectorMath
be kept?
I'll try setting this PR as draft until I have successfully made necessary changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If handled in JIT, should the fallback in VectorMath be kept?
We're generally using VectorMath
for cases that we can't implement in the JIT and which are unlikely to ever be inlined. We do the more naive thing for the managed implementation otherwise (often just decomposing into operating on lower/upper halves and allowing the loop to only exist as part of Vector64<T>
)
@lilinus in case you didn't knew, there is a patch created by the format leg: https://github.com/dotnet/runtime/actions/runs/10928828860?pr=107193 (under artifacts) $ cd /path/to/runtime
$ unzip ~/Downloads/format.linux.patch.zip
$ git apply format.patch
$ rm format.patch
# commit and push |
I should be getting to this soon, just working through the backlog of PRs now that I can start focusing on things for .NET 10 |
if (simdSize == 8 && varTypeIsLong(simdBaseType)) | ||
{ | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be skipped for TYP_LONG
, it should just use AddSaturateScalar
which is already exposed.
if (simdSize == 8 && varTypeIsLong(simdBaseType)) | ||
{ | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this should just use SubtractSaturateScalar
op1 = impSIMDPopStack(); | ||
retNode = gtNewSimdHWIntrinsicNode(retType, op1, op2, intrinsic, simdBaseJitType, simdSize); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling for int
, uint
, long
, and ulong
should likely still be added.
For unsigned this is simply:
var z = x + y;
return ConditionalSelect(LessThan(z, x), Create(MaxValue), z);
For signed this is (I believe, did this ad-hoc so didn't double check the logic is 100% accurate):
var z = x + y;
var o = ((x ^ y) ^ AllBitsSet) & (x ^ z);
return ConditionalSelect(IsNegative(o), ConditionalSelect(IsNegative(z), Create(MaxValue), Create(MinValue)), z);
-- For detecting overflow we only really care about whether the sign of both inputs were the same and then if the output sign differs from that. In such a case that we have overflow, we then know that a negative result should've clamped to MaxValue (it overflowed from positive and became a negative result) and positive result should've clamped to MinValue. So the logic is detecting just that, if the sign of x and y are the same, then xoring the two will clear the sign, then xoring it with AllBitsSet means we get 1
in the sign bit. We then and that with the check of whether x and z (the result) differ which tells us if overflow occurred for that value if the resulting mask is negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subtraction is much the same, just with a GreaterThan
comparison for unsigned and the same logic for signed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implemented fallback already in VectorMath.cs
. Should I try to move it to JIT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We try and avoid having relatively simple code for the core xplat APIs from being in managed, as there is a non-trivial cost to inlining, doing dead code elimination, and generally eating into the JIT budget for importing all the IR.
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
Implement #82559