Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Implement Vector.AddSaturate/SubtractSaturate #107193

Closed
wants to merge 10 commits into from

Conversation

lilinus
Copy link
Contributor

@lilinus lilinus commented Aug 30, 2024

Implement #82559

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

1 similar comment
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Aug 30, 2024
@lilinus lilinus changed the title Add sub saturate Implement.AddSaturate/SubtractSaturate Aug 30, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@xtqqczze
Copy link
Contributor

Could the existing internal AddSaturate and SubtractSaturate methods be removed?

internal static Vector128<byte> AddSaturate(Vector128<byte> left, Vector128<byte> right)

@lilinus lilinus changed the title Implement.AddSaturate/SubtractSaturate Implement Vector.AddSaturate/SubtractSaturate Sep 16, 2024
@lilinus
Copy link
Contributor Author

lilinus commented Sep 16, 2024

Could the existing internal AddSaturate and SubtractSaturate methods be removed?

I removed the existing methods that I could find in this PR, but perhaps there are additional methods I have missed.

@lilinus lilinus marked this pull request as ready for review September 16, 2024 08:48
Comment on lines 95 to 179
if (AdvSimd.IsSupported)
{
if (typeof(T) == typeof(byte))
{
return AdvSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
}
if (typeof(T) == typeof(sbyte))
{
return AdvSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
}
if (typeof(T) == typeof(short))
{
return AdvSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
}
if (typeof(T) == typeof(ushort))
{
return AdvSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
}
if (typeof(T) == typeof(int))
{
return AdvSimd.AddSaturate(left.AsInt32(), right.AsInt32()).As<int, T>();
}
if (typeof(T) == typeof(uint))
{
return AdvSimd.AddSaturate(left.AsUInt32(), right.AsUInt32()).As<uint, T>();
}
if (typeof(T) == typeof(long))
{
return AdvSimd.AddSaturate(left.AsInt64(), right.AsInt64()).As<long, T>();
}
if (typeof(T) == typeof(ulong))
{
return AdvSimd.AddSaturate(left.AsUInt64(), right.AsUInt64()).As<ulong, T>();
}
}

if (Sse2.IsSupported)
{
if (typeof(T) == typeof(byte))
{
return Sse2.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
}
if (typeof(T) == typeof(sbyte))
{
return Sse2.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
}
if (typeof(T) == typeof(short))
{
return Sse2.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
}
if (typeof(T) == typeof(ushort))
{
return Sse2.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
}
}

if (PackedSimd.IsSupported)
{
if (typeof(T) == typeof(byte))
{
return PackedSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
}
if (typeof(T) == typeof(sbyte))
{
return PackedSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
}
if (typeof(T) == typeof(short))
{
return PackedSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
}
if (typeof(T) == typeof(ushort))
{
return PackedSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
}
}

if (IsHardwareAccelerated)
{
return VectorMath.AddSaturate<Vector128<T>, T>(left, right);
}

return Create(
Vector64.AddSaturate(left._lower, right._lower),
Vector64.AddSaturate(left._upper, right._upper)
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".

Rather instead we want them to be implemented in the JIT so that they don't eat away at the inlining budget or run into other issues.

Doing this requires adding an AddSaturate entry to https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistarm64.h, for the relevant vector sizes (and mostly mirroring the entry for op_Additition)

You'd then add handling for that in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L1387-L1402 and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicarm64.cpp#L700-L710, mostly following op_Addition again; but since we don't have a general GT_* kind, you'd instead use gtNewSimdHWIntrinsicNode(retType, op1, op2, intrinsic, simdBaseJitType, simdSize) where intrinsic is NI_ISA_Name, such as NI_SSE2_AddSaturate

For int, uint, long, and ulong on x86/x64, you'd need to implement handling as well. Unsigned is simple as its effectively just the following, as x + y will always be greater than or equal to either input, unless it overflows:

var tmp = x + y;
return Vector.ConditionalSelect(
    Vector.LessThan(tmp, x),
    MaxValue,
    tmp
);

Signed is a bit trickier, but it basically boils down to (there may be a more efficient way, but this is the basics):

var z = x + y;

return Vector.ConditionalSelect(
    (((x ^ y) ^ SignMask) & (x ^ z)) >> (sizeof(T) * 8 - 1),
    SignMask ^ (z >> (sizeof(T) * 8 - 1))),
    z
);

This works because x + y for differing signs cannot overflow; while for same signs it can. In general, given two bool you can detect equality via x ^ y ^ 1 and inequality via x ^ y. Given that we want (signX == signY) && (signX != signZ) that gives us the (x ^ y ^ 1) & (x ^ z) given above to determine if overflow occurred. We then arithmetic right shift to propagate the bit so we get AllBitsSet (overflow occurred) or Zero (no overflow) per-element.

If overflow did occur, then we know that a negative result means it should be MaxValue while a positive result means it should be MinValue. Artihmetic shifting z gives us AllBitsSet (negative) or Zero (positive) on a per-element basis, we just need to xor with the sign mask. This gives us 0xFFFF_FFFF ^ 0x8000_0000 or 0x0000_0000 ^ 0x8000_0000, thus negative results become 0x7FFF_FFFF (MaxValue) and positive results become 0x8000_0000 (MinValue)

Copy link
Contributor Author

@lilinus lilinus Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".

Understood. Thanks for the clear instructions on how to implement this in JIT instead 👍 .

For int, uint, long, and ulong on x86/x64, you'd need to implement handling as well.

There is a "fallback" algorithm in the PR already in VectorMath class.
Should the substitution be done in JIT as well for x86/x64 case too, or does it suffice to leave as it is for those cases? If handled in JIT, should the fallback in VectorMath be kept?

I'll try setting this PR as draft until I have successfully made necessary changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If handled in JIT, should the fallback in VectorMath be kept?

We're generally using VectorMath for cases that we can't implement in the JIT and which are unlikely to ever be inlined. We do the more naive thing for the managed implementation otherwise (often just decomposing into operating on lower/upper halves and allowing the loop to only exist as part of Vector64<T>)

@am11
Copy link
Member

am11 commented Sep 19, 2024

@lilinus in case you didn't knew, there is a patch created by the format leg: https://github.com/dotnet/runtime/actions/runs/10928828860?pr=107193 (under artifacts)

$ cd /path/to/runtime
$ unzip ~/Downloads/format.linux.patch.zip
$ git apply format.patch
$ rm format.patch
# commit and push

@tannergooding
Copy link
Member

I should be getting to this soon, just working through the backlog of PRs now that I can start focusing on things for .NET 10

Comment on lines +717 to +720
if (simdSize == 8 && varTypeIsLong(simdBaseType))
{
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be skipped for TYP_LONG, it should just use AddSaturateScalar which is already exposed.

Comment on lines +2129 to +2132
if (simdSize == 8 && varTypeIsLong(simdBaseType))
{
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this should just use SubtractSaturateScalar

op1 = impSIMDPopStack();
retNode = gtNewSimdHWIntrinsicNode(retType, op1, op2, intrinsic, simdBaseJitType, simdSize);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling for int, uint, long, and ulong should likely still be added.

For unsigned this is simply:

var z = x + y;
return ConditionalSelect(LessThan(z, x), Create(MaxValue), z);

For signed this is (I believe, did this ad-hoc so didn't double check the logic is 100% accurate):

var z = x + y;
var o = ((x ^ y) ^ AllBitsSet) & (x ^ z);
return ConditionalSelect(IsNegative(o), ConditionalSelect(IsNegative(z), Create(MaxValue), Create(MinValue)), z);

-- For detecting overflow we only really care about whether the sign of both inputs were the same and then if the output sign differs from that. In such a case that we have overflow, we then know that a negative result should've clamped to MaxValue (it overflowed from positive and became a negative result) and positive result should've clamped to MinValue. So the logic is detecting just that, if the sign of x and y are the same, then xoring the two will clear the sign, then xoring it with AllBitsSet means we get 1 in the sign bit. We then and that with the check of whether x and z (the result) differ which tells us if overflow occurred for that value if the resulting mask is negative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtraction is much the same, just with a GreaterThan comparison for unsigned and the same logic for signed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implemented fallback already in VectorMath.cs. Should I try to move it to JIT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We try and avoid having relatively simple code for the core xplat APIs from being in managed, as there is a non-trivial cost to inlining, doing dead code elimination, and generally eating into the JIT budget for importing all the IR.

@lilinus lilinus marked this pull request as draft January 14, 2025 10:17
Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants