-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062
Comments
That's because of the cast to |
...which makes things a bit funny in the context of C#/CLR due to roundtrips of most calculations on |
Yup, importer fails to recognize arg2 as const bcs it does not walk beyond cast:
Whole importer dump
|
After simple fix:
|
No more poor codegen for HW intrinsics 😄 |
Will this fix also handle the following code ? (attempt to work this issue around)
|
Nope. It's too complex for codegen right now, but most importantly simple and equivalent approach works. |
...once you know it :-) |
Yeah. Making those parameters But then JIT's importer could and should probably do some trivial constant folding. constant/cast sequences do appear in other situations as well. Most commonly I've seen this with long constants where the C# compiler decides to save space and emits Usually it's not a good idea to mix IL importing with optimizations but there are some cases where that's reasonable, especially when dealing with IL's own encoding specifics/limitations. |
In this fix importer does not do any const folding just recognizes that cast is done on const. Later stages take care of const folding sufficiently well to let them do the job. The problem here is that Roslyn does constant folding and evidently it missed arg2. |
Yeah, right, it's not best of worlds we can have here. |
Until someone tries something like |
This should be handled by Roslyn. Will check. |
It's impossible, |
@mikedn You are right. Added function to fold const expressions during import and it works as expected: Test methods [MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Test<TChar>() where TChar : unmanaged
{
Vector128<byte> v = Sse2.SetZeroVector128<byte>();
v = Sse2.ShiftRightLogical128BitLane(v, (byte)sizeof(TChar));
Console.WriteLine(Sse41.Extract(v, 0));
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Test1<TChar>() where TChar : unmanaged
{
Vector128<byte> v = Sse2.SetZeroVector128<byte>();
v = Sse2.ShiftRightLogical128BitLane(v, (byte)((3 * sizeof(TChar)) / 2 + 4 - sizeof(TChar)));
Console.WriteLine(Sse41.Extract(v, 0));
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static unsafe void Test2<TChar>() where TChar : unmanaged
{
Vector128<byte> v = Sse2.SetZeroVector128<byte>();
byte C;
if (sizeof(TChar) == sizeof(byte))
{
v = Sse2.ShiftRightLogical128BitLane(v, 1);
}
else if (sizeof(TChar) == sizeof(ushort))
{
v = Sse2.ShiftRightLogical128BitLane(v, 2);
}
else
{
throw new InvalidOperationException();
}
Console.WriteLine(Sse41.Extract(v, 0));
} Method Test Instructions as they come out of the scheduler
G_M44985_IG01: ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28 sub rsp, 40
IN0007: 000004 90 nop
G_M44985_IG02: ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0 vpxor xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802 vpsrldq xmm0, xmm0, 2
IN0003: 000010 C4E37914C100 vpextrb ecx, xmm0, 0
[6A96D338] ptr arg pop 0
IN0004: 000016 E87DE6FFFF call System.Console:WriteLine(int)
IN0005: 00001B 90 nop
G_M44985_IG03: ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428 add rsp, 40
IN0009: 000020 C3 ret
Allocated method code size = 33 , actual size = 33 Method Test1 Instructions as they come out of the scheduler
G_M17352_IG01: ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28 sub rsp, 40
IN0007: 000004 90 nop
G_M17352_IG02: ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0 vpxor xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D805 vpsrldq xmm0, xmm0, 5
IN0003: 000010 C4E37914C100 vpextrb ecx, xmm0, 0
[6A97D9A8] ptr arg pop 0
IN0004: 000016 E845000000 call System.Console:WriteLine(int)
IN0005: 00001B 90 nop
G_M17352_IG03: ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428 add rsp, 40
IN0009: 000020 C3 ret
Allocated method code size = 33 , actual size = 33 Method Test2 Instructions as they come out of the scheduler
G_M17451_IG01: ; func=00, offs=000000H, size=0005H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
IN0006: 000000 4883EC28 sub rsp, 40
IN0007: 000004 90 nop
G_M17451_IG02: ; func=00, offs=000005H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
IN0001: 000005 C4E179EFC0 vpxor xmm0, xmm0, xmm0
IN0002: 00000A C4E17973D802 vpsrldq xmm0, xmm0, 2
IN0003: 000010 C4E37914C100 vpextrb ecx, xmm0, 0
[6A97DE18] ptr arg pop 0
IN0004: 000016 E855FEFFFF call System.Console:WriteLine(int)
IN0005: 00001B 90 nop
G_M17451_IG03: ; func=00, offs=00001CH, size=0005H, epilog, nogc, emitadd
IN0008: 00001C 4883C428 add rsp, 40
IN0009: 000020 C3 ret
Allocated method code size = 33 , actual size = 33 |
And in the importer we get: Method Test:
Method Test1
Method Test2
|
Fixes #19888 This enables expansin of imm HW Intrinsics which otherwise would be emitted with jump tables. Significantly improves user experience and performance
…uring import Fixes #19888 This enables expansion of imm HW Intrinsics which otherwise would be emitted as function calls with jump tables. Folding of const expressions happens during morphing and it means that directly before SSA generation and value numbering HW intrinsics are in final shape allowing them to participate in later optimizations once HW intrinsic support for value numbering is fully implemented. Significantly improves user experience.
…uring import Fixes #19888 This enables expansion of imm HW Intrinsics which otherwise would be emitted as function calls with jump tables. Folding of const expressions happens during morphing and it means that directly before SSA generation and value numbering HW intrinsics are in final shape allowing them to participate in later optimizations once HW intrinsic support for value numbering is fully implemented. Significantly improves user experience.
…g import Fixes #19888 Solution is based on a call to fgMorphTreeNode without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTreeNode call. Additional benefit of not using gtFoldExpr is that this method was destructive for it's operands when processed. In contrast fgMorphTreeNode returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889
… import Fixes #19888 Solution is based on a call to fgMorphTree without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTree call. Additional benefit of not using gtFoldExpr is that this method is destructive for it's operands when processed. In contrast fgMorphTree returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889
… import Fixes #19888 Solution is based on a call to fgMorphTree without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTree call. Additional benefit of not using gtFoldExpr is that this method is destructive for it's operands when processed. In contrast fgMorphTree returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889
… import Fixes #19888 Solution is based on a call to fgMorphTree without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTree call. Additional benefit of not using gtFoldExpr is that this method is destructive for it's operands when processed. In contrast fgMorphTree returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889
… import Fixes #19888 Solution is based on a call to fgMorphTree without using global morphing and null context. The execution path goes exactly through complete constant expression folding path used by jit during morphing. The chosen entry point is significantly better than gtFoldExpr as it is specifically designed to handle complete const expression tree folding including cast morphing. There seems to be no need to upgrade gtFoldExpr to have expanded functionality as it is called as a part of code execution path starting with fgMorphTree call. Additional benefit of not using gtFoldExpr is that this method is destructive for it's operands when processed. In contrast fgMorphTree returns new tree leaving old tree untouched. Therefore, in the case of failed folding, what should be a supported scenario, importer uses unmodified tree. This PR replaces dotnet#19889
I also had a similar problem with the following code: private static void Test()
{
Vector128<float> vec = Sse.SetZeroVector128();
vec = VectorInsertSingle(vec, 0, 1.0f);
vec = VectorInsertSingle(vec, 1, 1.5f);
vec = VectorInsertSingle(vec, 2, 2.0f);
vec = VectorInsertSingle(vec, 3, 2.5f);
}
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
//Produces sub-optimal code (jump table fallback,
//JIT doesn't notice that the value is constant after inline).
return Sse41.Insert(vector, value, (byte)(index << 4));
} After inline, the index value is constant, however the jump table fallback is still called. This is the assembly produced for 00007FFD4B14361B vmovss xmm2,dword ptr [7FFD4B1436E0h]
00007FFD4B143624 xor r9d,r9d
00007FFD4B143627 call 00007FFD4B143360
00007FFD4B14362C mov rcx,qword ptr [rsp+70h]
00007FFD4B143631 mov rdx,qword ptr [rsp+78h]
00007FFD4B143636 lea r9,[rsp+60h]
00007FFD4B14363B lea rax,[rsp+20h]
00007FFD4B143640 mov qword ptr [rax],rcx
00007FFD4B143643 mov qword ptr [rax+8],rdx
00007FFD4B143647 mov rcx,r9
00007FFD4B14364A lea rdx,[rsp+20h]
00007FFD4B14364F vmovss xmm2,dword ptr [7FFD4B1436E4h]
00007FFD4B143658 mov r9d,10h
00007FFD4B14365E call 00007FFD4B143360
00007FFD4B143663 mov rcx,qword ptr [rsp+60h]
00007FFD4B143668 mov rdx,qword ptr [rsp+68h]
00007FFD4B14366D lea r9,[rsp+50h]
00007FFD4B143672 lea rax,[rsp+20h]
00007FFD4B143677 mov qword ptr [rax],rcx
00007FFD4B14367A mov qword ptr [rax+8],rdx
00007FFD4B14367E mov rcx,r9
00007FFD4B143681 lea rdx,[rsp+20h]
... The values are constant (0, 0x10 etc), but it still produces the call. Trying to workaround the issue I wrote another version using a switch case that use the constant values directly instead of relying on the JIT to do constant folding: [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
switch (index)
{
case 0: return Sse41.Insert(vector, value, 0x00);
case 1: return Sse41.Insert(vector, value, 0x10);
case 2: return Sse41.Insert(vector, value, 0x20);
case 3: return Sse41.Insert(vector, value, 0x30);
default: throw new ArgumentOutOfRangeException(nameof(index));
}
} This also doesn't work well because the JIT is not capable of inlining methods with switch cases so the code produced is also sub-optimal. Finally we have this version using if/else: [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
if (index == 0)
{
return Sse41.Insert(vector, value, 0x00);
}
else if (index == 1)
{
return Sse41.Insert(vector, value, 0x10);
}
else if (index == 2)
{
return Sse41.Insert(vector, value, 0x20);
}
else if (index == 3)
{
return Sse41.Insert(vector, value, 0x30);
}
else
{
throw new ArgumentOutOfRangeException(nameof(index));
}
} This version produces more or less the expected code (this is the same 00007FFD4B1235EC vxorps xmm0,xmm0,xmm0
00007FFD4B1235F1 vmovss xmm1,dword ptr [7FFD4B123668h]
00007FFD4B1235FA vinsertps xmm0,xmm0,xmm1,0
00007FFD4B123600 vmovapd xmmword ptr [rsp+50h],xmm0
00007FFD4B123607 vmovapd xmm0,xmmword ptr [rsp+50h]
00007FFD4B12360E vmovss xmm1,dword ptr [7FFD4B12366Ch]
00007FFD4B123617 vinsertps xmm0,xmm0,xmm1,10h
00007FFD4B12361D vmovapd xmmword ptr [rsp+40h],xmm0
00007FFD4B123624 vmovapd xmm0,xmmword ptr [rsp+40h]
00007FFD4B12362B vmovss xmm1,dword ptr [7FFD4B123670h]
00007FFD4B123634 vinsertps xmm0,xmm0,xmm1,20h
00007FFD4B12363A vmovapd xmmword ptr [rsp+30h],xmm0
00007FFD4B123641 vmovapd xmm0,xmmword ptr [rsp+30h]
00007FFD4B123648 vmovss xmm1,dword ptr [7FFD4B123674h]
00007FFD4B123651 vinsertps xmm0,xmm0,xmm1,30h
00007FFD4B123657 vmovapd xmmword ptr [rsp+20h],xmm0
00007FFD4B12365E add rsp,68h
00007FFD4B123662 ret The method was properly inlined, however there are still some register moves that I believe that are not necessary (for example XMM0 is spilled then immediately moved back). Furthermore this instruction also accepts a memory location, so the whole method after inline could be just a single instruction, something like this: vxorps xmm0,xmm0,xmm0
vinsertps xmm0,xmm0,dword ptr [7FFD4B123668h],0
vinsertps xmm0,xmm0,dword ptr [7FFD4B12366Ch],10h
vinsertps xmm0,xmm0,dword ptr [7FFD4B123670h],20h
vinsertps xmm0,xmm0,dword ptr [7FFD4B123674h],30h I know there are more efficient ways to load a vector, this was just to demonstrate the usage of the Those tests were made with dotnet version Another thing that may be worth pointing out is that the fyi @fiigii Sorry if this is not the right place for that, but the issue is similar so I felt that opening a separate one was not necessary. |
All new work on HW intrinsics is in master branch what means that it is not available in 2.2 preview2 and will not be available in 2.2 RTM HW Intrinsic with immediate operand expansionThe method below cannot be handled correctly by Jit even with current not accepted solution from dotnet/coreclr#19898. public static Vector128<float> VectorInsertSingle(Vector128<float> vector, byte index, float value)
{
//Produces sub-optimal code (jump table fallback,
//JIT doesn't notice that the value is constant after inline).
return Sse41.Insert(vector, value, (byte)(index << 4));
} The first problem is that Jit needs to recognize There is still a second problem here caused by method inlining. Inlining happens after decision on intrinsic expansion is made and Jit currently will not reconsider if immediate HW intrinsic should be expanded after it is inlined and arguments semantics changes like in your code. As a result your method despite working on semantically valid assumptions for Jit is not syntactically equivalent to intrinsic with imm8 constant. I have created issue to track the second problem as this may be important in many situations as long as we do not have |
@gdkchan Thanks for reporting the codegen issue. The problem is different from this issue but also detected in https://github.com/dotnet/coreclr/issues/17108. @CarolEidt @AndyAyersMS @mikedn @tannergooding
The first issue can be solved by morphing the expressions in the importer (like dotnet/coreclr#19898), but both the two issues can be naturally solved by https://github.com/dotnet/coreclr/issues/17108. I know there is some difficulty to introduce call-node in lowering, can we just try to expend intrinsic behind inlining and constant propagation/folding? |
IMO we should reevaluate several problems:
IMHO there is no single simple solution to problems which will be haunting imm HW intrinsic implementation, whether it would be dotnet/coreclr#19898 or #9989 or dotnet/csharplang#886 or current implementation of non-const parameter handling. Scenarios which as turns out should be supported are much more complex:
IMHO we should do: What could be helpful for (ii) it would be HW intrinsic support implementation for value numbering and following optimizations what could form a basis for further optimizations and even in more distant future autovectorization support. |
Basically what you want is to expand on import and then attempt to expand again during lowering, hoping that non-const parameters have become const. Not expanding on import at all is not an option, you really don't want to keep a bunch of call nodes if you can avoid it. Allways expanding on import and falling back to call in lowering is not an option as well, at least not a trivial one. So you really don't have much of a choice. Well, there's also the choice of telling people - hey, that parameter must be a constant. Period. |
Un-assigning myself |
This and #9989 are basically the same, they could probably be merged. The rough requirement is that we support transforming a |
In the repro code below
sizeof(TChar)
is known during JIT, still code is suboptimal in disasm - some helper (?) method is invoked instead of instruction and SIMD vector value is placed on stack due to this.This issue is quite notable with shifts and typecasting during concrete value calculation, but to my memory there were similar cases with other instructions.
Repro code:
Disasm:
--- ...\Program2.cs ---
Vector128 v = Sse2.SetZeroVector128();
000007FE75244A20 sub rsp,58h
000007FE75244A24 xor eax,eax
000007FE75244A26 mov qword ptr [rsp+40h],rax
000007FE75244A2B mov qword ptr [rsp+48h],rax
000007FE75244A30 pxor xmm0,xmm0
000007FE75244A34 movaps xmmword ptr [rsp+40h],xmm0
000007FE75244A39 movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A3E movaps xmmword ptr [rsp+30h],xmm0
000007FE75244A43 lea rcx,[rsp+40h]
000007FE75244A48 lea rdx,[rsp+20h]
000007FE75244A4D mov r8,qword ptr [rsp+30h]
000007FE75244A52 mov qword ptr [rdx],r8
000007FE75244A55 mov r8,qword ptr [rsp+38h]
000007FE75244A5A mov qword ptr [rdx+8],r8
000007FE75244A5E lea rdx,[rsp+20h]
000007FE75244A63 mov r8d,1
000007FE75244A69 call 000007FE752421A0 <=================
000007FE75244A6E movaps xmm0,xmmword ptr [rsp+40h]
000007FE75244A73 pextrb ecx,xmm0,0
000007FE75244A79 call 000007FE75241AC8
000007FE75244A7E nop
000007FE75244A7F add rsp,58h
000007FE75244A83 ret
--- No source file -------------------------------------------------------------
category:cq
theme:hardware-intrinsics
skill-level:expert
cost:large
The text was updated successfully, but these errors were encountered: