-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
port SpanHelpers.IndexOfAny(ref byte, byte, byte, int) to Vector128/256 #73556
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsIt's #73384 again I've tried to repro #73474 locally and I've failed. Now I want to use the CI to verify if the issue is gone or not. ./build.sh --os browser -c Release
make -C src/mono/wasm provision-wasm
export EMSDK_PATH=/home/adam/projects/runtime/src/mono/wasm/emsdk
./dotnet.sh build /t:Test /p:TargetOS=Browser /p:TargetArchitecture=wasm /p:RunAOTCompilation=true /p:EnableAggressiveTrimming=true -c Release ./src/libraries/System.Runtime/tests
|
@jkotas the |
What fixed the underlying issue? The crash was caused by reading beyond end of the buffer. The CI may be green by luck - the memory after the end of the buffer might have been initialized just right to avoid the crash in the given run. @vargaz or @lambdageek should inspect the IL to confirm that the problem is gone. Also, we should find the change that fixed it. |
Ok, #73493 fixed the out of bound read, replaced it by rejecting the whole method with BadImageFormatException. Do we still have a functional issue at runtime? |
System.Memory tests exercise this code path and they are passing on machine with AVX2
|
Do the tests exercise the AOT code of method, or are they falling back to the interpreter silently? If it is the latter, we are likely going to see perf regressions. |
I am using instructions provided by @lambdageek in #73474 (comment) to run these tests /p:RunAOTCompilation=true /p:EnableAggressiveTrimming=true I would expect that they exercise the AOT code of method. How can I check if they don't fall back to interpreter silently? |
I don’t know how to check that. Somebody from the Mono team needs to help you with at that. |
I guess |
I've tried to benchmark WASM AOT but I don't think that it's currently possible: dotnet/BenchmarkDotNet#1818 (comment) |
/azp run runtime-wasm |
Azure Pipelines successfully started running 1 pipeline(s). |
I'm trying the System.Memory tests locally. I would expect any call to IndexOfAny to throw BIFE on wasm AOT. there shouldn't be a fallback to the interpreter, AFAIK. Perhaps we're skipping some of these tests already for other reasons |
Ah interesting. So this PR is good; i'm working, separately, on adding a fallthru check to the interpreter |
Do you agree that this PR is going to introduce perf regression since the method won’t be AOT compiled anymore? |
Yes, I should qualify "good" - this PR wont' crash the AOT compiler or the runtime anymore. The perf won't be good. |
actually dynamically this code is unreachable |
@jkotas @lambdageek @vitek-karas I am using similar patterns in few other PRs and aiming to port all But why? Is it because I am using |
Yes, something in the refactored methods exposed a bug in the IL linker. Native AOT does not have the bug. It is not sharing the relevant part of the linker. |
I'm not sure, but I think I have a workaround: if I change the last else
{
Debug.Assert (Vector.IsHardwareAccelerated); then I can also delete Debug.Fail("Unreachable");
goto NotFound; and then in the IL that comes out of the linker, I don't see a conditional jump at the end of the method anymore. (I'm not sure if this has any undesired impact on CoreCLR performance) |
My guess is that the bug is triggered by certain shape of the method flowgraph. It is not about types used. |
Sorry - I didn't get to this yet (first on my list after emails/management)... Early1:
Early2:
if (something)
{
goto Early1;
}
else
{
goto Early2;
} Basically the method ends with an if/else where both branches go backwards. In that case it's invalid to remove the "false" branch, because there's nothing left in the method - and even though that branch is technically not reachable, it needs to exist for validity. |
Can it still remove it and fold |
I'll have to go through the code - but in general the code in the linker is very simplistic and tries to avoid changing the number of instructions (it's problematic to move code around in that setup) - so it tries to keep the same number of instructions. It's also relatively problematic to fully remove the ldc.0 and so on - if the condition is over a call then removing the entire call is relatively complicated - and we don't do that. Which means we need to "pop" somehow. I'm considering either disabling the optimization in this case or to force insert a dummy ret at the end of the method. |
ldnull + throw may be better. Ret requires exactly zero or one items on the evaluation stack. |
@jkotas @lambdageek @vitek-karas Thanks for the explanation! I've removed all the gotos, similified the code and ensured that there are no regressions for configs that I can benchmark (CLR x64 AVX2 AVX & arm64 AdvSIMD). The perf has not regressed, it's on par. When reading the AVX2 (x64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-FHASLE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-JRHZCS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=9
Code gen before: ; System.SpanHelpers.IndexOfAny(Byte ByRef, Byte, Byte, Int32)
push rsi
vzeroupper
movzx edx,dl
mov eax,edx
movzx r8d,r8b
mov r10d,r8d
xor r11d,r11d
mov esi,r9d
movsxd r9,r9d
add r9,0FFFFFFFFFFFFFFF0
js near ptr M00_L01
mov rsi,r9
jmp near ptr M00_L13
M00_L00:
add rsi,0FFFFFFFFFFFFFFF8
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je near ptr M00_L05
cmp r10d,edx
je near ptr M00_L05
movzx edx,byte ptr [rcx+r11+1]
cmp eax,edx
je near ptr M00_L06
cmp r10d,edx
je near ptr M00_L06
movzx edx,byte ptr [rcx+r11+2]
cmp eax,edx
je near ptr M00_L07
cmp r10d,edx
je near ptr M00_L07
movzx edx,byte ptr [rcx+r11+3]
cmp eax,edx
je near ptr M00_L08
cmp r10d,edx
je near ptr M00_L08
movzx edx,byte ptr [rcx+r11+4]
cmp eax,edx
je near ptr M00_L09
cmp r10d,edx
je near ptr M00_L09
movzx edx,byte ptr [rcx+r11+5]
cmp eax,edx
je near ptr M00_L10
cmp r10d,edx
je near ptr M00_L10
movzx edx,byte ptr [rcx+r11+6]
cmp eax,edx
je near ptr M00_L11
cmp r10d,edx
je near ptr M00_L11
movzx edx,byte ptr [rcx+r11+7]
cmp eax,edx
je near ptr M00_L12
cmp r10d,edx
je near ptr M00_L12
add r11,8
M00_L01:
cmp rsi,8
jae near ptr M00_L00
cmp rsi,4
jb short M00_L02
add rsi,0FFFFFFFFFFFFFFFC
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je short M00_L05
cmp r10d,edx
je short M00_L05
movzx edx,byte ptr [rcx+r11+1]
cmp eax,edx
je short M00_L06
cmp r10d,edx
je short M00_L06
movzx edx,byte ptr [rcx+r11+2]
cmp eax,edx
je short M00_L07
cmp r10d,edx
je short M00_L07
movzx edx,byte ptr [rcx+r11+3]
cmp eax,edx
je short M00_L08
cmp r10d,edx
je short M00_L08
add r11,4
M00_L02:
test rsi,rsi
je short M00_L04
M00_L03:
movzx edx,byte ptr [rcx+r11]
cmp eax,edx
je short M00_L05
cmp r10d,edx
je short M00_L05
inc r11
dec rsi
jne short M00_L03
M00_L04:
mov eax,0FFFFFFFF
vzeroupper
pop rsi
ret
M00_L05:
mov eax,r11d
jmp near ptr M00_L20
M00_L06:
lea eax,[r11+1]
jmp near ptr M00_L20
M00_L07:
lea eax,[r11+2]
jmp near ptr M00_L20
M00_L08:
lea eax,[r11+3]
jmp near ptr M00_L20
M00_L09:
lea eax,[r11+4]
jmp near ptr M00_L20
M00_L10:
lea eax,[r11+5]
jmp near ptr M00_L20
M00_L11:
lea eax,[r11+6]
jmp near ptr M00_L20
M00_L12:
lea eax,[r11+7]
jmp near ptr M00_L20
M00_L13:
cmp rsi,10
jb short M00_L16
vmovd xmm0,edx
vpbroadcastb ymm0,xmm0
vmovd xmm1,r8d
vpbroadcastb ymm1,xmm1
add rsi,0FFFFFFFFFFFFFFF0
je short M00_L15
M00_L14:
vmovupd ymm2,[rcx+r11]
vpcmpeqb ymm3,ymm0,ymm2
vpcmpeqb ymm2,ymm1,ymm2
vpor ymm2,ymm3,ymm2
vpmovmskb eax,ymm2
test eax,eax
jne near ptr M00_L19
add r11,20
cmp rsi,r11
ja short M00_L14
M00_L15:
vmovupd ymm2,[rcx+rsi]
mov r11,rsi
vpcmpeqb ymm0,ymm0,ymm2
vpcmpeqb ymm1,ymm1,ymm2
vpor ymm0,ymm0,ymm1
vpmovmskb eax,ymm0
test eax,eax
jne short M00_L19
jmp near ptr M00_L04
M00_L16:
vmovd xmm0,edx
vpbroadcastb xmm0,xmm0
vmovd xmm1,r8d
vpbroadcastb xmm1,xmm1
test rsi,rsi
je short M00_L18
M00_L17:
vmovupd xmm2,[rcx+r11]
vpcmpeqb xmm3,xmm0,xmm2
vpcmpeqb xmm2,xmm1,xmm2
vpor xmm2,xmm3,xmm2
vpmovmskb eax,xmm2
test eax,eax
jne short M00_L19
add r11,10
cmp rsi,r11
ja short M00_L17
M00_L18:
vmovupd xmm2,[rcx+rsi]
mov r11,rsi
vpcmpeqb xmm0,xmm0,xmm2
vpcmpeqb xmm1,xmm1,xmm2
vpor xmm0,xmm0,xmm1
vpmovmskb eax,xmm0
test eax,eax
je near ptr M00_L04
M00_L19:
tzcnt eax,eax
add r11,rax
jmp near ptr M00_L05
M00_L20:
vzeroupper
pop rsi
ret
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
int 3
sbb [rcx],eax
add [rax],eax
add [rax],esp
; Total bytes of code 663 Codegen after: .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2; System.SpanHelpers.IndexOfAny(Byte ByRef, Byte, Byte, Int32)
vzeroupper
xor r10d,r10d
cmp r9d,20
jl short M00_L02
nop dword ptr [rax]
mov eax,r9d
and eax,0FFFFFFE0
movzx r11d,dl
vmovd xmm0,r11d
vpbroadcastb ymm0,xmm0
movzx r11d,r8b
vmovd xmm1,r11d
vpbroadcastb ymm1,xmm1
M00_L00:
vmovdqu ymm2,ymmword ptr [rcx+r10]
vpcmpeqb ymm3,ymm0,ymm2
vpcmpeqb ymm2,ymm1,ymm2
vpor ymm2,ymm3,ymm2
vptest ymm2,ymm2
jne short M00_L01
add r10,20
cmp r10,rax
jb short M00_L00
jmp short M00_L05
nop dword ptr [rax]
M00_L01:
vpmovmskb r9d,ymm2
xor eax,eax
tzcnt eax,r9d
add eax,r10d
jmp near ptr M00_L09
M00_L02:
cmp r9d,10
jl short M00_L05
mov eax,r9d
and eax,0FFFFFFF0
movzx r11d,dl
vmovd xmm0,r11d
vpbroadcastb xmm0,xmm0
movzx r11d,r8b
vmovd xmm1,r11d
vpbroadcastb xmm1,xmm1
M00_L03:
vmovdqu xmm2,xmmword ptr [rcx+r10]
vpcmpeqb xmm3,xmm0,xmm2
vpcmpeqb xmm2,xmm1,xmm2
vpor xmm2,xmm3,xmm2
vptest xmm2,xmm2
jne short M00_L04
add r10,10
cmp r10,rax
jb short M00_L03
jmp short M00_L05
xchg ax,ax
M00_L04:
vpmovmskb edx,xmm2
xor eax,eax
tzcnt eax,edx
add eax,r10d
jmp short M00_L09
M00_L05:
movzx eax,dl
movzx edx,r8b
mov r8d,r9d
cmp r10,r8
jae short M00_L07
M00_L06:
movzx r9d,byte ptr [rcx+r10]
cmp eax,r9d
je short M00_L08
cmp edx,r9d
je short M00_L08
inc r10
cmp r10,r8
jb short M00_L06
M00_L07:
mov eax,0FFFFFFFF
vzeroupper
ret
M00_L08:
mov eax,r10d
M00_L09:
vzeroupper
ret
add [rcx],bl
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax],al
add [rax-55C021C],bh
jg short M00_L10
M00_L10:
add [rbp+48],dl
(bad)
; Total bytes of code 291 AVX (x64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-YLMNVS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-DFRNLR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0 LaunchCount=9
AdvSIMD(arm64)BenchmarkDotNet=v0.13.1.1845-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22408.5
[Host] : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
Job-NZXMHK : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-LQCIXE : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
LaunchCount=9
|
I think I have a tentative fix for the linker bug (with the gotos - I couldn't find any other way to get to that state 😉 ). I'll clean it up and if possible we should try it on your previous change... anyway, thanks a lot for finding it! |
Hi @adamsitnik, I'm looking at the char equivalent of IndexOfAny. It's half done (blocked by #73777) but can add as a draft PR if you are going to find it useful. |
|
We don't need it anymore due to #73556 |
It's #73384 again
I've tried to repro #73474 locally and I've failed. Now I want to use the CI to verify if the issue is gone or not.