Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

Open
fiigii opened this issue Jul 25, 2018 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@fiigii
Copy link
Contributor

fiigii commented Jul 25, 2018

Related to https://github.com/dotnet/coreclr/issues/19116

The C# source code below is from dotnet/coreclr#18839

internal class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
    public Vector256<float> Lengths
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Sqrt(DotProduct(this, this));
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<float> DotProduct(VectorPacket256 left, VectorPacket256 right)
    {
        var x2 = Multiply(left.Xs, right.Xs);
        var y2 = Multiply(left.Ys, right.Ys);
        var z2 = Multiply(left.Zs, right.Zs);
        return Add(Add(x2, y2), z2);
    }
    ...
}

get_Lengths will be compiled to

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]
vmovupd ymm2, ymmword ptr [rax+0x48]
mov qword ptr [rbp-0x370], rax
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

Multiply(this.Xs, this.Xs) generates redundant memory access instructions for Vector256<float> field access. Actually, the codgen could be

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymm0
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymm1
vmovupd ymm2, ymmword ptr [rax+0x48]
vmulps ymm2, ymm2, ymm2
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

This redundant could be eliminated by escape analysis and unwarping VectorPacket256.

A worse example with Vector256<float> field assignments. In the code below, difColor is a local VectorPacket256 object (but dif is returned by another function)

                difColor.Xs = BlendVariable(difColor.Xs, dif.Xs, thingMask);
                difColor.Ys = BlendVariable(difColor.Ys, dif.Ys, thingMask);
                difColor.Zs = BlendVariable(difColor.Zs, dif.Zs, thingMask);

that is compiled to

mov r8, qword ptr [rsp+0x70]
vmovupd ymm0, ymmword ptr [r8+0x8]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x8], ymm11
vmovupd ymmword ptr [r8+0x8], ymm0
vmovupd ymm0, ymmword ptr [r8+0x28]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x28], ymm11
vmovupd ymmword ptr [r8+0x28], ymm0
vmovupd ymm0, ymmword ptr [r8+0x48]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x48], ymm11
mov qword ptr [rsp+0x70], r8
vmovupd ymmword ptr [r8+0x48], ymm0

Ideally, this could be optimized to

vblendvps ymm1, ymm1, ymmword ptr [rax+0x8], ymm11
vblendvps ymm2, ymm2, ymmword ptr [rax+0x28], ymm11
vblendvps ymm3, ymm3, ymmword ptr [rax+0x48], ymm11

category:cq
theme:vector-codegen
skill-level:expert
cost:large

@fiigii
Copy link
Contributor Author

fiigii commented Jul 25, 2018

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020
@AndyAyersMS
Copy link
Member

Codegen for get_Lengths -- redundant loads are gone:

; Method VectorPacket256:get_Lengths():System.Runtime.Intrinsics.Vector256`1[float]:this (FullOpts)
G_M26293_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M26293_IG02:  ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vmovups  ymm2, ymmword ptr [rcx+0x20]
       vmovaps  ymm3, ymm2
       vmovups  ymm4, ymmword ptr [rcx+0x40]
       vmovaps  ymm5, ymm4
       vmulps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm3, ymm2
       vaddps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm5, ymm4
       vaddps   ymm0, ymm1, ymm0
       vsqrtps  ymm0, ymm0
       vmovups  ymmword ptr [rdx], ymm0
       mov      rax, rdx
						;; size=57 bbWeight=1 PerfScore 44.00

G_M26293_IG03:  ;; offset=0x0039
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 61

However we've still got a few unnecessary moves.

@AndyAyersMS
Copy link
Member

Not sure where the second example (vblendvps) is in the code.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

4 participants