[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

fiigii · 2018-07-25T20:18:28Z

Related to https://github.com/dotnet/coreclr/issues/19116

The C# source code below is from dotnet/coreclr#18839

internal class VectorPacket256
{
    public Vector256<float> Xs;
    public Vector256<float> Ys;
    public Vector256<float> Zs;
    public Vector256<float> Lengths
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Sqrt(DotProduct(this, this));
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<float> DotProduct(VectorPacket256 left, VectorPacket256 right)
    {
        var x2 = Multiply(left.Xs, right.Xs);
        var y2 = Multiply(left.Ys, right.Ys);
        var z2 = Multiply(left.Zs, right.Zs);
        return Add(Add(x2, y2), z2);
    }
    ...
}

get_Lengths will be compiled to

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymmword ptr [rax+0x8]
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymmword ptr [rax+0x28]
vmovupd ymm2, ymmword ptr [rax+0x48]
mov qword ptr [rbp-0x370], rax
vmulps ymm2, ymm2, ymmword ptr [rax+0x48]
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

Multiply(this.Xs, this.Xs) generates redundant memory access instructions for Vector256<float> field access. Actually, the codgen could be

vmovupd ymm0, ymmword ptr [rax+0x8]
vmulps ymm0, ymm0, ymm0
vmovupd ymm1, ymmword ptr [rax+0x28]
vmulps ymm1, ymm1, ymm1
vmovupd ymm2, ymmword ptr [rax+0x48]
vmulps ymm2, ymm2, ymm2
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vsqrtps ymm0, ymm0

This redundant could be eliminated by escape analysis and unwarping VectorPacket256.

A worse example with Vector256<float> field assignments. In the code below, difColor is a local VectorPacket256 object (but dif is returned by another function)

                difColor.Xs = BlendVariable(difColor.Xs, dif.Xs, thingMask);
                difColor.Ys = BlendVariable(difColor.Ys, dif.Ys, thingMask);
                difColor.Zs = BlendVariable(difColor.Zs, dif.Zs, thingMask);

that is compiled to

mov r8, qword ptr [rsp+0x70]
vmovupd ymm0, ymmword ptr [r8+0x8]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x8], ymm11
vmovupd ymmword ptr [r8+0x8], ymm0
vmovupd ymm0, ymmword ptr [r8+0x28]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x28], ymm11
vmovupd ymmword ptr [r8+0x28], ymm0
vmovupd ymm0, ymmword ptr [r8+0x48]
vblendvps ymm0, ymm0, ymmword ptr [rax+0x48], ymm11
mov qword ptr [rsp+0x70], r8
vmovupd ymmword ptr [r8+0x48], ymm0

Ideally, this could be optimized to

vblendvps ymm1, ymm1, ymmword ptr [rax+0x8], ymm11
vblendvps ymm2, ymm2, ymmword ptr [rax+0x28], ymm11
vblendvps ymm3, ymm3, ymmword ptr [rax+0x48], ymm11

category:cq
theme:vector-codegen
skill-level:expert
cost:large

The text was updated successfully, but these errors were encountered:

fiigii · 2018-07-25T20:19:10Z

cc @CarolEidt @AndyAyersMS @erozenfeld @tannergooding

AndyAyersMS · 2024-06-25T16:46:49Z

Codegen for get_Lengths -- redundant loads are gone:

; Method VectorPacket256:get_Lengths():System.Runtime.Intrinsics.Vector256`1[float]:this (FullOpts)
G_M26293_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00

G_M26293_IG02:  ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vmovups  ymm2, ymmword ptr [rcx+0x20]
       vmovaps  ymm3, ymm2
       vmovups  ymm4, ymmword ptr [rcx+0x40]
       vmovaps  ymm5, ymm4
       vmulps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm3, ymm2
       vaddps   ymm0, ymm1, ymm0
       vmulps   ymm1, ymm5, ymm4
       vaddps   ymm0, ymm1, ymm0
       vsqrtps  ymm0, ymm0
       vmovups  ymmword ptr [rdx], ymm0
       mov      rax, rdx
						;; size=57 bbWeight=1 PerfScore 44.00

G_M26293_IG03:  ;; offset=0x0039
       vzeroupper 
       ret      
						;; size=4 bbWeight=1 PerfScore 2.00
; Total bytes of code: 61

However we've still got a few unnecessary moves.

AndyAyersMS · 2024-06-25T16:54:20Z

Not sure where the second example (vblendvps) is in the code.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

fiigii commented Jul 25, 2018 •

edited by BruceForstall

Loading

fiigii commented Jul 25, 2018

AndyAyersMS commented Jun 25, 2024

AndyAyersMS commented Jun 25, 2024

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

[RyuJIT] lack of escape analysis makes redundant memory-access of Vector128/256<T> fields #10767

Comments

fiigii commented Jul 25, 2018 • edited by BruceForstall Loading

fiigii commented Jul 25, 2018

AndyAyersMS commented Jun 25, 2024

AndyAyersMS commented Jun 25, 2024

fiigii commented Jul 25, 2018 •

edited by BruceForstall

Loading