
[AArch64] Reduce fuse-literals limit on Apple subtargets #106741

Draft: citymarina wants to merge 5 commits into main

Conversation

citymarina (Contributor)

On Apple targets it is faster to load from a constant pool than to build
the value with a 4-move sequence (mov+movk+movk+movk+fmov).

It may also be profitable to do this for 3-move sequences, but we are
less confident about that case and leave it as a possible future refinement.

I am not familiar with the behaviour of other AArch64 processors, so I am
making this change subtarget-specific.

The worst possible case for a double literal is the five-instruction sequence:

```
  mov ...
  movk ..., lsl #16
  movk ..., lsl #32
  movk ..., lsl #48
  fmov ...
```
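
By comparison, the constant-pool form that this change prefers is just two
instructions plus a literal-pool entry. A sketch of the Darwin form (the
label is illustrative; see the new test below):

```
  adrp x8, lCPI0_0@PAGE
  ldr  d0, [x8, lCPI0_0@PAGEOFF]
```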

The limit of 5 in the code gives the impression that `Insn` includes
all instructions, including the trailing `fmov`, but that is not true:
it only counts the integer moves, so `Insn.size()` is at most 4 for an
FP immediate. This led me astray on some other work in this area.

This is preparation for an upcoming change to the threshold on Apple
targets for loading FP literals from a constant pool versus building
them with integer moves.

The new test file is based on literal_pools_float.ll. I tried to bolt
the cases onto the existing test, but it got messy because that file
already tests a matrix of combinations, so I created this new file
instead.

@llvmbot (Member) commented Aug 30, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Marina Taylor (citymarina)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/106741.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+9-4)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (added) llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll (+128)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 02390e0a85c0a5..bf3ca8d003cf91 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -11458,12 +11458,17 @@ bool AArch64TargetLowering::isFPImmLegal(const APFloat &Imm, EVT VT,
   if (!IsLegal && (VT == MVT::f64 || VT == MVT::f32)) {
     // The cost is actually exactly the same for mov+fmov vs. adrp+ldr;
     // however the mov+fmov sequence is always better because of the reduced
-    // cache pressure. The timings are still the same if you consider
-    // movw+movk+fmov vs. adrp+ldr (it's one instruction longer, but the
-    // movw+movk is fused). So we limit up to 2 instrdduction at most.
+    // cache pressure. Where targets allow, longer sequences may be possible.
+    // For example, movw+movk+fmov may be comparable to adrp+ldr if the
+    // movw+movk is fused.
     SmallVector<AArch64_IMM::ImmInsnModel, 4> Insn;
     AArch64_IMM::expandMOVImm(ImmInt.getZExtValue(), VT.getSizeInBits(), Insn);
-    unsigned Limit = (OptForSize ? 1 : (Subtarget->hasFuseLiterals() ? 5 : 2));
+    assert(Insn.size() <= 4 &&
+           "Should be able to build any value with at most 4 moves");
+    unsigned Limit = (OptForSize ? 1
+                                 : (Subtarget->hasFuseLiterals()
+                                        ? Subtarget->getFuseLiteralsLimit()
+                                        : 2));
     IsLegal = Insn.size() <= Limit;
   }
 
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 32db1e8c2477a8..ad4d5c23a23678 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -188,6 +188,7 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     PrefetchDistance = 280;
     MinPrefetchStride = 2048;
     MaxPrefetchIterationsAhead = 3;
+    FuseLiteralsLimit = 3;
     switch (ARMProcFamily) {
     case AppleA14:
     case AppleA15:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index accfb49c6fbe3a..23d7ca263eb399 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -70,6 +70,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   unsigned MaxBytesForLoopAlignment = 0;
   unsigned MinimumJumpTableEntries = 4;
   unsigned MaxJumpTableSize = 0;
+  unsigned FuseLiteralsLimit = 4;
 
   // ReserveXRegister[i] - X#i is not available as a general purpose register.
   BitVector ReserveXRegister;
@@ -254,6 +255,8 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
     return MinimumJumpTableEntries;
   }
 
+  unsigned getFuseLiteralsLimit() const { return FuseLiteralsLimit; }
+
   /// CPU has TBI (top byte of addresses is ignored during HW address
   /// translation) and OS enables it.
   bool supportsAddressTopByteIgnored() const;
diff --git a/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll b/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll
new file mode 100644
index 00000000000000..8b755d9e35c074
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll
@@ -0,0 +1,128 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -verify-machineinstrs -mtriple=aarch64-none-linux-gnu < %s | FileCheck %s
+; RUN: llc -verify-machineinstrs -mtriple=arm64-apple-macosx -mcpu=apple-m1 < %s | FileCheck %s --check-prefix=APPLE
+
+define dso_local float @float_0mov() {
+; CHECK-LABEL: float_0mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov s0, #1.00000000
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_0mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    fmov s0, #1.00000000
+; APPLE-NEXT:    ret
+  ret float 1.0
+}
+
+define dso_local float @float_1mov() {
+; CHECK-LABEL: float_1mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #2143289344 // =0x7fc00000
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_1mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov w8, #2143289344 ; =0x7fc00000
+; APPLE-NEXT:    fmov s0, w8
+; APPLE-NEXT:    ret
+  ret float 0x7FF8000000000000
+}
+
+define dso_local float @float_2mov() {
+; CHECK-LABEL: float_2mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #34952 // =0x8888
+; CHECK-NEXT:    movk w8, #32704, lsl #16
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_2mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov w8, #34952 ; =0x8888
+; APPLE-NEXT:    movk w8, #32704, lsl #16
+; APPLE-NEXT:    fmov s0, w8
+; APPLE-NEXT:    ret
+  ret float 0x7FF8111100000000
+}
+
+define dso_local double @double_0mov() {
+; CHECK-LABEL: double_0mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov d0, #1.00000000
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_0mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    fmov d0, #1.00000000
+; APPLE-NEXT:    ret
+  ret double 1.0
+}
+
+define dso_local double @double_1mov() {
+; CHECK-LABEL: double_1mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, #4096 // =0x1000
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_1mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x1000
+}
+
+define dso_local double @double_2mov() {
+; CHECK-LABEL: double_2mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, #4096 // =0x1000
+; CHECK-NEXT:    movk x8, #8192, lsl #16
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_2mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    movk x8, #8192, lsl #16
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x20001000
+}
+
+define dso_local double @double_3mov() {
+; CHECK-LABEL: double_3mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    adrp x8, .LCPI6_0
+; CHECK-NEXT:    ldr d0, [x8, :lo12:.LCPI6_0]
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_3mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    movk x8, #8192, lsl #16
+; APPLE-NEXT:    movk x8, #12288, lsl #32
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x300020001000
+}
+
+define dso_local double @double_4mov() {
+; CHECK-LABEL: double_4mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    adrp x8, .LCPI7_0
+; CHECK-NEXT:    ldr d0, [x8, :lo12:.LCPI7_0]
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_4mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:  Lloh0:
+; APPLE-NEXT:    adrp x8, lCPI7_0@PAGE
+; APPLE-NEXT:  Lloh1:
+; APPLE-NEXT:    ldr d0, [x8, lCPI7_0@PAGEOFF]
+; APPLE-NEXT:    ret
+; APPLE-NEXT:    .loh AdrpLdr Lloh0, Lloh1
+  ret double 0x4000300020001000
+}

citymarina (Contributor, Author)

This builds on #106716 and #106731; I'll tidy this up after those land.

efriedma-quic (Collaborator)

The primary meaning of FeatureFuseLiterals is that it's profitable to schedule mov+"movk lsl 16", or "movk lsl 32"+"movk lsl 48" adjacent to each other. This is a feature of Cortex-A57 and derived cores, but I've never heard of any other core implementing it. Is that actually relevant to Apple cores? If it isn't, we should drop the flag from the Apple cores, then go from there.

The only reason FeatureFuseLiterals is relevant here is that it's getting used as a proxy for "literals are cheap"... since on an A57, it only takes 2 cycles to load a 64-bit constant using fused moves.
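
For illustration, with that kind of fusion the four integer moves issue as
two fused pairs, which is where the "2 cycles for a 64-bit constant" comes
from (values and register are placeholders):

```
  mov  x8, #0x1000             // pairs with the following movk
  movk x8, #0x2000, lsl #16
  movk x8, #0x3000, lsl #32    // pairs with the following movk
  movk x8, #0x4000, lsl #48
```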

citymarina (Contributor, Author)

@efriedma-quic Thanks for the feedback. Let me get back to you - apologies for the delay.

citymarina marked this pull request as draft on September 20, 2024.