
[AArch64] Reduce fuse-literals limit on Apple subtargets #106741

Draft: citymarina wants to merge 5 commits into main

Conversation

citymarina (Contributor)

On Apple targets it is faster to load from a constant pool than to build
the value with a 4-move sequence (mov+movk+movk+movk+fmov).

It may also be profitable to do this for 3-move sequences, but we are
less confident about that case and leave it as a possible future refinement.

I am not familiar with the behaviour of other AArch64 processors, so I am
making this change subtarget-specific.

The worst possible case for a double literal is the five-instruction sequence:

```
  mov ...
  movk ..., lsl #16
  movk ..., lsl #32
  movk ..., lsl #48
  fmov ...
```
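
By comparison, the constant-pool form that this change prefers is just two
instructions plus a literal-pool entry. A sketch of the Darwin form (the
label is illustrative; see the new test below):

```
  adrp x8, lCPI0_0@PAGE
  ldr  d0, [x8, lCPI0_0@PAGEOFF]
```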

The limit of 5 in the code gives the impression that `Insn` includes
all instructions, including the trailing `fmov`, but that is not true:
it only counts the integer moves, so `Insn.size()` is at most 4 for an
FP immediate. This led me astray on some other work in this area.

This is preparation for an upcoming change to the threshold on Apple
targets for loading FP literals from a constant pool versus building
them with integer moves.

The new test file is based on literal_pools_float.ll. I tried to bolt
the cases onto the existing test, but it got messy because that file
already tests a matrix of combinations, so I created this new file
instead.

@llvmbot (Member) commented Aug 30, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Marina Taylor (citymarina)

Changes


Full diff: https://github.com/llvm/llvm-project/pull/106741.diff

4 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+9-4)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.cpp (+1)
  • (modified) llvm/lib/Target/AArch64/AArch64Subtarget.h (+3)
  • (added) llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll (+128)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 02390e0a85c0a5..bf3ca8d003cf91 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -11458,12 +11458,17 @@ bool AArch64TargetLowering::isFPImmLegal(const APFloat &Imm, EVT VT,
   if (!IsLegal && (VT == MVT::f64 || VT == MVT::f32)) {
     // The cost is actually exactly the same for mov+fmov vs. adrp+ldr;
     // however the mov+fmov sequence is always better because of the reduced
-    // cache pressure. The timings are still the same if you consider
-    // movw+movk+fmov vs. adrp+ldr (it's one instruction longer, but the
-    // movw+movk is fused). So we limit up to 2 instrdduction at most.
+    // cache pressure. Where targets allow, longer sequences may be possible.
+    // For example, movw+movk+fmov may be comparable to adrp+ldr if the
+    // movw+movk is fused.
     SmallVector<AArch64_IMM::ImmInsnModel, 4> Insn;
     AArch64_IMM::expandMOVImm(ImmInt.getZExtValue(), VT.getSizeInBits(), Insn);
-    unsigned Limit = (OptForSize ? 1 : (Subtarget->hasFuseLiterals() ? 5 : 2));
+    assert(Insn.size() <= 4 &&
+           "Should be able to build any value with at most 4 moves");
+    unsigned Limit = (OptForSize ? 1
+                                 : (Subtarget->hasFuseLiterals()
+                                        ? Subtarget->getFuseLiteralsLimit()
+                                        : 2));
     IsLegal = Insn.size() <= Limit;
   }
 
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
index 32db1e8c2477a8..ad4d5c23a23678 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -188,6 +188,7 @@ void AArch64Subtarget::initializeProperties(bool HasMinSize) {
     PrefetchDistance = 280;
     MinPrefetchStride = 2048;
     MaxPrefetchIterationsAhead = 3;
+    FuseLiteralsLimit = 3;
     switch (ARMProcFamily) {
     case AppleA14:
     case AppleA15:
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index accfb49c6fbe3a..23d7ca263eb399 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -70,6 +70,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   unsigned MaxBytesForLoopAlignment = 0;
   unsigned MinimumJumpTableEntries = 4;
   unsigned MaxJumpTableSize = 0;
+  unsigned FuseLiteralsLimit = 4;
 
   // ReserveXRegister[i] - X#i is not available as a general purpose register.
   BitVector ReserveXRegister;
@@ -254,6 +255,8 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
     return MinimumJumpTableEntries;
   }
 
+  unsigned getFuseLiteralsLimit() const { return FuseLiteralsLimit; }
+
   /// CPU has TBI (top byte of addresses is ignored during HW address
   /// translation) and OS enables it.
   bool supportsAddressTopByteIgnored() const;
diff --git a/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll b/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll
new file mode 100644
index 00000000000000..8b755d9e35c074
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/literal_pools_float_apple.ll
@@ -0,0 +1,128 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -verify-machineinstrs -mtriple=aarch64-none-linux-gnu < %s | FileCheck %s
+; RUN: llc -verify-machineinstrs -mtriple=arm64-apple-macosx -mcpu=apple-m1 < %s | FileCheck %s --check-prefix=APPLE
+
+define dso_local float @float_0mov() {
+; CHECK-LABEL: float_0mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov s0, #1.00000000
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_0mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    fmov s0, #1.00000000
+; APPLE-NEXT:    ret
+  ret float 1.0
+}
+
+define dso_local float @float_1mov() {
+; CHECK-LABEL: float_1mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #2143289344 // =0x7fc00000
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_1mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov w8, #2143289344 ; =0x7fc00000
+; APPLE-NEXT:    fmov s0, w8
+; APPLE-NEXT:    ret
+  ret float 0x7FF8000000000000
+}
+
+define dso_local float @float_2mov() {
+; CHECK-LABEL: float_2mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov w8, #34952 // =0x8888
+; CHECK-NEXT:    movk w8, #32704, lsl #16
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: float_2mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov w8, #34952 ; =0x8888
+; APPLE-NEXT:    movk w8, #32704, lsl #16
+; APPLE-NEXT:    fmov s0, w8
+; APPLE-NEXT:    ret
+  ret float 0x7FF8111100000000
+}
+
+define dso_local double @double_0mov() {
+; CHECK-LABEL: double_0mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fmov d0, #1.00000000
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_0mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    fmov d0, #1.00000000
+; APPLE-NEXT:    ret
+  ret double 1.0
+}
+
+define dso_local double @double_1mov() {
+; CHECK-LABEL: double_1mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, #4096 // =0x1000
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_1mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x1000
+}
+
+define dso_local double @double_2mov() {
+; CHECK-LABEL: double_2mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x8, #4096 // =0x1000
+; CHECK-NEXT:    movk x8, #8192, lsl #16
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_2mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    movk x8, #8192, lsl #16
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x20001000
+}
+
+define dso_local double @double_3mov() {
+; CHECK-LABEL: double_3mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    adrp x8, .LCPI6_0
+; CHECK-NEXT:    ldr d0, [x8, :lo12:.LCPI6_0]
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_3mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:    mov x8, #4096 ; =0x1000
+; APPLE-NEXT:    movk x8, #8192, lsl #16
+; APPLE-NEXT:    movk x8, #12288, lsl #32
+; APPLE-NEXT:    fmov d0, x8
+; APPLE-NEXT:    ret
+  ret double 0x300020001000
+}
+
+define dso_local double @double_4mov() {
+; CHECK-LABEL: double_4mov:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    adrp x8, .LCPI7_0
+; CHECK-NEXT:    ldr d0, [x8, :lo12:.LCPI7_0]
+; CHECK-NEXT:    ret
+;
+; APPLE-LABEL: double_4mov:
+; APPLE:       ; %bb.0:
+; APPLE-NEXT:  Lloh0:
+; APPLE-NEXT:    adrp x8, lCPI7_0@PAGE
+; APPLE-NEXT:  Lloh1:
+; APPLE-NEXT:    ldr d0, [x8, lCPI7_0@PAGEOFF]
+; APPLE-NEXT:    ret
+; APPLE-NEXT:    .loh AdrpLdr Lloh0, Lloh1
+  ret double 0x4000300020001000
+}

citymarina (Contributor, Author)

This builds on #106716 and #106731; I'll tidy this up after those land.

efriedma-quic (Collaborator)

The primary meaning of FeatureFuseLiterals is that it's profitable to schedule mov+"movk lsl 16", or "movk lsl 32"+"movk lsl 48" adjacent to each other. This is a feature of Cortex-A57 and derived cores, but I've never heard of any other core implementing it. Is that actually relevant to Apple cores? If it isn't, we should drop the flag from the Apple cores, then go from there.

The only reason FeatureFuseLiterals is relevant here is that it's getting used as a proxy for "literals are cheap"... since on an A57, it only takes 2 cycles to load a 64-bit constant using fused moves.
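
For illustration, with that kind of fusion the four integer moves issue as
two fused pairs, which is where the "2 cycles for a 64-bit constant" comes
from (values and register are placeholders):

```
  mov  x8, #0x1000             // pairs with the following movk
  movk x8, #0x2000, lsl #16
  movk x8, #0x3000, lsl #32    // pairs with the following movk
  movk x8, #0x4000, lsl #48
```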

citymarina (Contributor, Author)

@efriedma-quic Thanks for the feedback. Let me get back to you - apologies for the delay.

citymarina marked this pull request as draft on September 20, 2024.