[llvm] Question: What is the correct interpretation of LaneBitmask? (PR #109797)

Sander de Smalen via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 24 06:07:37 PDT 2024


https://github.com/sdesmalen-arm created https://github.com/llvm/llvm-project/pull/109797

While enabling sub-register liveness for AArch64, we ran into a register allocator issue, which leads us to ask: what is the correct interpretation of the bits in LaneBitmask?

(1) The documentation of LaneBitmask says that if A == B, then A overlaps with B, but it guarantees nothing about A fully overlapping with B.
(2) The code pointed out in this PR assumes that if A == B, then A fully overlaps with B (see the sketch just below).
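
As a minimal sketch contrasting the two readings (my own illustration, not code from the PR), expressed only with operations that LaneBitmask provides:
```
#include "llvm/MC/LaneBitmask.h"
using llvm::LaneBitmask;

// Reading (1): equal masks only guarantee that the two indices overlap somewhere.
bool overlaps(LaneBitmask A, LaneBitmask B) { return (A & B).any(); }

// Reading (2): the early-exit in getCoveringSubRegIndexes treats mask equality
// as if the sub-register covered every lane (and every bit) that is needed.
bool coversAllLanes(LaneBitmask SubRegMask, LaneBitmask LaneMask) {
  return (LaneMask & ~SubRegMask).none();
}
```
When the masks are literally equal, both predicates trivially hold, so lane-mask arithmetic alone cannot express "fully overlaps in bits"; that is the crux of the question.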

AArch64 defines 16-bit H0 as the super-register of 8-bit B0, e.g.
```
  def B0    : AArch64Reg<0,   "b0">, DwarfRegNum<[64]>;
  ... 

  let SubRegIndices = [bsub] in {
    def H0    : AArch64Reg<0,   "h0", [B0]>, DwarfRegAlias<B0>;
    ..
  }
```
It doesn't define the top bits of H0 as an artificial register (e.g. B0_HI).

In the reproducer the register allocator has to insert a spill. Because 'bsub' (8-bit), 'hsub' (16-bit), 'ssub' (32-bit), etc. all have the same lane mask, it picks the register class for the COPY instruction using the first subreg index that matches. This results in a copy of a 32-bit value rather than the desired full 128-bit value.
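
To make this concrete, here is a small sketch (the 0x1 mask value is hypothetical and chosen purely for illustration; the real AArch64 lane masks may differ):
```
#include <cassert>
#include "llvm/MC/LaneBitmask.h"
using llvm::LaneBitmask;

int main() {
  // With B0 as the only leaf sub-register under H0/S0/Q0, the indices bsub,
  // hsub and ssub can all end up with the same single-lane mask.
  LaneBitmask BsubMask(0x1), HsubMask(0x1), SsubMask(0x1);
  assert(BsubMask == HsubMask && HsubMask == SsubMask);
  // A 'SubRegMask == LaneMask' test therefore cannot tell the 8-, 16- and
  // 32-bit indices apart; only extra information such as getSubRegIdxSize(Idx)
  // distinguishes them, which is what the patch below experiments with.
  return 0;
}
```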

If H0 *were* defined as B0 plus B0_HI (and similarly for all other registers), then the following condition would evaluate to true only for the largest subreg index:

```
  // Early exit if we found a perfect match.
  if (SubRegMask == LaneMask) {
    BestIdx = Idx;
    break;
  }
```

There are several ways to fix this issue:
* If statement (2) is correct, then we could define the top bits with an artificial B0_HI register. It would also help if TableGen emitted some extra information that LLVM could assert on (see the sketch after this list), so that enabling sub-register liveness doesn't result in spurious errors.
* If statement (1) is correct, however, then the code pointed out in this PR simply contains a bug that we should fix.
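
For the first option, the TableGen-emitted information could feed a check along the lines of the following sketch. This is a hypothetical helper of my own (verifyLaneMaskWidths is not an existing API); it merely asserts that sub-register indices with identical lane masks also have identical bit widths, which is exactly the property the early-exit code relies on:
```
#include <cassert>
#include "llvm/CodeGen/TargetRegisterInfo.h"

// Hypothetical sanity check: if two sub-register indices have the same lane
// mask but different sizes (as bsub/hsub/ssub do today), the full-overlap
// assumption behind 'SubRegMask == LaneMask' is violated, and we would rather
// fail loudly than silently pick the wrong index.
static void verifyLaneMaskWidths(const llvm::TargetRegisterInfo &TRI) {
  for (unsigned I = 1, E = TRI.getNumSubRegIndices(); I < E; ++I)
    for (unsigned J = I + 1; J < E; ++J)
      if (TRI.getSubRegIndexLaneMask(I) == TRI.getSubRegIndexLaneMask(J))
        assert(TRI.getSubRegIdxSize(I) == TRI.getSubRegIdxSize(J) &&
               "sub-register indices with equal lane masks differ in size");
}
```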

Can someone with more knowledge of the register allocator provide guidance? Thank you!

From 23581e381e2e357cff8684173edc3c27f980b1f9 Mon Sep 17 00:00:00 2001
From: Sander de Smalen <sander.desmalen at arm.com>
Date: Tue, 24 Sep 2024 13:26:25 +0100
Subject: [PATCH 1/2] Add description and test (with sub-reg liveness enabled)

---
 llvm/lib/CodeGen/TargetRegisterInfo.cpp       | 42 ++++++++++-
 llvm/lib/Target/AArch64/AArch64Subtarget.h    |  1 +
 .../AArch64/regalloc-subregidx-reproducer.ll  | 75 +++++++++++++++++++
 3 files changed, 116 insertions(+), 2 deletions(-)
 create mode 100644 llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll

diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index ac9a3d6f0d1a60..910a7b6bc4a0ee 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -521,26 +521,62 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
   SmallVector<unsigned, 8> PossibleIndexes;
   unsigned BestIdx = 0;
   unsigned BestCover = 0;
-
+  unsigned BestSize = 0;
   for (unsigned Idx = 1, E = getNumSubRegIndices(); Idx < E; ++Idx) {
     // Is this index even compatible with the given class?
     if (getSubClassWithSubReg(RC, Idx) != RC)
       continue;
     LaneBitmask SubRegMask = getSubRegIndexLaneMask(Idx);
+
+    // The code below (now disabled) is not correct when the lane mask does not
+    // cover the full sub-register. For example, when 16-bit H0 has only one
+    // subreg B0 of 8-bits, and the high lanes are not defined in TableGen (e.g.
+    // by defining B0 and some artificial B0_HI register), then this function
+    // may return either `bsub` or `hsub`, whereas in this case we'd want it to
+    // return the *largest* (in bitwidth) sub-register index. It is better to
+    // keep looking for the biggest one, but this is rather expensive.
+    //
+    // Before thinking of how to solve this, I first want to ask the
+    // question: What is the meaning of 'LaneBitmask'?
+    //
+    // According to LaneBitmask.h:
+    //
+    //  "The individual bits in a lane mask can't be assigned
+    //   any specific meaning. They can be used to check if two
+    //   sub-register indices overlap."
+    //
+    // However, in the code here the assumption is made that if
+    // 'SubRegMask == LaneMask', then the subregister must overlap *fully*.
+    // This is not the case for the way AArch64 registers are defined
+    // at the moment.
+    //
+    // Which interpretation is correct?
+    //
+    // A) If the meaning is 'if the bits are equal, the sub-registers overlap but
+    //    not necessarily fully', then we should fix the code in this function (in
+    //    a better way than just disabling it).
+    //
+    // B) If the meaning is 'if the bits are equal, the sub-registers overlap
+    //    fully', then we can define the high bits with an artificial register.
+
+#if 1
     // Early exit if we found a perfect match.
     if (SubRegMask == LaneMask) {
       BestIdx = Idx;
       break;
     }
+#endif
 
     // The index must not cover any lanes outside \p LaneMask.
     if ((SubRegMask & ~LaneMask).any())
       continue;
 
     unsigned PopCount = SubRegMask.getNumLanes();
+    unsigned Size = getSubRegIdxSize(Idx);
     PossibleIndexes.push_back(Idx);
-    if (PopCount > BestCover) {
+    if ((!BestCover || PopCount >= BestCover) && Size >= BestSize) {
       BestCover = PopCount;
+      BestSize = Size;
       BestIdx = Idx;
     }
   }
@@ -559,6 +595,8 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
     int BestCover = std::numeric_limits<int>::min();
     for (unsigned Idx : PossibleIndexes) {
       LaneBitmask SubRegMask = getSubRegIndexLaneMask(Idx);
+      // FIXME: similar issue as above.
+
       // Early exit if we found a perfect match.
       if (SubRegMask == LanesLeft) {
         BestIdx = Idx;
diff --git a/llvm/lib/Target/AArch64/AArch64Subtarget.h b/llvm/lib/Target/AArch64/AArch64Subtarget.h
index accfb49c6fbe3a..338afb0cf1962a 100644
--- a/llvm/lib/Target/AArch64/AArch64Subtarget.h
+++ b/llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -152,6 +152,7 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   const Triple &getTargetTriple() const { return TargetTriple; }
   bool enableMachineScheduler() const override { return true; }
   bool enablePostRAScheduler() const override { return usePostRAScheduler(); }
+  bool enableSubRegLiveness() const override { return true; }
 
   bool enableMachinePipeliner() const override;
   bool useDFAforSMS() const override { return false; }
diff --git a/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
new file mode 100644
index 00000000000000..7370b7fc0c42c6
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
@@ -0,0 +1,75 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s | FileCheck %s
+
+target triple = "aarch64"
+
+define void @reproducer(ptr %ptr, ptr %ptr2, <8 x i32> %vec.arg) {
+; CHECK-LABEL: reproducer:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    sub sp, sp, #112
+; CHECK-NEXT:    stp x29, x30, [sp, #80] // 16-byte Folded Spill
+; CHECK-NEXT:    stp x20, x19, [sp, #96] // 16-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 112
+; CHECK-NEXT:    .cfi_offset w19, -8
+; CHECK-NEXT:    .cfi_offset w20, -16
+; CHECK-NEXT:    .cfi_offset w30, -24
+; CHECK-NEXT:    .cfi_offset w29, -32
+; CHECK-NEXT:    mov x8, xzr
+; CHECK-NEXT:    stp q0, q1, [sp, #16] // 32-byte Folded Spill
+; CHECK-NEXT:    mov x19, x1
+; CHECK-NEXT:    ld2 { v0.4s, v1.4s }, [x8]
+; CHECK-NEXT:    add x8, sp, #48
+; CHECK-NEXT:    mov x20, x0
+; CHECK-NEXT:    st1 { v0.2d, v1.2d }, [x8] // 32-byte Folded Spill
+; CHECK-NEXT:    bl bar
+; CHECK-NEXT:    ldp q1, q0, [sp, #16] // 32-byte Folded Reload
+; CHECK-NEXT:    mov w8, #1 // =0x1
+; CHECK-NEXT:    uzp2 v0.4s, v1.4s, v0.4s
+; CHECK-NEXT:    dup v1.2d, x8
+; CHECK-NEXT:    add x8, sp, #48
+; CHECK-NEXT:    ld1 { v5.2d, v6.2d }, [x8] // 32-byte Folded Reload
+; CHECK-NEXT:    ushll2 v2.2d, v6.4s, #1
+; CHECK-NEXT:    ushll v3.2d, v6.2s, #1
+; CHECK-NEXT:    fmov s6, s5                     // <---- this is wrong, it should copy the full 128-bit vector, rather than 32-bit subreg
+; CHECK-NEXT:    ushll v4.2d, v5.2s, #0
+; CHECK-NEXT:    ushll2 v5.2d, v0.4s, #1
+; CHECK-NEXT:    ushll v0.2d, v0.2s, #1
+; CHECK-NEXT:    ushll2 v6.2d, v6.4s, #0
+; CHECK-NEXT:    orr v3.16b, v3.16b, v4.16b
+; CHECK-NEXT:    orr v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:    orr v2.16b, v2.16b, v6.16b
+; CHECK-NEXT:    stp q0, q3, [sp, #32] // 32-byte Folded Spill
+; CHECK-NEXT:    orr v0.16b, v5.16b, v1.16b
+; CHECK-NEXT:    stp q0, q2, [sp] // 32-byte Folded Spill
+; CHECK-NEXT:    bl bar
+; CHECK-NEXT:    ldr q1, [sp, #16] // 16-byte Folded Reload
+; CHECK-NEXT:    ldr q0, [sp, #48] // 16-byte Folded Reload
+; CHECK-NEXT:    ldp x29, x30, [sp, #80] // 16-byte Folded Reload
+; CHECK-NEXT:    stp q0, q1, [x20]
+; CHECK-NEXT:    ldr q1, [sp] // 16-byte Folded Reload
+; CHECK-NEXT:    ldr q0, [sp, #32] // 16-byte Folded Reload
+; CHECK-NEXT:    stp q0, q1, [x19]
+; CHECK-NEXT:    ldp x20, x19, [sp, #96] // 16-byte Folded Reload
+; CHECK-NEXT:    add sp, sp, #112
+; CHECK-NEXT:    ret
+entry:
+  %wide.vec = load <8 x i32>, ptr null, align 4
+  call void @bar()
+  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  %strided.vec2 = shufflevector <8 x i32> %wide.vec, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %strided.vec3 = shufflevector <8 x i32> %vec.arg, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %1 = zext <4 x i32> %strided.vec2 to <4 x i64>
+  %2 = zext <4 x i32> %strided.vec3 to <4 x i64>
+  %3 = shl <4 x i64> %1, <i64 1, i64 1, i64 1, i64 1>
+  %4 = shl <4 x i64> %2, <i64 1, i64 1, i64 1, i64 1>
+  %5 = zext <4 x i32> %strided.vec to <4 x i64>
+  %6 = or <4 x i64> %3, %5
+  %7 = or <4 x i64> %4, <i64 1, i64 1, i64 1, i64 1>
+  call void @bar()
+  store <4 x i64> %6, ptr %ptr, align 8
+  store <4 x i64> %7, ptr %ptr2, align 8
+  ret void
+}
+
+declare void @bar()
+

From 0f1ed85616670fe321918f199a9e9baedc50d578 Mon Sep 17 00:00:00 2001
From: Sander de Smalen <sander.desmalen at arm.com>
Date: Tue, 24 Sep 2024 13:27:16 +0100
Subject: [PATCH 2/2] Comment out code that assumes 'A == B' means 'A fully
 overlaps with B'.

---
 llvm/lib/CodeGen/TargetRegisterInfo.cpp                    | 2 +-
 llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index 910a7b6bc4a0ee..e62cdf7765c4f1 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -559,7 +559,7 @@ bool TargetRegisterInfo::getCoveringSubRegIndexes(
     // B) If the meaning is 'if the bits are equal, the sub-registers overlap
     //    fully', then we can define the high bits with an artificial register.
 
-#if 1
+#if 0
     // Early exit if we found a perfect match.
     if (SubRegMask == LaneMask) {
       BestIdx = Idx;
diff --git a/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
index 7370b7fc0c42c6..c7e96a68779562 100644
--- a/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
+++ b/llvm/test/CodeGen/AArch64/regalloc-subregidx-reproducer.ll
@@ -30,12 +30,12 @@ define void @reproducer(ptr %ptr, ptr %ptr2, <8 x i32> %vec.arg) {
 ; CHECK-NEXT:    ld1 { v5.2d, v6.2d }, [x8] // 32-byte Folded Reload
 ; CHECK-NEXT:    ushll2 v2.2d, v6.4s, #1
 ; CHECK-NEXT:    ushll v3.2d, v6.2s, #1
-; CHECK-NEXT:    fmov s6, s5                     // <---- this is wrong, it should copy the full 128-bit vector, rather than 32-bit subreg
 ; CHECK-NEXT:    ushll v4.2d, v5.2s, #0
+; CHECK-NEXT:    mov v6.16b, v5.16b
 ; CHECK-NEXT:    ushll2 v5.2d, v0.4s, #1
 ; CHECK-NEXT:    ushll v0.2d, v0.2s, #1
-; CHECK-NEXT:    ushll2 v6.2d, v6.4s, #0
 ; CHECK-NEXT:    orr v3.16b, v3.16b, v4.16b
+; CHECK-NEXT:    ushll2 v6.2d, v6.4s, #0
 ; CHECK-NEXT:    orr v0.16b, v0.16b, v1.16b
 ; CHECK-NEXT:    orr v2.16b, v2.16b, v6.16b
 ; CHECK-NEXT:    stp q0, q3, [sp, #32] // 32-byte Folded Spill


