[llvm] [AArch64][SVE] Avoid movprfx by reusing register for _UNDEF pseudos. (PR #166926)

Ricardo Jesus via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 10 10:38:27 PST 2025


================
@@ -1123,24 +1123,83 @@ unsigned AArch64RegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,
   }
 }
 
-// FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register allocation
-// where a consecutive multi-vector tuple is constructed from the same indices
-// of multiple strided loads. This may still result in unnecessary copies
-// between the loads and the tuple. Here we try to return a hint to assign the
-// contiguous ZPRMulReg starting at the same register as the first operand of
-// the pseudo, which should be a subregister of the first strided load.
+// We add regalloc hints for different cases:
+// * Choosing a better destination operand for predicated SVE instructions
+//   where the inactive lanes are undef, by choosing a register that is not
+//   unique to the other operands of the instruction.
 //
-// For example, if the first strided load has been assigned $z16_z20_z24_z28
-// and the operands of the pseudo are each accessing subregister zsub2, we
-// should look through through Order to find a contiguous register which
-// begins with $z24 (i.e. $z24_z25_z26_z27).
+// * Improve register allocation for SME multi-vector instructions where we can
+//   benefit from the strided- and contiguous register multi-vector tuples.
 //
+//   Here FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register
+//   allocation where a consecutive multi-vector tuple is constructed from the
+//   same indices of multiple strided loads. This may still result in
+//   unnecessary copies between the loads and the tuple. Here we try to return a
+//   hint to assign the contiguous ZPRMulReg starting at the same register as
+//   the first operand of the pseudo, which should be a subregister of the first
+//   strided load.
+//
+//   For example, if the first strided load has been assigned $z16_z20_z24_z28
+//   and the operands of the pseudo are each accessing subregister zsub2, we
+//   should look through Order to find a contiguous register which
+//   begins with $z24 (i.e. $z24_z25_z26_z27).
 bool AArch64RegisterInfo::getRegAllocationHints(
     Register VirtReg, ArrayRef<MCPhysReg> Order,
     SmallVectorImpl<MCPhysReg> &Hints, const MachineFunction &MF,
     const VirtRegMap *VRM, const LiveRegMatrix *Matrix) const {
-
   auto &ST = MF.getSubtarget<AArch64Subtarget>();
+  const AArch64InstrInfo *TII =
+      MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+  const MachineRegisterInfo &MRI = MF.getRegInfo();
+
+  // For predicated SVE instructions where the inactive lanes are undef,
+  // pick a destination register that is not unique to avoid introducing
+  // a movprfx.
+  const TargetRegisterClass *RegRC = MRI.getRegClass(VirtReg);
+  if (AArch64::ZPRRegClass.hasSubClassEq(RegRC)) {
+    for (const MachineOperand &DefOp : MRI.def_operands(VirtReg)) {
+      const MachineInstr &Def = *DefOp.getParent();
+      if (DefOp.isImplicit() ||
+          (TII->get(Def.getOpcode()).TSFlags & AArch64::FalseLanesMask) !=
+              AArch64::FalseLanesUndef)
+        continue;
+
+      for (MCPhysReg R : Order) {
+        auto AddHintIfSuitable = [&](MCPhysReg R, const MachineOperand &MO) {
+          // R is a suitable register hint if there exists an operand for the
+          // instruction that is not yet allocated a register or if R matches
+          // one of the other source operands.
+          if (!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R)
+            Hints.push_back(R);
+        };
+
+        unsigned Opcode = AArch64::getSVEPseudoMap(Def.getOpcode());
+        switch (TII->get(Opcode).TSFlags & AArch64::DestructiveInstTypeMask) {
+        default:
+          break;
+        case AArch64::DestructiveTernaryCommWithRev:
+          AddHintIfSuitable(R, Def.getOperand(2));
+          AddHintIfSuitable(R, Def.getOperand(3));
+          AddHintIfSuitable(R, Def.getOperand(4));
+          break;
+        case AArch64::DestructiveBinaryComm:
+        case AArch64::DestructiveBinaryCommWithRev:
+          AddHintIfSuitable(R, Def.getOperand(2));
+          AddHintIfSuitable(R, Def.getOperand(3));
+          break;
+        case AArch64::DestructiveBinary:
+        case AArch64::DestructiveBinaryImm:
+          AddHintIfSuitable(R, Def.getOperand(2));
+          break;
+        }
+      }
+    }
+
+    if (Hints.size())
+      return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints,
+                                                       MF, VRM);
----------------
rj-jesus wrote:

I believe you're right, which is why I expected copy hints to come first. A missed copy hint is likely to lead to a MOV down the line, whereas a missed MOVPRFX hint should only lead to the MOVPRFX itself (which should be cheaper). That would happen in the example below if MachineCP weren't able to rewrite `$z0` with `$z4`. 
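To make the trade-off concrete, here is a sketch of the code difference being discussed, with hypothetical register assignments (the registers and the `fadd` are illustrative, not taken from the example above). For a destructive predicated instruction with undef inactive lanes, a destination that differs from the first source forces a `movprfx`, while reusing the source register avoids it:

```asm
// Destination $z0 does not match source $z1: a movprfx seeds $z0 first.
movprfx z0, z1
fadd    z0.s, p0/m, z0.s, z2.s

// With the hint, the destination reuses $z1 and no movprfx is needed.
fadd    z1.s, p0/m, z1.s, z2.s
```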

For what it's worth, the patch does seem to considerably increase the number of hints for affected pseudos, including adding repeated ones ([example](https://godbolt.org/z/3vPEPjK6o)):
```
selectOrSplit ZPR:%4 [80r,96r:0) 0 at 80r  weight:INF
hints: $z0 $z0 $z0 $z1 $z1 $z1 $z2 $z2 $z2 $z3 $z3 $z3 $z4 $z4 $z4 $z5 $z5 $z5 $z6 $z6 $z6 $z7 $z7 $z7 $z16 $z16 $z16 $z17 $z17 $z17 $z18 $z18 $z18 $z19 $z19 $z19 $z20 $z20 $z20 $z21 $z21 $z21 $z22 $z22 $z22 $z23 $z23 $z23 $z24 $z24 $z24 $z25 $z25 $z25 $z26 $z26 $z26 $z27 $z27 $z27 $z28 $z28 $z28 $z29 $z29 $z29 $z30 $z30 $z30 $z31 $z31 $z31 $z8 $z8 $z8 $z9 $z9 $z9 $z10 $z10 $z10 $z11 $z11 $z11 $z12 $z12 $z12 $z13 $z13 $z13 $z14 $z14 $z14 $z15 $z15 $z15 $z4
assigning %4 to $z0: B0 [80r,96r:0) 0 at 80r B0_HI [80r,96r:0) 0 at 80r H0_HI [80r,96r:0) 0 at 80r S0_HI [80r,96r:0) 0 at 80r D0_HI [80r,96r:0) 0 at 80r Q0_HI [80r,96r:0) 0 at 80r
```
Before the patch:
```
selectOrSplit ZPR:%4 [80r,96r:0) 0 at 80r  weight:INF
hints: $z4
assigning %4 to $z4: B4 [80r,96r:0) 0 at 80r B4_HI [80r,96r:0) 0 at 80r H4_HI [80r,96r:0) 0 at 80r S4_HI [80r,96r:0) 0 at 80r D4_HI [80r,96r:0) 0 at 80r Q4_HI [80r,96r:0) 0 at 80r
```

I'm not sure how this affects the register allocator (or compile time), but since it has already been merged, I suppose we can keep an eye out for any issues. :)

https://github.com/llvm/llvm-project/pull/166926