[llvm-branch-commits] [llvm] [InlineSpiller][AMDGPU] Implement subreg reload during RA spill (PR #175002)

Christudasan Devadasan via llvm-branch-commits llvm-branch-commits at lists.llvm.org
Thu Jan 29 00:49:38 PST 2026


================
@@ -1248,18 +1249,62 @@ void InlineSpiller::spillAroundUses(Register Reg) {
 
     // Create a new virtual register for spill/fill.
     // FIXME: Infer regclass from instruction alone.
-    Register NewVReg = Edit->createFrom(Reg);
+
+    unsigned SubReg = 0;
+    LaneBitmask CoveringLanes = LaneBitmask::getNone();
+    // If subreg liveness is enabled, identify the subreg use(s) to attempt a
+    // subreg reload. Skip if the instruction also defines the register.
+    // For copy bundles, get the covering lane masks.
+    if (MRI.subRegLivenessEnabled() && !RI.Writes) {
+      for (auto [MI, OpIdx] : Ops) {
+        const MachineOperand &MO = MI->getOperand(OpIdx);
+        assert(MO.isReg() && MO.getReg() == Reg);
+        if (MO.isUse()) {
+          SubReg = MO.getSubReg();
+          if (SubReg)
+            CoveringLanes |= TRI.getSubRegIndexLaneMask(SubReg);
+        }
+      }
+    }
+
+    if (MI.isBundled() && CoveringLanes.any()) {
+      CoveringLanes = LaneBitmask(bit_ceil(CoveringLanes.getAsInteger()) - 1);
+      // Obtain the covering subregister index, including any missing indices
+      // within the identified small range. Although this may be suboptimal due
+      // to gaps in the subregisters that are not part of the copy bundle, it is
+    // beneficial when components outside this range of the original tuple can
+      // be completely skipped from the reload.
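+      // Worked example (hypothetical four-lane mask numbering, not taken
+      // from a real target): uses of sub1 and sub3 give CoveringLanes =
+      // 0b1010; bit_ceil rounds it up to 0b10000, so the final mask 0b1111
+      // covers sub0..sub3 even though sub0 and sub2 are not in the bundle.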
+      SubReg = TRI.getSubRegIdxFromLaneMask(CoveringLanes);
+    }
+
+    // If the target doesn't support subreg reload, fall back to restoring the
+    // full tuple.
+    if (SubReg && !TRI.shouldEnableSubRegReload(SubReg))
+      SubReg = 0;
+
+    const TargetRegisterClass *OrigRC = MRI.getRegClass(Reg);
+    const TargetRegisterClass *NewRC =
+        SubReg ? TRI.getSubRegisterClass(OrigRC, SubReg) : nullptr;
----------------
cdevadas wrote:

Subreg reload brings two advantages.
1. Currently, when a tuple is reloaded, the entire tuple becomes live at the reload point even if only a subset of its components is actually needed. On targets like AMDGPU, this complicates the later expansion of the reload pseudo-instruction into individual reload operations, because the unused or undefined subregisters still appear live; they are often patched with ad hoc fixups, such as inserting implicit-def or implicit operands for the unneeded tuple components, to avoid miscompilations. Subreg reload fixes this broken liveness information for partial uses of spilled tuples: it avoids introducing spurious undef subregs and eliminates the need for such hacky post-RA workarounds (see the sketch below).
2. Trimming down the reloaded registers also improves allocation: instead of the full tuple, RA reloads only the relevant subregs, which reduces register pressure at the reload point.
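
A minimal MIR-flavored sketch of the contrast, assuming a 128-bit VGPR tuple of which only sub1 is used (the value names, stack slot, and elided operand lists are illustrative, not output from this PR):

```
; Full-tuple reload: all four lanes of %full become live here,
; even though only sub1 is used afterwards.
%full:vreg_128 = SI_SPILL_V128_RESTORE %stack.0, ...
%use:vgpr_32 = COPY %full.sub1

; Subreg reload: only the needed lane is restored and live.
%part:vgpr_32 = SI_SPILL_V32_RESTORE %stack.0, ...
%use:vgpr_32 = COPY %part
```

In the second form the other three lanes stay dead across the reload, so the post-RA expansion has nothing to patch up and the allocator only has to find a single 32-bit register.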

It is not clear to me how RA would see `%reload = INSERT_SUBREG undef, ..` (the form you suggested); we might lose the two advantages mentioned above.
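
For comparison, a hypothetical spelled-out form of that suggestion (operands invented for illustration; the suggestion itself elides them):

```
%part:vgpr_32 = SI_SPILL_V32_RESTORE %stack.0, ...
%tmp:vreg_128 = IMPLICIT_DEF
%reload:vreg_128 = INSERT_SUBREG %tmp, %part, %subreg.sub1
```

Whatever the lane liveness works out to, the def of %reload still carries the full vreg_128 class, so a full 128-bit register must still be assigned at that point.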

https://github.com/llvm/llvm-project/pull/175002

