[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering i1 (PR #124299)

Sun Feb 2 09:48:12 PST 2025

================
@@ -219,6 +220,54 @@ bool DivergenceLoweringHelper::lowerTemporalDivergence() {
   return false;
 }
 
+bool DivergenceLoweringHelper::lowerTemporalDivergenceI1() {
+  MachineRegisterInfo::VRegAttrs BoolS1 = {ST->getBoolRC(), LLT::scalar(1)};
+  initializeLaneMaskRegisterAttributes(BoolS1);
+
+  for (auto [Inst, UseInst, Cycle] : MUI->getTemporalDivergenceList()) {
+    Register Reg = Inst->getOperand(0).getReg();
+    if (MRI->getType(Reg) != LLT::scalar(1))
+      continue;
+
+    Register MergedMask = MRI->createVirtualRegister(BoolS1);
+    Register PrevIterMask = MRI->createVirtualRegister(BoolS1);
+
+    MachineBasicBlock *CycleHeaderMBB = Cycle->getHeader();
+    SmallVector<MachineBasicBlock *, 1> ExitingBlocks;
+    Cycle->getExitingBlocks(ExitingBlocks);
+    assert(ExitingBlocks.size() == 1);
+    MachineBasicBlock *CycleExitingMBB = ExitingBlocks[0];
+
+    B.setInsertPt(*CycleHeaderMBB, CycleHeaderMBB->begin());
+    auto CrossIterPHI = B.buildInstr(AMDGPU::PHI).addDef(PrevIterMask);
+
+    // We only care about cycle iterration path - merge Reg with previous
+    // iteration. For other incomings use implicit def.
+    // Predecessors should be CyclePredecessor and CycleExitingMBB.
+    // In older versions of irreducible control flow lowering there could be
+    // cases with more predecessors. To keep this lowering as generic as
+    // possible also handle those cases.
+    for (auto MBB : CycleHeaderMBB->predecessors()) {
+      if (MBB == CycleExitingMBB) {
+        CrossIterPHI.addReg(MergedMask);
+      } else {
+        B.setInsertPt(*MBB, MBB->getFirstTerminator());
+        auto ImplDef = B.buildInstr(AMDGPU::IMPLICIT_DEF, {BoolS1}, {});
+        CrossIterPHI.addReg(ImplDef.getReg(0));
+      }
+      CrossIterPHI.addMBB(MBB);
+    }
+
+    MachineBasicBlock *MBB = Inst->getParent();
+    buildMergeLaneMasks(*MBB, MBB->getFirstTerminator(), {}, MergedMask,
+                        PrevIterMask, Reg);
----------------
nhaehnle wrote:

In this case, it would be better to move the lane merging to directly after `Inst`.

The register pressure calculus is exactly opposite from the non-i1 case:

* In the non-i1 case, shifting the COPY down is good because it reduces the live range of the VGPR while potentially making the live range of the SGPR longer. VGPRs are more expensive than SGPRs, so this is a good trade-off.
* In the i1 case, we'll have a live range for the merged mask extending across the entire cycle anyway. By moving the lane merging closer to Inst, we leave the live range of the merged mask unchanged, but we (most likely) reduce the live range of the i1 value produced by Inst.

https://github.com/llvm/llvm-project/pull/124299