[llvm-branch-commits] [llvm] AMDGPU/GlobalISel: Temporal divergence lowering i1 (PR #124299)

Sun Feb 2 09:44:18 PST 2025

================
@@ -219,6 +220,54 @@ bool DivergenceLoweringHelper::lowerTemporalDivergence() {
   return false;
 }
 
+bool DivergenceLoweringHelper::lowerTemporalDivergenceI1() {
+  MachineRegisterInfo::VRegAttrs BoolS1 = {ST->getBoolRC(), LLT::scalar(1)};
+  initializeLaneMaskRegisterAttributes(BoolS1);
+
+  for (auto [Inst, UseInst, Cycle] : MUI->getTemporalDivergenceList()) {
+    Register Reg = Inst->getOperand(0).getReg();
+    if (MRI->getType(Reg) != LLT::scalar(1))
+      continue;
+
+    Register MergedMask = MRI->createVirtualRegister(BoolS1);
+    Register PrevIterMask = MRI->createVirtualRegister(BoolS1);
+
+    MachineBasicBlock *CycleHeaderMBB = Cycle->getHeader();
+    SmallVector<MachineBasicBlock *, 1> ExitingBlocks;
+    Cycle->getExitingBlocks(ExitingBlocks);
+    assert(ExitingBlocks.size() == 1);
+    MachineBasicBlock *CycleExitingMBB = ExitingBlocks[0];
+
+    B.setInsertPt(*CycleHeaderMBB, CycleHeaderMBB->begin());
+    auto CrossIterPHI = B.buildInstr(AMDGPU::PHI).addDef(PrevIterMask);
+
+    // We only care about cycle iterration path - merge Reg with previous
+    // iteration. For other incomings use implicit def.
+    // Predecessors should be CyclePredecessor and CycleExitingMBB.
+    // In older versions of irreducible control flow lowering there could be
+    // cases with more predecessors. To keep this lowering as generic as
+    // possible also handle those cases.
+    for (auto MBB : CycleHeaderMBB->predecessors()) {
+      if (MBB == CycleExitingMBB) {
+        CrossIterPHI.addReg(MergedMask);
+      } else {
+        B.setInsertPt(*MBB, MBB->getFirstTerminator());
+        auto ImplDef = B.buildInstr(AMDGPU::IMPLICIT_DEF, {BoolS1}, {});
+        CrossIterPHI.addReg(ImplDef.getReg(0));
+      }
+      CrossIterPHI.addMBB(MBB);
+    }
----------------
nhaehnle wrote:

This still feels vaguely off to me.

First of all, can you please add a `Cycle->isReducible()` assert here just in case, since you're relying on there only being one header? (Actually, that assert should probably be in `GenericCycle::getHeader`.)

Second, I'm worried about that case distinction between different kinds of predecessors. It really doesn't feel properly generic to me. Latches and exiting blocks aren't necessarily the same thing. (The current structurizer tends to make it that way, but even with the structurizer we probably don't have a guarantee, let alone with a future different control flow lowering.)

Let's think through some cases:

* If the predecessor is outside the cycle, it should be an `IMPLICIT_DEF`
* If it is inside the cycle, i.e. a latch, there are two cases:
  * If the predecessor is dominated by `Inst`, then clearly use `MergedMask`
  * Otherwise, it's actually not clear: could it be possible that control flow leaves the dominance region of `Inst` and goes around the loop again?

On that last point, you could perhaps make a case that because the loop is temporally divergent, having the CFG in wave CFG form requires a certain structure. But I'm not convinced. In fact, consider:
```mermaid
flowchart TD
    Header --> B
    Header --> Inst
    Inst --> B
    B --> Header
    Inst --> C
    C --> Header
    C --> Exit
```
Let's say the branches of Header and C are uniform; only the branch of the basic block containing Inst is divergent.

Then we have temporal divergence in Inst, and need to handle the `i1`. But the CFG is already *reconvergent*: the divergent branch is post-dominated by one of its immediate successors. This means it is possible to insert EXEC masking code for this CFG correctly without structuring it. Therefore, the code here should really handle this case correctly.

I think this calls for using MachineSSAUpdater for a robust solution:

* Initialize the updater with IMPLICIT_DEF in every predecessor of the header that is outside the cycle.
* Insert the MergedMask into the updater for the `Inst->getParent()`
* Then build the lane mask merging, querying the updater for the previous mask (i.e., let the updater create that register) 
* The MergedMask can still be used by UseInst due to dominance

Except now there's an additional issue with caching the MergedMask that doesn't exist for the non-i1 case: Different `UseInst`s could appear with respect to different `Cycle`s. The non-i1 case doesn't actually care about the cycle structure, but the code here does.

In particular, with my suggestions, it will use the cycle to determine how to initialize the updater.

We can still get away with a single `MergedMask` per `Inst`:

* The lazy way to do it would be to not initialize the updater with IMPLICIT_DEFs and just let it figure them out by itself. This would generate correct code.
  * However, this has a compile-time cost and could end up being too pessimistic in terms of the generated code.
  * The Inst and all its uses could be contained in a cycle. This naive way would lead to a phi node at the header of this cycle, and therefore needlessly extending the live range of the value to the entire outer cycle. In this example, the values we create only need to live from InnerHeader to UseInst, but the naive use of the updater will extend them from OuterHeader all the way through OuterLatch:
```mermaid
flowchart TD
    Start --> OuterHeader --> InnerHeader --> Inst --> InnerHeader
    Inst --> UseInst --> OuterLatch --> OuterHeader
    OuterLatch --> Exit
```
* The smarter way is to initialize the updater in the way I described, using the *largest* relevant cycle.

So instead of merely caching the `MergedMask`, we should collect all `UseInst`s and determine the largest relevant cycle (as returned by the uniformity analysis).

https://github.com/llvm/llvm-project/pull/124299