[llvm] [AMDGPU] Change control flow intrinsic lowering making the wave to re… (PR #86805)

via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 27 07:14:15 PDT 2024


llvmbot wrote:


@llvm/pr-subscribers-backend-amdgpu

Author: None (alex-t)

<details>
<summary>Changes</summary>

…converge at the end of the predecessor block.

We currently lower SI_IF/ELSE, SI_LOOP, and SI_END_CF so that the wave reconverges at the beginning of the CF join basic block or at the loop exit block. This leads to numerous issues around spill/split insertion points. The generic LLVM spill/split machinery considers the start of a block the best point to reload spilled registers, so the vector reloads end up incorrectly masked out. A similar issue arose when SplitKit split a live interval at the CF join block: the spills were inserted before the exec mask was restored.
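
To make the placement change concrete, below is a minimal hand-written wave64 sketch of the annotated IR. It is not taken from the patch's tests; the kernel and its values are hypothetical, while `@llvm.amdgcn.if.i64` and `@llvm.amdgcn.end.cf.i64` are the existing intrinsics the annotator emits. The old scheme places the end.cf call at the top of the join block; with this change it is emitted at the end of the predecessor blocks dominated by the mask definition.

```llvm
declare { i1, i64 } @llvm.amdgcn.if.i64(i1)
declare void @llvm.amdgcn.end.cf.i64(i64)

; Old placement: reconverge at the start of the join block. Any reload the
; register allocator inserts at the top of %join executes before exec is
; restored, so lanes that were inactive inside the region are masked out.
define amdgpu_kernel void @simple_if(ptr addrspace(1) %out, i32 %v) {
entry:
  %cmp = icmp eq i32 %v, 0
  %if = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cmp)
  %guard = extractvalue { i1, i64 } %if, 0
  %mask = extractvalue { i1, i64 } %if, 1
  br i1 %guard, label %then, label %join

then:                                             ; divergent side
  store i32 1, ptr addrspace(1) %out
  br label %join

join:
  call void @llvm.amdgcn.end.cf.i64(i64 %mask)
  ret void
}

; New placement: the same call ends up at the end of the predecessor block
; (%then here), so the wave has already reconverged when %join begins:
;
; then:
;   store i32 1, ptr addrspace(1) %out
;   call void @llvm.amdgcn.end.cf.i64(i64 %mask)
;   br label %join
;
; join:
;   ret void
```

With the call sunk into the predecessor, reloads and split copies that the allocator places at the start of the join block run under the fully restored exec mask.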

The current code is an early draft; the PR is meant to serve as a design review and discussion space.
Many LIT tests are XFAILed because they no longer make sense.
This was done to let the change go through PSDB and ensure that the generated code runs correctly.
The rest of the LIT test updates are under construction and will be addressed in earnest as soon as the general approach is approved by the community.

---

Patch is 3.27 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86805.diff


196 Files Affected:

- (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+15-27) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+85) 
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+1) 
- (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+2-2) 
- (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+131-384) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomic_optimizations_mul_one.ll (+3-1) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+16-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+95-53) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+97-49) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+20-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+5-3) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+29-18) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+60-28) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i32.ll (+5-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll (+4-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+190-126) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+9-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+36-25) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+28-17) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69) 
- (modified) llvm/test/CodeGen/AMDGPU/amdpal-callable.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+25-13) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+514-342) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+691-421) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+1086-686) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+191-119) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+466-310) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+442-286) 
- (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+123-91) 
- (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+15-9) 
- (modified) llvm/test/CodeGen/AMDGPU/atomics-cas-remarks-gfx90a.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+19-18) 
- (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-13) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-condition-and.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+428-376) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation-gfx10-branch-offset-bug.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+30-19) 
- (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+7-65) 
- (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36) 
- (modified) llvm/test/CodeGen/AMDGPU/byval-frame-setup.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/call-skip.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+164-92) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (+2-1) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+7-4) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+36-26) 
- (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+490-443) 
- (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+297-195) 
- (modified) llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/control-flow-optnone.ll (+2) 
- (modified) llvm/test/CodeGen/AMDGPU/convergent-inlineasm.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+10-6) 
- (modified) llvm/test/CodeGen/AMDGPU/cse-phi-incoming-val.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+15-9) 
- (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+6-3) 
- (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+169-165) 
- (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+19-8) 
- (modified) llvm/test/CodeGen/AMDGPU/elf-header-flags-mach.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/else.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/endcf-loop-header.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/fix-frame-ptr-reg-copy-livein.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+765-444) 
- (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+765-444) 
- (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+26-16) 
- (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+14-8) 
- (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+36-16) 
- (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+60-28) 
- (modified) llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+15-9) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp-wrong-subtarget.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll (+473-252) 
- (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+480-240) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+813-492) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+765-444) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1156-590) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+753-348) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+753-348) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1234-622) 
- (modified) llvm/test/CodeGen/AMDGPU/hoist-cond.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+19-12) 
- (modified) llvm/test/CodeGen/AMDGPU/i1-copy-phi.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+4-13) 
- (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+100-89) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+13-7) 
- (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+2) 
- (modified) llvm/test/CodeGen/AMDGPU/inline-asm.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+24-11) 
- (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+32-21) 
- (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+118-79) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.div.fmas.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.ordered.swap.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.exp.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-6) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-6) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ps.live.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+160-108) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+160-108) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sendmsg.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+198-134) 
- (modified) llvm/test/CodeGen/AMDGPU/local-atomics-fp.ll (+410-254) 
- (modified) llvm/test/CodeGen/AMDGPU/long-branch-reserve-register.ll (+11-8) 
- (modified) llvm/test/CodeGen/AMDGPU/loop-live-out-copy-undef-subrange.ll (+5-3) 
- (modified) llvm/test/CodeGen/AMDGPU/loop-on-function-argument.ll (+7-3) 
- (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+32-17) 
- (modified) llvm/test/CodeGen/AMDGPU/loop_exit_with_xor.ll (+34-23) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-intervals.mir (+73-58) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.mir (+65-81) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-live-variables-update.xfail.mir (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-control-flow-other-terminators.mir (+39-39) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-i1-copies-clear-kills.mir (+4-8) 
- (modified) llvm/test/CodeGen/AMDGPU/machine-sink-lane-mask.mir (+2-6) 
- (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.ll (+26-16) 
- (modified) llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir (-4) 
- (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+252-184) 
- (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.mir (-4) 
- (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+37-22) 
- (modified) llvm/test/CodeGen/AMDGPU/mixed-wave32-wave64.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw-system.ll (+20-9) 
- (modified) llvm/test/CodeGen/AMDGPU/move-to-valu-atomicrmw.ll (+10-4) 
- (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands-non-ptr-intrinsics.ll (+50-34) 
- (modified) llvm/test/CodeGen/AMDGPU/mubuf-legalize-operands.ll (+50-34) 
- (modified) llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll (+18-9) 
- (modified) llvm/test/CodeGen/AMDGPU/multi-divergent-exit-region.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+3-4) 
- (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+26-13) 
- (modified) llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll (+62-36) 
- (modified) llvm/test/CodeGen/AMDGPU/phi-elimination-end-cf.mir (+1-2) 
- (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+170-166) 
- (modified) llvm/test/CodeGen/AMDGPU/ret_jump.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/scheduler-rp-calc-one-successor-two-predecessors-bug.ll (+11-9) 
- (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+107-79) 
- (modified) llvm/test/CodeGen/AMDGPU/set-inactive-wwm-overwrite.ll (+32-19) 
- (modified) llvm/test/CodeGen/AMDGPU/setcc-sext.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll (+26-19) 
- (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+28-15) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-kill.ll (+26-12) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-noloop.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf-unreachable.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+34-20) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-dbg-info.ll (+6-7) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotate-nested-control-flows.ll (+2) 
- (modified) llvm/test/CodeGen/AMDGPU/si-annotatecfg-multiple-backedges.ll (+5-5) 
- (modified) llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies.mir (-1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow-kill.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow-unreachable-block.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-lower-control-flow.mir (+61-84) 
- (modified) llvm/test/CodeGen/AMDGPU/si-lower-i1-copies-order-of-phi-incomings.mir (+8-4) 
- (modified) llvm/test/CodeGen/AMDGPU/si-opt-vgpr-liverange-bug-deadlanes.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.ll (+14-15) 
- (modified) llvm/test/CodeGen/AMDGPU/si-optimize-vgpr-live-range-dbg-instr.mir (+2-1) 
- (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-multiple-unreachables.ll (+29-51) 
- (modified) llvm/test/CodeGen/AMDGPU/si-unify-exit-return-unreachable.ll (+7-6) 
- (modified) llvm/test/CodeGen/AMDGPU/skip-branch-trap.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+210-130) 
- (modified) llvm/test/CodeGen/AMDGPU/spill-cfg-position.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll (+60-75) 
- (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+107-79) 
- (modified) llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll (+22-10) 
- (modified) llvm/test/CodeGen/AMDGPU/stacksave_stackrestore.ll (+65-47) 
- (modified) llvm/test/CodeGen/AMDGPU/stale-livevar-in-twoaddr-pass.mir (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/subreg-coalescer-undef-use.ll (+7-3) 
- (modified) llvm/test/CodeGen/AMDGPU/transform-block-with-return-to-epilog.ll (+38-42) 
- (modified) llvm/test/CodeGen/AMDGPU/tuple-allocation-failure.ll (+114-96) 
- (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+112-84) 
- (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+54-30) 
- (modified) llvm/test/CodeGen/AMDGPU/uniform-loop-inside-nonuniform.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/uniform-phi-with-undef.ll (+9-7) 
- (modified) llvm/test/CodeGen/AMDGPU/unstructured-cfg-def-use-issue.ll (+63-35) 
- (modified) llvm/test/CodeGen/AMDGPU/urem64.ll (+83-62) 
- (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/vcmp-saveexec-to-vcmpx.ll (+1) 
- (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange-ir.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/vgpr-liverange.ll (+66-48) 
- (modified) llvm/test/CodeGen/AMDGPU/vgpr-mark-last-scratch-load.ll (+12-8) 
- (modified) llvm/test/CodeGen/AMDGPU/vgpr-spill-placement-issue61083.ll (+16-14) 
- (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+756-735) 
- (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+185-93) 
- (modified) llvm/test/CodeGen/AMDGPU/while-break.ll (+68-46) 
- (modified) llvm/test/CodeGen/AMDGPU/wqm.ll (+313-197) 
- (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll (+53-42) 
- (modified) llvm/test/CodeGen/AMDGPU/wwm-reserved.ll (+82-60) 


``````````diff
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index 58214f30bb8d67..31dcfb959e54c0 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -306,38 +306,26 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 
 /// Close the last opened control flow
 bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
 
   assert(Stack.back().first == BB);
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
-
-    SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
-    }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
-                                false);
-  }
-
   Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
+  Instruction *ExecDef = dyn_cast<Instruction>(Exec);
+  BasicBlock *DefBB = ExecDef->getParent();
+  for (auto Pred : predecessors(BB)) {
+    llvm::Loop *L = LI->getLoopFor(Pred);
+    bool IsLoopLatch = false;
+    if (L) {
+      SmallVector<BasicBlock *, 4> LL;
+      L->getLoopLatches(LL);
+      IsLoopLatch = std::find_if(LL.begin(), LL.end(), [Pred](BasicBlock *B) {
+                      return B == Pred;
+                    }) != LL.end();
+    }
+    if (Pred != DefBB && DT->dominates(DefBB, Pred) && !IsLoopLatch) {
+      BasicBlock::iterator InsPt(Pred->getTerminator());
+      IRBuilder<>(Pred, InsPt).CreateCall(EndCf, {Exec});
     }
-    IRBuilder<>(FirstInsertionPt->getParent(), FirstInsertionPt)
-        .CreateCall(EndCf, {Exec});
   }
 
   return true;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 9bc1b8eb598f3a..6f94cb16175786 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -15619,6 +15619,91 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copies to registers for the successor PHIs
+  // at the BB end. We need to move the SI_END_CF right before the branch.
+  // Even if we don't have to move SI_END_CF, we need to take care of the
+  // S_CBRANCH_SCC0/1 as SI_END_CF overwrites SCC.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_END_CF) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+
+        // Let's take care of SCC users, as SI_END_CF defines SCC
+        bool NeedPreserveSCC =
+            Next != MBB.end() && Next->readsRegister(AMDGPU::SCC);
+        MachineBasicBlock::iterator SCCDefUse(Next);
+        // This loop will never be taken as we always have S_CBRANCH_SCC1/0 at
+        // the end of the block.
+        while (!NeedPreserveSCC && SCCDefUse != MBB.end()) {
+          if (SCCDefUse->definesRegister(AMDGPU::SCC))
+            // This should never happen - SCC def after the branch reading SCC
+            break;
+          if (SCCDefUse->readsRegister(AMDGPU::SCC)) {
+            NeedPreserveSCC = true;
+            break;
+          }
+          SCCDefUse++;
+        }
+        if (NeedPreserveSCC) {
+          MachineBasicBlock::reverse_iterator BackSeeker(Next);
+          while (BackSeeker != MBB.rend()) {
+            if (BackSeeker != MI && BackSeeker->definesRegister(AMDGPU::SCC))
+              break;
+            BackSeeker++;
+          }
+          // We need this to make some artificial MIR tests happy
+          bool NeedSetSCCUndef = false;
+          if (BackSeeker == MBB.rend()) {
+            // We have reached the beginning of the block but haven't seen the
+            // SCC def. Given that the MIR is correct, either SCC is live in or
+            // the SCC operand of the SCC user is undef. In fact, we don't need
+            // to emit the instructions that preserve SCC if the use is undef.
+            // We do this just because the MIR looks weird otherwise.
+            MachineOperand *SCCUseOp =
+                SCCDefUse->findRegisterUseOperand(AMDGPU::SCC, false, TRI);
+            assert(SCCUseOp);
+            bool IsSCCLiveIn = MBB.isLiveIn(AMDGPU::SCC);
+            bool IsUseUndef = SCCUseOp->isUndef();
+            NeedSetSCCUndef = (!IsSCCLiveIn && IsUseUndef);
+          }
+          MachineBasicBlock::iterator InsPt(BackSeeker);
+          Register SavedSCC =
+              MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
+          MachineInstr *SaveSCC =
+              BuildMI(MBB, InsPt, InsPt->getDebugLoc(),
+                      TII->get(AMDGPU::S_CSELECT_B32), SavedSCC)
+                  .addImm(1)
+                  .addImm(0);
+          if (NeedSetSCCUndef) {
+
+            MachineOperand *SCCOp =
+                SaveSCC->findRegisterUseOperand(AMDGPU::SCC, false, TRI);
+            if (SCCOp)
+              SCCOp->setIsUndef();
+          }
+          Register Tmp =
+              MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
+          Next = BuildMI(MBB, Next, Next->getDebugLoc(),
+                         TII->get(AMDGPU::S_AND_B32_term), Tmp)
+                     .addReg(SavedSCC)
+                     .addImm(1);
+        }
+
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+
+        break;
+      }
+    }
+  }
+
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c19c3c6017a7c8..acfa6e88725a94 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -3051,6 +3051,7 @@ bool SIInstrInfo::analyzeBranch(MachineBasicBlock &MBB, MachineBasicBlock *&TBB,
       break;
     case AMDGPU::SI_IF:
     case AMDGPU::SI_ELSE:
+    case AMDGPU::SI_END_CF:
     case AMDGPU::SI_KILL_I1_TERMINATOR:
     case AMDGPU::SI_KILL_F32_COND_IMM_TERMINATOR:
       // FIXME: It's messy that these need to be considered here at all.
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 33c93cdf20c43b..54c0359b5272b3 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -469,8 +469,6 @@ def SI_LOOP : CFPseudoInstSI <
   let IsNeverUniform = 1;
 }
 
-} // End isTerminator = 1
-
 def SI_END_CF : CFPseudoInstSI <
   (outs), (ins SReg_1:$saved), [], 1, 1> {
   let Size = 4;
@@ -482,6 +480,8 @@ def SI_END_CF : CFPseudoInstSI <
   let mayStore = 1;
 }
 
+} // End isTerminator = 1
+
 def SI_IF_BREAK : CFPseudoInstSI <
   (outs SReg_1:$dst), (ins SReg_1:$vcc, SReg_1:$src), []> {
   let Size = 4;
diff --git a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
index f178324dbbe246..b5bd2bf02dfab7 100644
--- a/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp
@@ -82,6 +82,9 @@ class SILowerControlFlow : public MachineFunctionPass {
   SmallSet<Register, 8> RecomputeRegs;
 
   const TargetRegisterClass *BoolRC = nullptr;
+  long unsigned TestMask;
+  unsigned Select;
+  unsigned CmovOpc;
   unsigned AndOpc;
   unsigned OrOpc;
   unsigned XorOpc;
@@ -92,16 +95,14 @@ class SILowerControlFlow : public MachineFunctionPass {
   unsigned OrSaveExecOpc;
   unsigned Exec;
 
-  bool EnableOptimizeEndCf = false;
-
-  bool hasKill(const MachineBasicBlock *Begin, const MachineBasicBlock *End);
-
   void emitIf(MachineInstr &MI);
   void emitElse(MachineInstr &MI);
   void emitIfBreak(MachineInstr &MI);
   void emitLoop(MachineInstr &MI);
+  void emitWaveDiverge(MachineInstr &MI, Register EnabledLanesMask,
+                       Register DisableLanesMask);
 
-  MachineBasicBlock *emitEndCf(MachineInstr &MI);
+  void emitEndCf(MachineInstr &MI);
 
   void lowerInitExec(MachineBasicBlock *MBB, MachineInstr &MI);
 
@@ -110,8 +111,6 @@ class SILowerControlFlow : public MachineFunctionPass {
 
   void combineMasks(MachineInstr &MI);
 
-  bool removeMBBifRedundant(MachineBasicBlock &MBB);
-
   MachineBasicBlock *process(MachineInstr &MI);
 
   // Skip to the next instruction, ignoring debug instructions, and trivial
@@ -134,9 +133,6 @@ class SILowerControlFlow : public MachineFunctionPass {
     return I;
   }
 
-  // Remove redundant SI_END_CF instructions.
-  void optimizeEndCf();
-
 public:
   static char ID;
 
@@ -166,205 +162,39 @@ char SILowerControlFlow::ID = 0;
 INITIALIZE_PASS(SILowerControlFlow, DEBUG_TYPE,
                "SI lower control flow", false, false)
 
-static void setImpSCCDefDead(MachineInstr &MI, bool IsDead) {
-  MachineOperand &ImpDefSCC = MI.getOperand(3);
-  assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
-
-  ImpDefSCC.setIsDead(IsDead);
-}
-
 char &llvm::SILowerControlFlowID = SILowerControlFlow::ID;
 
-bool SILowerControlFlow::hasKill(const MachineBasicBlock *Begin,
-                                 const MachineBasicBlock *End) {
-  DenseSet<const MachineBasicBlock*> Visited;
-  SmallVector<MachineBasicBlock *, 4> Worklist(Begin->successors());
-
-  while (!Worklist.empty()) {
-    MachineBasicBlock *MBB = Worklist.pop_back_val();
-
-    if (MBB == End || !Visited.insert(MBB).second)
-      continue;
-    if (KillBlocks.contains(MBB))
-      return true;
-
-    Worklist.append(MBB->succ_begin(), MBB->succ_end());
-  }
-
-  return false;
-}
-
-static bool isSimpleIf(const MachineInstr &MI, const MachineRegisterInfo *MRI) {
-  Register SaveExecReg = MI.getOperand(0).getReg();
-  auto U = MRI->use_instr_nodbg_begin(SaveExecReg);
-
-  if (U == MRI->use_instr_nodbg_end() ||
-      std::next(U) != MRI->use_instr_nodbg_end() ||
-      U->getOpcode() != AMDGPU::SI_END_CF)
-    return false;
-
-  return true;
-}
-
 void SILowerControlFlow::emitIf(MachineInstr &MI) {
   MachineBasicBlock &MBB = *MI.getParent();
   const DebugLoc &DL = MI.getDebugLoc();
   MachineBasicBlock::iterator I(&MI);
-  Register SaveExecReg = MI.getOperand(0).getReg();
-  MachineOperand& Cond = MI.getOperand(1);
+  Register MaskElse = MI.getOperand(0).getReg();
+  MachineOperand &Cond = MI.getOperand(1);
   assert(Cond.getSubReg() == AMDGPU::NoSubRegister);
-
-  MachineOperand &ImpDefSCC = MI.getOperand(4);
-  assert(ImpDefSCC.getReg() == AMDGPU::SCC && ImpDefSCC.isDef());
-
-  // If there is only one use of save exec register and that use is SI_END_CF,
-  // we can optimize SI_IF by returning the full saved exec mask instead of
-  // just cleared bits.
-  bool SimpleIf = isSimpleIf(MI, MRI);
-
-  if (SimpleIf) {
-    // Check for SI_KILL_*_TERMINATOR on path from if to endif.
-    // if there is any such terminator simplifications are not safe.
-    auto UseMI = MRI->use_instr_nodbg_begin(SaveExecReg);
-    SimpleIf = !hasKill(MI.getParent(), UseMI->getParent());
-  }
-
-  // Add an implicit def of exec to discourage scheduling VALU after this which
-  // will interfere with trying to form s_and_saveexec_b64 later.
-  Register CopyReg = SimpleIf ? SaveExecReg
-                       : MRI->createVirtualRegister(BoolRC);
-  MachineInstr *CopyExec =
-    BuildMI(MBB, I, DL, TII->get(AMDGPU::COPY), CopyReg)
-    .addReg(Exec)
-    .addReg(Exec, RegState::ImplicitDefine);
-  LoweredIf.insert(CopyReg);
-
-  Register Tmp = MRI->createVirtualRegister(BoolRC);
-
-  MachineInstr *And =
-    BuildMI(MBB, I, DL, TII->get(AndOpc), Tmp)
-    .addReg(CopyReg)
-    .add(Cond);
-  if (LV)
-    LV->replaceKillInstruction(Cond.getReg(), MI, *And);
-
-  setImpSCCDefDead(*And, true);
-
-  MachineInstr *Xor = nullptr;
-  if (!SimpleIf) {
-    Xor =
-      BuildMI(MBB, I, DL, TII->get(XorOpc), SaveExecReg)
-      .addReg(Tmp)
-      .addReg(CopyReg);
-    setImpSCCDefDead(*Xor, ImpDefSCC.isDead());
-  }
-
-  // Use a copy that is a terminator to get correct spill code placement it with
-  // fast regalloc.
-  MachineInstr *SetExec =
-    BuildMI(MBB, I, DL, TII->get(MovTermOpc), Exec)
-    .addReg(Tmp, RegState::Kill);
+  Register CondReg = Cond.getReg();
+
+  Register MaskThen = MRI->createVirtualRegister(BoolRC);
+  // Get rid of the garbage bits in the Cond register that might come from
+  // bitwise arithmetic when one of the expression operands comes from the
+  // outer scope and hence has extra bits set.
+  MachineInstr *CondFiltered = BuildMI(MBB, I, DL, TII->get(AndOpc), MaskThen)
+                                   .add(Cond)
+                                   .addReg(Exec);
   if (LV)
-    LV->getVarInfo(Tmp).Kills.push_back(SetExec);
-
-  // Skip ahead to the unconditional branch in case there are other terminators
-  // present.
-  I = skipToUncondBrOrEnd(MBB, I);
+    LV->replaceKillInstruction(CondReg, MI, *CondFiltered);
 
-  // Insert the S_CBRANCH_EXECZ instruction which will be optimized later
-  // during SIRemoveShortExecBranches.
-  MachineInstr *NewBr = BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-                            .add(MI.getOperand(2));
+  emitWaveDiverge(MI, MaskThen, MaskElse);
 
-  if (!LIS) {
-    MI.eraseFromParent();
-    return;
+  if (LIS) {
+    LIS->InsertMachineInstrInMaps(*CondFiltered);
+    LIS->createAndComputeVirtRegInterval(MaskThen);
   }
-
-  LIS->InsertMachineInstrInMaps(*CopyExec);
-
-  // Replace with and so we don't need to fix the live interval for condition
-  // register.
-  LIS->ReplaceMachineInstrInMaps(MI, *And);
-
-  if (!SimpleIf)
-    LIS->InsertMachineInstrInMaps(*Xor);
-  LIS->InsertMachineInstrInMaps(*SetExec);
-  LIS->InsertMachineInstrInMaps(*NewBr);
-
-  LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
-  MI.eraseFromParent();
-
-  // FIXME: Is there a better way of adjusting the liveness? It shouldn't be
-  // hard to add another def here but I'm not sure how to correctly update the
-  // valno.
-  RecomputeRegs.insert(SaveExecReg);
-  LIS->createAndComputeVirtRegInterval(Tmp);
-  if (!SimpleIf)
-    LIS->createAndComputeVirtRegInterval(CopyReg);
 }
 
 void SILowerControlFlow::emitElse(MachineInstr &MI) {
-  MachineBasicBlock &MBB = *MI.getParent();
-  const DebugLoc &DL = MI.getDebugLoc();
-
-  Register DstReg = MI.getOperand(0).getReg();
-  Register SrcReg = MI.getOperand(1).getReg();
-
-  MachineBasicBlock::iterator Start = MBB.begin();
-
-  // This must be inserted before phis and any spill code inserted before the
-  // else.
-  Register SaveReg = MRI->createVirtualRegister(BoolRC);
-  MachineInstr *OrSaveExec =
-    BuildMI(MBB, Start, DL, TII->get(OrSaveExecOpc), SaveReg)
-    .add(MI.getOperand(1)); // Saved EXEC
-  if (LV)
-    LV->replaceKillInstruction(SrcReg, MI, *OrSaveExec);
-
-  MachineBasicBlock *DestBB = MI.getOperand(2).getMBB();
-
-  MachineBasicBlock::iterator ElsePt(MI);
-
-  // This accounts for any modification of the EXEC mask within the block and
-  // can be optimized out pre-RA when not required.
-  MachineInstr *And = BuildMI(MBB, ElsePt, DL, TII->get(AndOpc), DstReg)
-                          .addReg(Exec)
-                          .addReg(SaveReg);
-
-  MachineInstr *Xor =
-    BuildMI(MBB, ElsePt, DL, TII->get(XorTermrOpc), Exec)
-    .addReg(Exec)
-    .addReg(DstReg);
-
-  // Skip ahead to the unconditional branch in case there are other terminators
-  // present.
-  ElsePt = skipToUncondBrOrEnd(MBB, ElsePt);
-
-  MachineInstr *Branch =
-      BuildMI(MBB, ElsePt, DL, TII->get(AMDGPU::S_CBRANCH_EXECZ))
-          .addMBB(DestBB);
-
-  if (!LIS) {
-    MI.eraseFromParent();
-    return;
-  }
-
-  LIS->RemoveMachineInstrFromMaps(MI);
-  MI.eraseFromParent();
-
-  LIS->InsertMachineInstrInMaps(*OrSaveExec);
-  LIS->InsertMachineInstrInMaps(*And);
-
-  LIS->InsertMachineInstrInMaps(*Xor);
-  LIS->InsertMachineInstrInMaps(*Branch);
-
-  RecomputeRegs.insert(SrcReg);
-  RecomputeRegs.insert(DstReg);
-  LIS->createAndComputeVirtRegInterval(SaveReg);
-
-  // Let this be recomputed.
-  LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
+  Register InvCondReg = MI.getOperand(0).getReg();
+  Register CondReg = MI.getOperand(1).getReg();
+  emitWaveDiverge(MI, CondReg, InvCondReg);
 }
 
 void SILowerControlFlow::emitIfBreak(MachineInstr &MI) {
@@ -425,141 +255,137 @@ void SILowerControlFlow::emitLoop(MachineInstr &MI) {
   MachineBasicBlock &MBB = *MI.getParent();
   const DebugLoc &DL = MI.getDebugLoc();
 
-  MachineInstr *AndN2 =
-      BuildMI(MBB, &MI, DL, TII->get(Andn2TermOpc), Exec)
-          .addReg(Exec)
-          .add(MI.getOperand(0));
+  Register Cond = MI.getOperand(0).getReg();
+  Register MaskLoop = MRI->createVirtualRegister(BoolRC);
+  Register MaskExit = MRI->createVirtualRegister(BoolRC);
+  Register AndZero = MRI->createVirtualRegister(BoolRC);
+  MachineInstr *CondLoop = BuildMI(MBB, &MI, DL, TII->get(XorOpc), MaskLoop)
+                                   .addReg(Cond)
+                                   .addReg(Exec);
+
+  MachineInstr *ExitExec = BuildMI(MBB, &MI, DL, TII->get(OrOpc), MaskExit)
+  .addReg(Cond)
+  .addReg(Exec);
+
+  MachineInstr *IfZeroMask = BuildMI(MBB, &MI, DL, TII->get(AndOpc), AndZero)
+                                 .addReg(MaskLoop)
+                                 .addImm(TestMask);
+
+  MachineInstr *SetExec= BuildMI(MBB, &MI, DL, TII->get(Select), Exec)
+                                     .addReg(MaskLoop)
+                                     .addReg(MaskExit);
+
   if (LV)
-    LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *AndN2);
+    LV->replaceKillInstruction(MI.getOperand(0).getReg(), MI, *SetExec);
 
   auto BranchPt = skipToUncondBrOrEnd(MBB, MI.getIterator());
   MachineInstr *Branch =
-      BuildMI(MBB, BranchPt, DL, TII->get(AMDGPU::S_CBRANCH_EXECNZ))
+      BuildMI(MBB, BranchPt, DL, TII->get(AMDGPU::S_CBRANCH_SCC1))
           .add(MI.getOperand(1));
 
   if (LIS) {
     RecomputeRegs.insert(MI.getOperand(0).getReg());
-    LIS->ReplaceMachineInstrInMaps(MI, *AndN2);
+    LIS->ReplaceMachineInstrInMaps(MI, *SetExec);
+    LIS->InsertMachineInstrInMaps(*CondLoop);
+    LIS->InsertMachineInstrInMaps(*IfZeroMask);
+    LIS->InsertMachineInstrInMaps(*ExitExec);
     LIS->InsertMachineInstrInMaps(*Branch);
+    LIS->createAndComputeVirtRegInterval(MaskLoop);
+    LIS->createAndComputeVirtRegInterval(MaskExit);
+    LIS->createAndComputeVirtRegInterval(AndZero);
   }
 
   MI.eraseFromParent();
 }
 
-MachineBasicBlock::iterator
-SILowerControlFlow::skipIgnoreExecInstsTrivialSucc(
-  MachineBasicBlock &MBB, MachineBasicBlock::iterator It) const {
+void SILowerControlFlow::emitWaveDiverge(MachineInstr &MI,
+                                         Register EnabledLanesMask,
+                                         Register DisableLanesMask) {
+  MachineBasicBlock &MBB = *MI.getParent();
+  const DebugLoc &DL = MI.getDebugLoc();
+  MachineBasicBlock::iterator I(MI);
 
-  SmallSet<const MachineBasicBlock *, 4> Visited;
-  MachineBasicBlock *B = &MBB;
-  do {
-    if (!Visited.insert(B).second)
-      return MBB.end();
+  MachineInstr *CondInverted =
+      BuildMI(MBB, I, DL, TII->get(XorOpc), DisableLanesMask)
+          .addReg(EnabledLanesMask)
+          .addReg(Exec);
 
-    auto E = B->end();
-    for ( ; It != E; ++It) {
-      if (TII->mayReadEXEC(*MRI, *It))
+  if (LV) {
+    LV->replaceKillInstruction(DisableLanesMask, MI, *CondInverted);
+  }
+
+  Register TestResultReg = MRI->createVirtualRegister(BoolRC);
+  MachineInstr *IfZeroMask =
+      BuildMI(MBB, I, DL, TII->get(AndOpc), TestResultReg)
+          .addReg(EnabledLanesMask)
+          .addImm(TestMask);
+
+  Ma...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/86805


More information about the llvm-commits mailing list