[clang] [llvm] [AMDGPU] Change CF intrinsics lowering to reconverge on predecessors (PR #108596)

Fri Sep 13 09:10:22 PDT 2024

llvmbot wrote:




@llvm/pr-subscribers-llvm-ir

Author: None (alex-t)

<details>
<summary>Changes</summary>

### Brief overview:
As we mask the bits on the CF diverge, we have to set them back after executing the instructions in a conditional block.
This currently happens at the beginning of the block where the CF converges - i.e. in the immediate post-dominator of the block where the CF diverged.  We have to ensure that none of the instructions that read the EXEC mask register are inserted before masked EXEC bits are restored to the state they had before the CF divergence. For that, we tuned the TargetInstrInfo::isBasicBlockPrologue method to report any instruction that writes EXEC as belonging to the prologue. Then it appeared that instructions loading the values spilled to the memory and used in the current block must be placed at the block beginning before they are used but after the point where EXEC mask is restored, since they are loading VGPRs and, hence, read EXEC. Hence, we had to consider all spilling opcodes to belong to the block prologue. This solution worked well until we faced the problem with live interval splitting. To split the LI across the given physical register we need to insert a copy of a virtual register before the point where the current interval interferes with another one, already assigned to the same physical register. If the LI being split is "live-in" in the block we need to put a copy at the beginning of the block. SplitKit requests isBasicBlockPrologue for the proper insertion point.
### Let's consider the following example:
`$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc`
`%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec`
`%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec`
`%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec`
**`%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec`**
              **<-- another LI assigned to the same physreg starts here**
`%129657:vgpr_32 = COPY %14037:vgpr_32    <-- insertion point chosen after the "prologue"`
We are to split the LI for the virtual register %14037. The COPY position was chosen after all reloads as they are considered to belong to the block prologue. Any of them could start live interval that might have been already assigned to the same physical register which we are aiming to split across. In this case, we hit an "assert" in SplitKit.cpp reporting that the insertion point returned by the target will cause interference.

**The root cause for the all mentioned troubles is an attempt to make the common spill/split logic aware of (and involved in) the sophisticated and target-specific details of the control flow implementation.**

### We have 2 options to address this issue:
1. Further tune isBasicBlockPrologue. We could pass the LeaveBefore slot index as an argument and return "false" if the current instruction matches this slot. This could work if we only pass a valid slot index from the exact call site (SplitKit.cpp) but use an invalid slot index as a default. Although, this would require changes in the common code in the split kit, basic regalloc etc. 
2. Restore the EXEC mask ***before*** the block where the CF converges, i,e, on the post-dominance frontier. That requires a massive change along the whole CF lowering stack but puts aside all interactions with the insertion position.

### What was done
We opted for the 2nd one. We made the EXEC manipulation at the divergence point conditional which allows us to change the EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block.

---

Patch is 10.24 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108596.diff


329 Files Affected:

- (modified) clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu (+1-1) 
- (modified) llvm/include/llvm/Analysis/CFGPrinter.h (+5-3) 
- (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+1-2) 
- (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+1-2) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+1-1) 
- (modified) llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp (+1-2) 
- (modified) llvm/lib/CodeGen/InlineSpiller.cpp (+1-1) 
- (modified) llvm/lib/CodeGen/MachineBasicBlock.cpp (+2-2) 
- (modified) llvm/lib/CodeGen/SplitKit.cpp (+2-4) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-4) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+1-1) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+10-12) 
- (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+46-37) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+28-5) 
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+40-23) 
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1-2) 
- (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+10-3) 
- (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+202-314) 
- (modified) llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp (+36) 
- (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+1-1) 
- (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/hidden-diverge-gmir.mir (+8-8) 
- (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/temporal-divergence.mir (+7-28) 
- (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+2-4) 
- (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+158-156) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+158-156) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+12-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+3-11) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+95-74) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+37-61) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+94-62) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+50-76) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+14-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+6-18) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (-4) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+23-14) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+48-32) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+8-8) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f64.ll (+2-6) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+34-42) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (+12-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+3-3) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/is-safe-to-sink-bug.ll (+8-6) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-amdgcn.if.xfail.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-brcond.mir (+11-11) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+60-68) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-40) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-60) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+24-60) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+12-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-30) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-25) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-45) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-45) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-50) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-240) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10) 
- (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i32.ll (+6-6) 
- (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i64.ll (+5-6) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+196-168) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+13-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+23-15) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.32.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.64.mir (+2-2) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69) 
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+79-57) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+21-13) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+333-240) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+1264-1000) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3478-2793) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+199-155) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+286-200) 
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+286-200) 
- (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+114-92) 
- (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+9-9) 
- (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+25-21) 
- (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+306-278) 
- (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+1508-1508) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+1112-1101) 
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+1112-1101) 
- (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+5-65) 
- (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (-1) 
- (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3) 
- (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-28) 
- (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+493-474) 
- (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+281-198) 
- (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13) 
- (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+9-9) 
- (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3) 
- (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+394-442) 
- (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+405-349) 
- (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+13-8) 
- (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1) 
- (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9) 
- (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+77-42) 
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+2595-2484) 
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+1395-1395) 
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+1395-1395) 
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+1720-1720) 
- (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+549-534) 
- (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+549-534) 
- (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15) 
- (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8) 
- (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11) 
- (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+6-4) 
- (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+338-204) 
- (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+11-6) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f64.ll (-2) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll (+11-8) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+2439-2426) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+1641-1641) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+1641-1641) 
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+1994-1994) 
- (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+320-272) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+549-534) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+549-534) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1711-1212) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1097-793) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1097-793) 
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1665-1186) 
- (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+14-12) 
- (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19) 
- (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+245-302) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+298-290) 
- (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+110-131) 
- (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+4) 
- (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+18-10) 
- (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+120-91) 
- (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+9-9) 
- (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40) 
- (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+459-255) 
- (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+48-36) 
- (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75) 
- (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-8) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-8) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+40-29) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+40-29) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.v2bf16.ll (+4-4) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+44-32) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+17-15) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+44-32) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.v2bf16.ll (+8-8) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_nortn.ll (+26-26) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_rtn.ll (+20-20) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f32.ll (+34-32) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f64.ll (+12-12) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmin.f32.ll (+34-32) 


``````````diff

diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
     unsigned OpNo = I.getSuccessorIndex();
     const Instruction *TI = Node->getTerminator();
     BasicBlock *SuccBB = TI->getSuccessor(OpNo);
-    auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
-    double WeightPercent = ((double)BranchProb.getNumerator()) /
-                           ((double)BranchProb.getDenominator());
+    // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+    // double WeightPercent = ((double)BranchProb.getNumerator()) /
+    //                        ((double)BranchProb.getDenominator());
+    double WeightPercent = 0.5;
+
     std::string TTAttr =
         formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
                 getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
   /// debug.  This is the correct point to insert copies at the beginning of a
   /// basic block. \p Reg is the register being used by a spill or defined for a
   /// restore/split during register allocation.
-  iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
-                                  bool SkipPseudoOp = true);
+  iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
 
   /// Returns an iterator to the first terminator instruction of this basic
   /// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
   /// other instructions shall be inserted before it. This can be implemented
   /// to prevent register allocator to insert spills for \p Reg before such
   /// instructions.
-  virtual bool isBasicBlockPrologue(const MachineInstr &MI,
-                                    Register Reg = Register()) const {
+  virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
     return false;
   }
 
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
   [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
 
       if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
         RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
-        auto EHPadInsertPoint =
-            EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+        auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
         insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
         LLVM_DEBUG(dbgs() << "...also reload at EHPad "
                           << printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
   MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
   MachineBasicBlock::iterator MII;
   if (SrcVNI->isPHIDef())
-    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
   else {
     MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
     assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
 
 MachineBasicBlock::iterator
 MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
-                                          Register Reg, bool SkipPseudoOp) {
+                                          bool SkipPseudoOp) {
   const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
 
   iterator E = end();
   while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
                     (SkipPseudoOp && I->isPseudoProbe()) ||
-                    TII->isBasicBlockPrologue(*I, Reg)))
+                    TII->isBasicBlockPrologue(*I)))
     ++I;
   // FIXME: This needs to change if we wish to bundle labels / dbg_values
   // inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
     return Start;
   }
 
-  unsigned RegIdx = 0;
-  Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
-  VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
-                              MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+  VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+                              MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
   RegAssign.insert(Start, VNI->def, OpenIdx);
   LLVM_DEBUG(dump());
   return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
-    .addReg(ExecReg)
-    .addReg(NewExec);
+    .addDef(LoopMask)
+	.addReg(ExecReg)
+	.addReg(NewExec);
 
   // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+      .addReg(LoopMask)
+      .addReg(NewExec)
+      .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
 
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
-
-  assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  if (succ_empty(BB))
+    return false;
 
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BBs single successor is a top of the stack. We need to
+    // reconverge over thaqt path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+    }
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // THis block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make a room for further WAVE_RECONVERGE insertion
     SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
                                 false);
   }
 
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      // Let's take care of uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
+    if (isUniform(Term))
+      // Uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
     Changed |= openIf(Term);
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copy to regs for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+        break;
+      }
+    }
+  }
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/108596