[llvm] [CHR] Skip regions containing convergent calls (PR #180882)

Yaxun Liu via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 10 20:33:33 PST 2026


https://github.com/yxsamliu created https://github.com/llvm/llvm-project/pull/180882

[CHR] Skip regions containing convergent calls

CHR (Control Height Reduction) merges multiple biased branches into a
single speculative check, cloning the region into hot/cold paths. On
GPU targets, the merged branch may be divergent (evaluated per-thread),
splitting the wavefront: some threads take the hot path, others the
cold path.

A convergent call like ds_bpermute (a cross-lane operation on AMDGPU)
requires a specific set of threads to be active — when thread X reads
from thread Y, thread Y must be active and participating in the same
call. After CHR cloning, thread Y may have gone to the cold path while
thread X is on the hot path, so the hot-path ds_bpermute reads a stale
register value from thread Y instead of the intended value.

This caused a miscompilation in rocPRIM's lookback scan: CHR duplicated
a region containing ds_bpermute, and the hot-path copy executed with a
different set of active threads, reading incorrect cross-lane data and
causing a memory access fault.

The fix skips any region containing convergent or noduplicate calls,
following the same pattern as SimplifyCFG's block-duplication guard.



From a84aeb8a4e6f3dd16a7850ca17feb7751c813229 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun.liu at amd.com>
Date: Sat, 7 Feb 2026 19:04:44 -0500
Subject: [PATCH] [CHR] Skip regions containing convergent calls
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CHR (Control Height Reduction) merges multiple biased branches into a
single speculative check, cloning the region into hot/cold paths. On
GPU targets, the merged branch may be divergent (evaluated per-thread),
splitting the wavefront: some threads take the hot path, others the
cold path.

A convergent call like ds_bpermute (a cross-lane operation on AMDGPU)
requires a specific set of threads to be active — when thread X reads
from thread Y, thread Y must be active and participating in the same
call. After CHR cloning, thread Y may have gone to the cold path while
thread X is on the hot path, so the hot-path ds_bpermute reads a stale
register value from thread Y instead of the intended value.

This caused a miscompilation in rocPRIM's lookback scan: CHR duplicated
a region containing ds_bpermute, and the hot-path copy executed with a
different set of active threads, reading incorrect cross-lane data and
causing a memory access fault.

The fix skips any region containing convergent or noduplicate calls,
following the same pattern as SimplifyCFG's block-duplication guard.
---
 .../ControlHeightReduction.cpp                |  20 +-
 .../Transforms/PGOProfile/chr-convergent.ll   | 179 ++++++++++++++++++
 2 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 llvm/test/Transforms/PGOProfile/chr-convergent.ll

diff --git a/llvm/lib/Transforms/Instrumentation/ControlHeightReduction.cpp b/llvm/lib/Transforms/Instrumentation/ControlHeightReduction.cpp
index c7b941319f8b9..f62d29e01540d 100644
--- a/llvm/lib/Transforms/Instrumentation/ControlHeightReduction.cpp
+++ b/llvm/lib/Transforms/Instrumentation/ControlHeightReduction.cpp
@@ -744,10 +744,28 @@ CHRScope * CHR::findScope(Region *R) {
     // FIXME: This could lead to less optimal codegen, because the region is
     // excluded, it can prevent CHR from merging adjacent regions into bigger
     // scope and hoisting more branches.
-    for (Instruction &I : *BB)
+    for (Instruction &I : *BB) {
       if (auto *II = dyn_cast<IntrinsicInst>(&I))
         if (II->getIntrinsicID() == Intrinsic::coro_id)
           return nullptr;
+      // Can't clone regions containing convergent or noduplicate calls.
+      //
+      // CHR clones a region into hot/cold paths guarded by a merged
+      // speculative branch. On GPU targets, this branch may be divergent
+      // (different threads evaluate it differently), splitting the set of
+      // threads that reach each copy. A convergent call (e.g. a cross-lane
+      // operation like ds_bpermute on AMDGPU) requires a specific set of
+      // threads to be active; when CHR places a copy on the hot path, only
+      // the threads that took the hot branch are active, so the operation
+      // reads stale values from threads that went to the cold path.
+      //
+      // Similarly, noduplicate calls must not be duplicated by definition.
+      //
+      // This matches SimplifyCFG's block-duplication guard.
+      if (auto *CB = dyn_cast<CallBase>(&I))
+        if (CB->cannotDuplicate() || CB->isConvergent())
+          return nullptr;
+    }
   }
 
   if (Exit) {
diff --git a/llvm/test/Transforms/PGOProfile/chr-convergent.ll b/llvm/test/Transforms/PGOProfile/chr-convergent.ll
new file mode 100644
index 0000000000000..e8d0235d0ecb6
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/chr-convergent.ll
@@ -0,0 +1,179 @@
+; Test that CHR does not transform regions containing convergent or
+; noduplicate calls, following the same guard as SimplifyCFG.
+;
+; CHR (Control Height Reduction) merges multiple biased branches into a
+; single speculative check, cloning the region into hot/cold paths. On GPU
+; targets, this merged branch may be divergent (per-thread), splitting the
+; wavefront: some threads take the hot path, others the cold path.
+;
+; A convergent call like ds_bpermute (a cross-lane operation on AMDGPU)
+; requires a specific set of threads to be active — when thread X reads
+; from thread Y via ds_bpermute, thread Y must be active and participating
+; in the same call. After CHR cloning, thread Y may have gone to the cold
+; path while thread X is on the hot path, so the hot-path ds_bpermute reads
+; a stale register value from thread Y instead of the intended value.
+;
+; Similarly, noduplicate calls must not be duplicated by definition.
+;
+; RUN: opt < %s -passes='require<profile-summary>,function(chr)' -S | FileCheck %s
+
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
+target triple = "amdgcn-amd-amdhsa"
+
+declare i32 @llvm.amdgcn.workitem.id.x()
+declare i32 @llvm.amdgcn.ds.bpermute(i32, i32) #0
+
+; Two biased divergent branches where the first region contains a convergent
+; cross-lane operation (ds_bpermute). CHR must not clone this region.
+;
+; Original code (should be preserved as-is):
+;   if (val > 0)           // Biased true, per-thread condition
+;     result = bpermute()  // Cross-lane read: thread X reads from thread Y
+;   if (val < 100)         // Biased true, per-thread condition
+;     output[tid] = result
+;
+; Without this fix, CHR would transform to:
+;   if (val > 0 && val < 100) {  // Merged speculative branch (hot path)
+;     result = bpermute()        // BUG: thread Y may not be on hot path,
+;                                //   so thread X reads stale value from Y
+;     output[tid] = result
+;   } else {                     // Cold path (.nonchr clone)
+;     if (val > 0) result = bpermute()
+;     if (val < 100) output[tid] = result
+;   }
+;
+; The merged branch splits the wavefront differently than the original
+; branches, changing which threads are active at the bpermute call site.
+;
+define amdgpu_kernel void @test_chr_convergent(
+    ptr addrspace(1) %input,
+    ptr addrspace(1) %output) !prof !14 {
+; CHECK-LABEL: @test_chr_convergent(
+; CHECK-NOT: nonchr
+; CHECK: entry:
+; CHECK:   %cond1 = icmp sgt i32 %val, 0
+; CHECK:   br i1 %cond1, label %bb1, label %merge1
+; CHECK: bb1:
+; CHECK:   %perm = call i32 @llvm.amdgcn.ds.bpermute(i32 %lane_idx, i32 %val)
+; CHECK: merge1:
+; CHECK:   %cond2 = icmp slt i32 %val, 100
+; CHECK:   br i1 %cond2, label %bb2, label %merge2
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %gep_in = getelementptr inbounds i32, ptr addrspace(1) %input, i32 %tid
+  %val = load i32, ptr addrspace(1) %gep_in, align 4
+  %cond1 = icmp sgt i32 %val, 0
+  br i1 %cond1, label %bb1, label %merge1, !prof !15
+
+bb1:
+  %lane_idx = shl i32 %tid, 2
+  %perm = call i32 @llvm.amdgcn.ds.bpermute(i32 %lane_idx, i32 %val)
+  br label %merge1
+
+merge1:
+  %result = phi i32 [ %perm, %bb1 ], [ 0, %entry ]
+  %cond2 = icmp slt i32 %val, 100
+  br i1 %cond2, label %bb2, label %merge2, !prof !15
+
+bb2:
+  %gep_out = getelementptr inbounds i32, ptr addrspace(1) %output, i32 %tid
+  store i32 %result, ptr addrspace(1) %gep_out, align 4
+  br label %merge2
+
+merge2:
+  ret void
+}
+
+; Same pattern but with a noduplicate call instead of convergent.
+; CHR must also skip this region.
+declare void @noduplicate_callee() #1
+
+define amdgpu_kernel void @test_chr_noduplicate(
+    ptr addrspace(1) %input,
+    ptr addrspace(1) %output) !prof !14 {
+; CHECK-LABEL: @test_chr_noduplicate(
+; CHECK-NOT: nonchr
+; CHECK: entry:
+; CHECK:   br i1 %cond1, label %bb1, label %merge1
+; CHECK: bb1:
+; CHECK:   call void @noduplicate_callee()
+; CHECK: merge1:
+; CHECK:   br i1 %cond2, label %bb2, label %merge2
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %gep_in = getelementptr inbounds i32, ptr addrspace(1) %input, i32 %tid
+  %val = load i32, ptr addrspace(1) %gep_in, align 4
+  %cond1 = icmp sgt i32 %val, 0
+  br i1 %cond1, label %bb1, label %merge1, !prof !15
+
+bb1:
+  call void @noduplicate_callee()
+  br label %merge1
+
+merge1:
+  %cond2 = icmp slt i32 %val, 100
+  br i1 %cond2, label %bb2, label %merge2, !prof !15
+
+bb2:
+  %gep_out = getelementptr inbounds i32, ptr addrspace(1) %output, i32 %tid
+  store i32 %val, ptr addrspace(1) %gep_out, align 4
+  br label %merge2
+
+merge2:
+  ret void
+}
+
+; A case without convergent or noduplicate calls — CHR should still transform.
+; This verifies the fix is targeted (not overly broad).
+define amdgpu_kernel void @test_chr_no_convergent(
+    ptr addrspace(1) %input,
+    ptr addrspace(1) %output) !prof !14 {
+; CHECK-LABEL: @test_chr_no_convergent(
+; CHECK: entry.split:
+; CHECK: entry.split.nonchr:
+entry:
+  %tid = call i32 @llvm.amdgcn.workitem.id.x()
+  %gep_in = getelementptr inbounds i32, ptr addrspace(1) %input, i32 %tid
+  %val = load i32, ptr addrspace(1) %gep_in, align 4
+  %cond1 = icmp sgt i32 %val, 0
+  br i1 %cond1, label %bb1, label %merge1, !prof !15
+
+bb1:
+  %doubled = mul i32 %val, 2
+  br label %merge1
+
+merge1:
+  %result = phi i32 [ %doubled, %bb1 ], [ 0, %entry ]
+  %cond2 = icmp slt i32 %val, 100
+  br i1 %cond2, label %bb2, label %merge2, !prof !15
+
+bb2:
+  %gep_out = getelementptr inbounds i32, ptr addrspace(1) %output, i32 %tid
+  store i32 %result, ptr addrspace(1) %gep_out, align 4
+  br label %merge2
+
+merge2:
+  ret void
+}
+
+attributes #0 = { convergent nounwind readnone willreturn }
+attributes #1 = { noduplicate nounwind }
+
+!llvm.module.flags = !{!0}
+!0 = !{i32 1, !"ProfileSummary", !1}
+!1 = !{!2, !3, !4, !5, !6, !7, !8, !9}
+!2 = !{!"ProfileFormat", !"InstrProf"}
+!3 = !{!"TotalCount", i64 10000}
+!4 = !{!"MaxCount", i64 10}
+!5 = !{!"MaxInternalCount", i64 1}
+!6 = !{!"MaxFunctionCount", i64 1000}
+!7 = !{!"NumCounts", i64 3}
+!8 = !{!"NumFunctions", i64 3}
+!9 = !{!"DetailedSummary", !10}
+!10 = !{!11, !12, !13}
+!11 = !{i32 10000, i64 100, i32 1}
+!12 = !{i32 999000, i64 100, i32 1}
+!13 = !{i32 999999, i64 1, i32 2}
+
+!14 = !{!"function_entry_count", i64 100}
+!15 = !{!"branch_weights", i32 999, i32 1}
