[Openmp-commits] [llvm] [openmp] [offload][OpenMP] Fix partial warp reduction (PR #205861)

Thu Jun 25 10:06:23 PDT 2026

https://github.com/ro-i created https://github.com/llvm/llvm-project/pull/205861

Don't use a full warp for the final cross-team reduction if the reduction's thread limit is below the warp size.

Claude assisted with the test.

>From 5c8dd4e401f1fc26ec7659cb32dc9acd8eb8d40e Mon Sep 17 00:00:00 2001
From: Robert Imschweiler <robert.imschweiler at amd.com>
Date: Thu, 25 Jun 2026 12:00:23 -0500
Subject: [PATCH] [offload][OpenMP] Fix partial warp reduction

Don't use a full warp for the final cross-team reduction if the
reduction's thread limit is below the warp size.

Claude assisted with the test.
---
 .../xteam_min_reduction_partial_wave.c        | 41 +++++++++++++++++++
 openmp/device/src/Reduction.cpp               |  2 +-
 2 files changed, 42 insertions(+), 1 deletion(-)
 create mode 100644 offload/test/offloading/xteam_min_reduction_partial_wave.c

diff --git a/offload/test/offloading/xteam_min_reduction_partial_wave.c b/offload/test/offloading/xteam_min_reduction_partial_wave.c
new file mode 100644
index 0000000000000..cb381e97fdc7b
--- /dev/null
+++ b/offload/test/offloading/xteam_min_reduction_partial_wave.c
@@ -0,0 +1,41 @@
+// RUN: %libomptarget-compile-run-and-check-generic
+// RUN: %libomptarget-compileopt-run-and-check-generic
+
+// REQUIRES: gpu
+// UNSUPPORTED: intelgpu
+
+// Regression test for a cross-team reduction bug where the final xteam
+// reduction used a full-wave reduction path even though the kernel was launched
+// with fewer threads than the device wave size.  On wave64 AMDGPU targets,
+// thread_limit(32) creates a partial wave.  The 63-team case uses the
+// single-thread final reduction path; the 64-team case crosses the wave-size
+// boundary and must still ignore inactive lanes.
+
+#include <limits.h>
+#include <stdio.h>
+
+static unsigned reduce_min(int teams, int seed) {
+  unsigned min_val = UINT_MAX;
+
+#pragma omp target teams distribute parallel for num_teams(teams)               \
+    thread_limit(32) map(to : seed) reduction(min : min_val)
+  for (int i = 0; i < 2017; ++i) {
+    unsigned val = 0xdeadbeefU + ((i + seed) & 1);
+    if (val < min_val)
+      min_val = val;
+  }
+
+  return min_val;
+}
+
+int main(int argc, char **argv) {
+  unsigned min63 = reduce_min(63, argc);
+  unsigned min64 = reduce_min(64, argc);
+
+  // CHECK: min63 = 0xdeadbeef
+  // CHECK: min64 = 0xdeadbeef
+  printf("min63 = %#x\n", min63);
+  printf("min64 = %#x\n", min64);
+
+  return min63 == 0xdeadbeefU && min64 == 0xdeadbeefU ? 0 : 1;
+}
diff --git a/openmp/device/src/Reduction.cpp b/openmp/device/src/Reduction.cpp
index ec772d357a425..df444fb9d90e4 100644
--- a/openmp/device/src/Reduction.cpp
+++ b/openmp/device/src/Reduction.cpp
@@ -280,7 +280,7 @@ int32_t __kmpc_gpu_xteam_reduce_nowait(IdentTy *Loc, void *reduce_data,
     return 0;
 
   // The last team performs final reduction across all team values.
-  NumThreads = kmpc_min(NumThreads, round_down_to_warpsize(NumTeams));
+  NumThreads = round_down_to_warpsize(kmpc_min(NumThreads, NumTeams));
   if (ThreadId >= NumThreads)
     return 0;