[Mlir-commits] [llvm] [mlir] [OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref reductions (PR #185460)

Mon Mar 9 09:59:25 PDT 2026

https://github.com/jeffreysandoval created https://github.com/llvm/llvm-project/pull/185460

The `ReductionDataSize` field in `KernelEnvironmentTy` and the `MaxDataSize` used to compute the `reduce_data_size` argument to `__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer types for by-ref reductions instead of the actual element types. This caused the global teams reduction buffer to be undersized relative to the offsets used by the copy/reduce callbacks, resulting in out-of-bounds accesses faults at runtime.

For example, a by-ref reduction over `[4 x i32]` (16 bytes) would allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the generated callbacks would access 16 bytes per slot.

Fix both computation sites:

1. In MLIR's `getReductionDataSize()`, use `DeclareReductionOp::getByrefElementType()` instead of `getType()` when the reduction is by-ref, so the reduction buffer struct layout (and more importantly its size) matches that emitted by the `OMPIRBuilder`.

2. In `OMPIRBuilder::createReductionsGPU()`, use `ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref reductions when computing `MaxDataSize`.  It seems that `MaxDataSize` isn't actually used in the deviceRTL, but it's better to fix it to avoid future propagation of this bug.

Finally, add CHECK lines to the existing array-descriptor reduction test to verify both the kernel environment `ReductionDataSize` and the `reduce_data_size` call argument reflect the actual element type size.

>From 4f816f6d9a3fe8e12f6c030aa7e7d7272596c9f7 Mon Sep 17 00:00:00 2001
From: Jeffrey Sandoval <jeffrey.sandoval at hpe.com>
Date: Mon, 9 Mar 2026 11:01:12 -0500
Subject: [PATCH] [OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref
 reductions

The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds accesses faults at runtime.

For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.

Fix both computation sites:

1. In MLIR's `getReductionDataSize()`, use
   `DeclareReductionOp::getByrefElementType()` instead of `getType()`
   when the reduction is by-ref, so the reduction buffer struct layout
   (and more importantly its size) matches that emitted by the
   `OMPIRBuilder`.

2. In `OMPIRBuilder::createReductionsGPU()`, use
   `ReductionInfo::ByRefElementType` instead of `ElementType` for
   by-ref reductions when computing `MaxDataSize`.  It seems that
   `MaxDataSize` isn't actually used in the deviceRTL, but it's better
   to fix it to avoid future propagation of this bug.

Finally, add CHECK lines to the existing array-descriptor reduction
test to verify both the kernel environment `ReductionDataSize` and the
`reduce_data_size` call argument reflect the actual element type size.
---
 llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp     | 19 ++++++++++++++++++-
 .../OpenMP/OpenMPToLLVMIRTranslation.cpp      | 11 +++++++++--
 ...distribute-reduction-array-descriptor.mlir | 13 +++++++++++++
 3 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 1d522e9b14c7a..5be83068b3544 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4375,10 +4375,27 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createReductionsGPU(
 
   Value *RL = Builder.CreatePointerBitCastOrAddrSpaceCast(ReductionList, PtrTy);
 
+  // NOTE: ReductionDataSize is passed as the reduce_data_size
+  // argument to __kmpc_nvptx_{parallel,teams}_reduce_nowait_v2, but
+  // the runtime implementations do not currently use it.  The teams
+  // runtime reads ReductionDataSize from KernelEnvironmentTy instead
+  // (set separately via TargetKernelDefaultAttrs).  It is computed
+  // here conservatively as max(element sizes) * N rather than the
+  // exact sum, which over-calculates the size for mixed reduction
+  // types but is harmless given the argument is unused.
+  // TODO: Consider dropping this computation if the runtime API is
+  // ever revised to remove the unused parameter.
   unsigned MaxDataSize = 0;
   SmallVector<Type *> ReductionTypeArgs;
   for (auto En : enumerate(ReductionInfos)) {
-    auto Size = M.getDataLayout().getTypeStoreSize(En.value().ElementType);
+    // Use ByRefElementType for by-ref reductions so that MaxDataSize matches
+    // the actual data size stored in the global reduction buffer, consistent
+    // with the ReductionsBufferTy struct used for GEP offsets below.
+    Type *SizeType =
+        (!IsByRef.empty() && IsByRef[En.index()] && En.value().ByRefElementType)
+            ? En.value().ByRefElementType
+            : En.value().ElementType;
+    auto Size = M.getDataLayout().getTypeStoreSize(SizeType);
     if (Size > MaxDataSize)
       MaxDataSize = Size;
     Type *RedTypeArg = (!IsByRef.empty() && IsByRef[En.index()])
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 3e7a6c88aec3a..0d77e1a1a9c6a 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -6269,8 +6269,15 @@ static uint64_t getReductionDataSize(OpTy &op) {
 
     llvm::SmallVector<mlir::Type> members;
     members.reserve(reductions.size());
-    for (omp::DeclareReductionOp &red : reductions)
-      members.push_back(red.getType());
+    for (omp::DeclareReductionOp &red : reductions) {
+      // For by-ref reductions, use the actual element type rather than the
+      // pointer type so that the buffer size matches the access pattern in
+      // the copy/reduce callbacks generated by OMPIRBuilder.
+      if (red.getByrefElementType())
+        members.push_back(*red.getByrefElementType());
+      else
+        members.push_back(red.getType());
+    }
     Operation *opp = op.getOperation();
     auto structType = mlir::LLVM::LLVMStructType::getLiteral(
         opp->getContext(), members, /*isPacked=*/false);
diff --git a/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir b/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
index 84b4a0e71c36f..663b78261e06c 100644
--- a/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
@@ -48,6 +48,16 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<"dlti.alloca_memory_space" = 5 :
   }
 }
 
+// Verify kernel environment has correct ReductionDataSize for by-ref array
+// reduction.  The by-ref element type is [4 x i32] = 16 bytes, so the
+// struct should be {[4 x i32]} = 16 bytes.  Failing to account for the by-ref
+// indirection would result in a struct of {ptr} = 8 bytes.
+// AMDGCN: @{{.*}}_kernel_environment = {{.*}} %struct.ConfigurationEnvironmentTy { {{.*}}i32 16, i32 1024 }
+
+// Verify the reduce_data_size argument to __kmpc_nvptx_teams_reduce_nowait_v2
+// matches the by-ref element type size (16), not the pointer size (8).
+// AMDGCN: call i32 @__kmpc_nvptx_teams_reduce_nowait_v2({{.*}}, i32 1024, i64 16,
+
 // Verify descriptor is copied via memcpy and base_ptr is updated in all helpers
 // AMDGCN-LABEL: define internal void @_omp_reduction_shuffle_and_reduce_func
 // AMDGCN: call void @llvm.memcpy{{.*}}(ptr {{.*}}, ptr {{.*}}, i64 {{[0-9]+}}, i1 false)
@@ -111,6 +121,9 @@ module attributes {llvm.target_triple = "nvptx64-nvidia-cuda", omp.is_gpu = true
   }
 }
 
+// NVPTX: @{{.*}}_kernel_environment = {{.*}} %struct.ConfigurationEnvironmentTy { {{.*}}i32 16, i32 1024 }
+// NVPTX: call i32 @__kmpc_nvptx_teams_reduce_nowait_v2({{.*}}, i32 1024, i64 16,
+
 // Verify descriptor is copied via memcpy and base_ptr is updated in all helpers
 // NVPTX-LABEL: define internal void @_omp_reduction_shuffle_and_reduce_func
 // NVPTX: call void @llvm.memcpy{{.*}}(ptr {{.*}}, ptr {{.*}}, i64 {{[0-9]+}}, i1 false)