[llvm] [mlir] [OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref reductions (PR #185460)

Mon Mar 9 10:00:21 PDT 2026

llvmbot wrote:



@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir-openmp

@llvm/pr-subscribers-mlir

Author: Jeff Sandoval (jeffreysandoval)

<details>
<summary>Changes</summary>

The `ReductionDataSize` field in `KernelEnvironmentTy` and the `MaxDataSize` used to compute the `reduce_data_size` argument to `__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer types for by-ref reductions instead of the actual element types. This caused the global teams reduction buffer to be undersized relative to the offsets used by the copy/reduce callbacks, resulting in out-of-bounds accesses faults at runtime.

For example, a by-ref reduction over `[4 x i32]` (16 bytes) would allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the generated callbacks would access 16 bytes per slot.

Fix both computation sites:

1. In MLIR's `getReductionDataSize()`, use `DeclareReductionOp::getByrefElementType()` instead of `getType()` when the reduction is by-ref, so the reduction buffer struct layout (and more importantly its size) matches that emitted by the `OMPIRBuilder`.

2. In `OMPIRBuilder::createReductionsGPU()`, use `ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref reductions when computing `MaxDataSize`.  It seems that `MaxDataSize` isn't actually used in the deviceRTL, but it's better to fix it to avoid future propagation of this bug.

Finally, add CHECK lines to the existing array-descriptor reduction test to verify both the kernel environment `ReductionDataSize` and the `reduce_data_size` call argument reflect the actual element type size.

---
Full diff: https://github.com/llvm/llvm-project/pull/185460.diff


3 Files Affected:

- (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+18-1) 
- (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+9-2) 
- (modified) mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir (+13) 


``````````diff

diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 1d522e9b14c7a..5be83068b3544 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -4375,10 +4375,27 @@ OpenMPIRBuilder::InsertPointOrErrorTy OpenMPIRBuilder::createReductionsGPU(
 
   Value *RL = Builder.CreatePointerBitCastOrAddrSpaceCast(ReductionList, PtrTy);
 
+  // NOTE: ReductionDataSize is passed as the reduce_data_size
+  // argument to __kmpc_nvptx_{parallel,teams}_reduce_nowait_v2, but
+  // the runtime implementations do not currently use it.  The teams
+  // runtime reads ReductionDataSize from KernelEnvironmentTy instead
+  // (set separately via TargetKernelDefaultAttrs).  It is computed
+  // here conservatively as max(element sizes) * N rather than the
+  // exact sum, which over-calculates the size for mixed reduction
+  // types but is harmless given the argument is unused.
+  // TODO: Consider dropping this computation if the runtime API is
+  // ever revised to remove the unused parameter.
   unsigned MaxDataSize = 0;
   SmallVector<Type *> ReductionTypeArgs;
   for (auto En : enumerate(ReductionInfos)) {
-    auto Size = M.getDataLayout().getTypeStoreSize(En.value().ElementType);
+    // Use ByRefElementType for by-ref reductions so that MaxDataSize matches
+    // the actual data size stored in the global reduction buffer, consistent
+    // with the ReductionsBufferTy struct used for GEP offsets below.
+    Type *SizeType =
+        (!IsByRef.empty() && IsByRef[En.index()] && En.value().ByRefElementType)
+            ? En.value().ByRefElementType
+            : En.value().ElementType;
+    auto Size = M.getDataLayout().getTypeStoreSize(SizeType);
     if (Size > MaxDataSize)
       MaxDataSize = Size;
     Type *RedTypeArg = (!IsByRef.empty() && IsByRef[En.index()])
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 3e7a6c88aec3a..0d77e1a1a9c6a 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -6269,8 +6269,15 @@ static uint64_t getReductionDataSize(OpTy &op) {
 
     llvm::SmallVector<mlir::Type> members;
     members.reserve(reductions.size());
-    for (omp::DeclareReductionOp &red : reductions)
-      members.push_back(red.getType());
+    for (omp::DeclareReductionOp &red : reductions) {
+      // For by-ref reductions, use the actual element type rather than the
+      // pointer type so that the buffer size matches the access pattern in
+      // the copy/reduce callbacks generated by OMPIRBuilder.
+      if (red.getByrefElementType())
+        members.push_back(*red.getByrefElementType());
+      else
+        members.push_back(red.getType());
+    }
     Operation *opp = op.getOperation();
     auto structType = mlir::LLVM::LLVMStructType::getLiteral(
         opp->getContext(), members, /*isPacked=*/false);
diff --git a/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir b/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
index 84b4a0e71c36f..663b78261e06c 100644
--- a/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction-array-descriptor.mlir
@@ -48,6 +48,16 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<"dlti.alloca_memory_space" = 5 :
   }
 }
 
+// Verify kernel environment has correct ReductionDataSize for by-ref array
+// reduction.  The by-ref element type is [4 x i32] = 16 bytes, so the
+// struct should be {[4 x i32]} = 16 bytes.  Failing to account for the by-ref
+// indirection would result in a struct of {ptr} = 8 bytes.
+// AMDGCN: @{{.*}}_kernel_environment = {{.*}} %struct.ConfigurationEnvironmentTy { {{.*}}i32 16, i32 1024 }
+
+// Verify the reduce_data_size argument to __kmpc_nvptx_teams_reduce_nowait_v2
+// matches the by-ref element type size (16), not the pointer size (8).
+// AMDGCN: call i32 @__kmpc_nvptx_teams_reduce_nowait_v2({{.*}}, i32 1024, i64 16,
+
 // Verify descriptor is copied via memcpy and base_ptr is updated in all helpers
 // AMDGCN-LABEL: define internal void @_omp_reduction_shuffle_and_reduce_func
 // AMDGCN: call void @llvm.memcpy{{.*}}(ptr {{.*}}, ptr {{.*}}, i64 {{[0-9]+}}, i1 false)
@@ -111,6 +121,9 @@ module attributes {llvm.target_triple = "nvptx64-nvidia-cuda", omp.is_gpu = true
   }
 }
 
+// NVPTX: @{{.*}}_kernel_environment = {{.*}} %struct.ConfigurationEnvironmentTy { {{.*}}i32 16, i32 1024 }
+// NVPTX: call i32 @__kmpc_nvptx_teams_reduce_nowait_v2({{.*}}, i32 1024, i64 16,
+
 // Verify descriptor is copied via memcpy and base_ptr is updated in all helpers
 // NVPTX-LABEL: define internal void @_omp_reduction_shuffle_and_reduce_func
 // NVPTX: call void @llvm.memcpy{{.*}}(ptr {{.*}}, ptr {{.*}}, i64 {{[0-9]+}}, i1 false)

``````````

</details>


https://github.com/llvm/llvm-project/pull/185460