[mlir] [llvm] [OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget (PR #78818)

via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 5 06:51:00 PST 2024


https://github.com/agozillon updated https://github.com/llvm/llvm-project/pull/78818

>From 0a18964af61a4d538b46f64a0fcf847a94b7bfad Mon Sep 17 00:00:00 2001
From: Andrew Gozillon <Andrew.Gozillon at amd.com>
Date: Fri, 19 Jan 2024 18:17:01 -0600
Subject: [PATCH 1/2] [OpenMP][MLIR][OMPIRBuilder] Add a small optional
 constant alloca raise function pass to finalize, utilised in convertTarget

This patch seeks to add a mechanism to raise constant
(not ConstantExpr or runtime/dynamic) sized allocations
into the entry block for select functions that have been
inserted into a list for processing. This processing occurs
during the finalize call, after OutlinedInfo regions have
completed. This currently has only been utilised for
createOutlinedFunction, which is triggered for
TargetOp generation in the OpenMP MLIR dialect
lowering to LLVM-IR.

This currently is required for Target kernels generated by
createOutlinedFunction to avoid subsequent optimisation
passes doing some unintentional malformed optimisaitions
for AMD kernels (unsure if it occurs for other vendors). If
the allocas are generated inside of the kernel and are
not in the entry block and are subsequently passed to a
function this can lead to required instructions being
erased or manipulated in a way that causes the kernel
to run into a HSA access error.

This fix is related to a series of problems found in:
https://github.com/llvm/llvm-project/issues/74603

This problem primarily presents itself for Flang's HLFIR
AssignOp currently, when utilised with a scalar temporary
constant on the RHS and a descriptor type on the LHS. It
will generate a call to a runtime function, wrap the RHS
temporary in a newly allocated descriptor (an llvm
struct), and pass both the LHS and RHS descriptor into
the runtime function call. This will currently be
embedded into the middle of the target region in the
user entry block, which means the allocas are also
embedded in the middle, which seems to pose
issues when later passes are executed. This issue
may present itself in other HLFIR operations or
unrelated operations that generate allocas as a by
product, but for the moment, this one test case is
the only scenario i've found this problem.

Perhaps this is not the appropriate fix, I am very open to other
suggestions, I've tried a few others (at varying levels of the
flang/mlir compiler flow), but this one is the smallest and least
intrusive changeset. The other two, that come to mind (but I've
not fully looked into, the former I tried a little with blocks but it
had a few issues I'd need to think through):
*  Having a proper alloca only block (or region) generated for TargetOps
   that we could merge into the entry block that's generated by
   convertTarget's createOutlinedFunction.
* Or diverging a little from Clang's current target generation and using
  the CodeExtractor to generate the user code as an outlined function
  region invoked from the kernel we make, with our kernel arguments
  passed into it. Similar to the current parallel generation. I am not sure
  how well this would intermingle with the existing parallel generation
  though that's layered in.

Both of these methods seem like quite a divergeance from the current
status quo, which I am not entirely sure is meritted for the small test
this change aims to fix.
---
 .../llvm/Frontend/OpenMP/OMPIRBuilder.h       | 12 +++++
 llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp     | 52 +++++++++++++++++++
 .../omptarget-constant-alloca-raise.mlir      | 43 +++++++++++++++
 3 files changed, 107 insertions(+)
 create mode 100644 mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir

diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 2288969ecc95c..6526c1dc7b7c4 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1504,6 +1504,11 @@ class OpenMPIRBuilder {
   /// Collection of regions that need to be outlined during finalization.
   SmallVector<OutlineInfo, 16> OutlineInfos;
 
+  /// A collection of candidate target functions that's constant allocas will
+  /// attempt to be raised on a call of finalize after all currently enqueued
+  /// outline info's have been processed.
+  SmallVector<llvm::Function *, 16> ConstantAllocaRaiseCandidates;
+
   /// Collection of owned canonical loop objects that eventually need to be
   /// free'd.
   std::forward_list<CanonicalLoopInfo> LoopInfos;
@@ -1511,6 +1516,13 @@ class OpenMPIRBuilder {
   /// Add a new region that will be outlined later.
   void addOutlineInfo(OutlineInfo &&OI) { OutlineInfos.emplace_back(OI); }
 
+  /// Add a function that's constant allocas will attempt to be raised on a
+  /// call of finalize after all currently enqueued outline info's have been
+  /// processed.
+  void addConstantAllocaRaiseCandidates(Function *F) {
+    ConstantAllocaRaiseCandidates.emplace_back(F);
+  }
+
   /// An ordered map of auto-generated variables to their unique names.
   /// It stores variables with the following names: 1) ".gomp_critical_user_" +
   /// <critical_section_name> + ".var" for "omp critical" directives; 2)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 02b333e9ccd56..7723dfb509266 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -633,6 +633,30 @@ Function *OpenMPIRBuilder::getOrCreateRuntimeFunctionPtr(RuntimeFunction FnID) {
 
 void OpenMPIRBuilder::initialize() { initializeTypes(M); }
 
+static void raiseUserConstantDataAllocasToEntryBlock(IRBuilderBase &Builder,
+                                                     Function *Function) {
+  BasicBlock &EntryBlock = Function->getEntryBlock();
+  Instruction *MoveLocInst = EntryBlock.getFirstNonPHI();
+
+  // Loop over blocks looking for allocas, skip the entry block allocas here are
+  // in the appropriate place.
+  for (auto Block = std::next(Function->begin(), 1); Block != Function->end();
+       Block++) {
+    for (auto Inst = Block->getReverseIterator()->begin();
+         Inst != Block->getReverseIterator()->end();) {
+      if (auto *AllocaInst =
+              llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
+        Inst++;
+        if (!isa<ConstantData>(AllocaInst->getArraySize()))
+          continue;
+        AllocaInst->moveBeforePreserving(MoveLocInst);
+      } else {
+        Inst++;
+      }
+    }
+  }
+}
+
 void OpenMPIRBuilder::finalize(Function *Fn) {
   SmallPtrSet<BasicBlock *, 32> ParallelRegionBlockSet;
   SmallVector<BasicBlock *, 32> Blocks;
@@ -737,6 +761,28 @@ void OpenMPIRBuilder::finalize(Function *Fn) {
   // Remove work items that have been completed.
   OutlineInfos = std::move(DeferredOutlines);
 
+  // The createTarget functions embeds user written code into
+  // the target region which may inject allocas which need to
+  // be moved to the entry block of our target or risk malformed
+  // optimisations by later passes, this is only relevant for
+  // the device pass which appears to be a little more delicate
+  // when it comes to optimisations (however, we do not block on
+  // that here, it's up to the inserter to the list to do so).
+  // This notbaly has to occur after the OutlinedInfo candidates
+  // have been extracted so we have an end product that will not
+  // be implicitly adversely affected by any raises unless
+  // intentionally appended to the list.
+  // NOTE: This only does so for ConstantData, it could be extended
+  // to ConstantExpr's with further effort, however, they should
+  // largely be folded when they get here. Extending it to runtime
+  // defined/read+writeable allocation sizes would be non-trivial
+  // (need to factor in movement of any stores to variables the
+  // allocation size depends on, as well as the usual loads,
+  // otherwise it'll yield the wrong result after movement) and
+  // likely be more suitable as an LLVM optimisation pass.
+  for (Function *F : ConstantAllocaRaiseCandidates)
+    raiseUserConstantDataAllocasToEntryBlock(Builder, F);
+
   EmitMetadataErrorReportFunctionTy &&ErrorReportFn =
       [](EmitMetadataErrorKind Kind,
          const TargetRegionEntryInfo &EntryInfo) -> void {
@@ -5043,6 +5089,12 @@ static Function *createOutlinedFunction(
 
   BasicBlock *UserCodeEntryBB = Builder.GetInsertBlock();
 
+  // As we embed the user code in the middle of our target region after we
+  // generate entry code, we must move what allocas we can into the entry
+  // block to avoid possible breaking optimisations for device
+  if (OMPBuilder.Config.isTargetDevice())
+    OMPBuilder.addConstantAllocaRaiseCandidates(Func);
+
   // Insert target deinit call in the device compilation pass.
   Builder.restoreIP(CBFunc(Builder.saveIP(), Builder.saveIP()));
   if (OMPBuilder.Config.isTargetDevice())
diff --git a/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
new file mode 100644
index 0000000000000..be521fbe1f01a
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/omptarget-constant-alloca-raise.mlir
@@ -0,0 +1,43 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+// A small condensed version of a problem requiring constant alloca raising in
+// Target Region Entries for user injected code, found in an issue in the Flang
+// compiler. Certain LLVM IR optimisation passes will perform runtime breaking 
+// transformations on allocations not found to be in the entry block, current
+// OpenMP dialect lowering of TargetOp's will inject user allocations after
+// compiler generated entry code, in a seperate block, this test checks that
+// a small function which attempts to raise some of these (specifically 
+// constant sized) allocations performs its task reasonably in these 
+// scenarios. 
+
+module attributes {omp.is_target_device = true} {
+  llvm.func @_QQmain() attributes {omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
+    %1 = llvm.mlir.constant(1 : i64) : i64
+    %2 = llvm.alloca %1 x !llvm.struct<(ptr)> : (i64) -> !llvm.ptr
+    %3 = omp.map_info var_ptr(%2 : !llvm.ptr, !llvm.struct<(ptr)>) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
+    omp.target map_entries(%3 -> %arg0 : !llvm.ptr) {
+    ^bb0(%arg0: !llvm.ptr):
+      %4 = llvm.mlir.constant(1 : i32) : i32
+      %5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
+      %6 = llvm.mlir.constant(50 : i32) : i32
+      %7 = llvm.mlir.constant(1 : i64) : i64
+      %8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
+      llvm.store %6, %8 : i32, !llvm.ptr
+      %9 = llvm.mlir.undef : !llvm.struct<(ptr)>
+      %10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)> 
+      llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
+      %88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+      omp.terminator
+    }
+    llvm.return
+  }
+  llvm.func @_ExternalCall(!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
+}
+
+// CHECK:      define weak_odr protected void @{{.*}}QQmain_l{{.*}}({{.*}}, {{.*}}) {
+// CHECK-NEXT: entry:
+// CHECK-NEXT:  %[[MOVED_ALLOCA1:.*]] = alloca { ptr }, align 8
+// CHECK-NEXT:  %[[MOVED_ALLOCA2:.*]] = alloca i32, i64 1, align 4
+// CHECK-NEXT:  %[[MAP_ARG_ALLOCA:.*]] = alloca ptr, align 8
+
+// CHECK: user_code.entry:                                  ; preds = %entry

>From f568ba2ab00a42bb5fdc7599ce60f03955605f4e Mon Sep 17 00:00:00 2001
From: Andrew Gozillon <Andrew.Gozillon at amd.com>
Date: Mon, 5 Feb 2024 08:49:53 -0600
Subject: [PATCH 2/2] [NFC] Small update based on reviewer comments

- Fix badly written comment
- Remove unneccesary function
- remove unrequired llvm:: prefix
---
 llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h | 7 -------
 llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp        | 9 ++++-----
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 6526c1dc7b7c4..448fcd25ce76b 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1516,13 +1516,6 @@ class OpenMPIRBuilder {
   /// Add a new region that will be outlined later.
   void addOutlineInfo(OutlineInfo &&OI) { OutlineInfos.emplace_back(OI); }
 
-  /// Add a function that's constant allocas will attempt to be raised on a
-  /// call of finalize after all currently enqueued outline info's have been
-  /// processed.
-  void addConstantAllocaRaiseCandidates(Function *F) {
-    ConstantAllocaRaiseCandidates.emplace_back(F);
-  }
-
   /// An ordered map of auto-generated variables to their unique names.
   /// It stores variables with the following names: 1) ".gomp_critical_user_" +
   /// <critical_section_name> + ".var" for "omp critical" directives; 2)
diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 7723dfb509266..a639439c893cd 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -638,14 +638,13 @@ static void raiseUserConstantDataAllocasToEntryBlock(IRBuilderBase &Builder,
   BasicBlock &EntryBlock = Function->getEntryBlock();
   Instruction *MoveLocInst = EntryBlock.getFirstNonPHI();
 
-  // Loop over blocks looking for allocas, skip the entry block allocas here are
-  // in the appropriate place.
+  // Loop over blocks looking for constant allocas, skipping the entry block
+  // as any allocas there are already in the desired location.
   for (auto Block = std::next(Function->begin(), 1); Block != Function->end();
        Block++) {
     for (auto Inst = Block->getReverseIterator()->begin();
          Inst != Block->getReverseIterator()->end();) {
-      if (auto *AllocaInst =
-              llvm::dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
+      if (auto *AllocaInst = dyn_cast_if_present<llvm::AllocaInst>(Inst)) {
         Inst++;
         if (!isa<ConstantData>(AllocaInst->getArraySize()))
           continue;
@@ -5093,7 +5092,7 @@ static Function *createOutlinedFunction(
   // generate entry code, we must move what allocas we can into the entry
   // block to avoid possible breaking optimisations for device
   if (OMPBuilder.Config.isTargetDevice())
-    OMPBuilder.addConstantAllocaRaiseCandidates(Func);
+    OMPBuilder.ConstantAllocaRaiseCandidates.emplace_back(Func);
 
   // Insert target deinit call in the device compilation pass.
   Builder.restoreIP(CBFunc(Builder.saveIP(), Builder.saveIP()));



More information about the llvm-commits mailing list