[Mlir-commits] [mlir] [mlir][OpenMP] delayed privatisation for TASK (PR #114785)

Wed Nov 6 02:46:54 PST 2024

================
@@ -1250,6 +1249,79 @@ static LogicalResult allocAndInitializeReductionVars(
   return success();
 }
 
+/// Allocate delayed private variables. Returns the basic block which comes
+/// after all of these allocations. llvm::Value * for each of these private
+/// variables are populated in llvmPrivateVars.
+template <class OP>
+static llvm::Expected<llvm::BasicBlock *>
+allocatePrivateVars(OP opInst, llvm::IRBuilderBase &builder,
+                    LLVM::ModuleTranslation &moduleTranslation,
+                    MutableArrayRef<BlockArgument> privateBlockArgs,
+                    MutableArrayRef<omp::PrivateClauseOp> privateDecls,
+                    llvm::SmallVector<llvm::Value *> &llvmPrivateVars,
+                    const llvm::OpenMPIRBuilder::InsertPointTy &allocaIP) {
+  // Allocate private vars
+  llvm::BranchInst *allocaTerminator =
+      llvm::cast<llvm::BranchInst>(allocaIP.getBlock()->getTerminator());
+  builder.SetInsertPoint(allocaTerminator);
+  assert(allocaTerminator->getNumSuccessors() == 1 &&
+         "This is an unconditional branch created by OpenMPIRBuilder");
+  llvm::BasicBlock *afterAllocas = allocaTerminator->getSuccessor(0);
+
+  // FIXME: Some of the allocation regions do more than just allocating.
+  // They read from their block argument (amongst other non-alloca things).
+  // When OpenMPIRBuilder outlines the parallel region into a different
+  // function it places the loads for live in-values (such as these block
+  // arguments) at the end of the entry block (because the entry block is
+  // assumed to contain only allocas). Therefore, if we put these complicated
+  // alloc blocks in the entry block, these will not dominate the availability
+  // of the live-in values they are using. Fix this by adding a latealloc
+  // block after the entry block to put these in (this also helps to avoid
+  // mixing non-alloca code with allocas).
+  // Alloc regions which do not use the block argument can still be placed in
+  // the entry block (therefore keeping the allocas together).
+  llvm::BasicBlock *privAllocBlock = nullptr;
+  if (!privateBlockArgs.empty())
+    privAllocBlock = splitBB(builder, true, "omp.private.latealloc");
+  for (unsigned i = 0; i < privateBlockArgs.size(); ++i) {
+    Region &allocRegion = privateDecls[i].getAllocRegion();
+
+    // map allocation region block argument
+    llvm::Value *nonPrivateVar =
+        moduleTranslation.lookupValue(opInst.getPrivateVars()[i]);
+    assert(nonPrivateVar);
+    moduleTranslation.mapValue(privateDecls[i].getAllocMoldArg(),
+                               nonPrivateVar);
+
+    // in-place convert the private allocation region
+    SmallVector<llvm::Value *, 1> phis;
+    if (privateDecls[i].getAllocMoldArg().getUses().empty()) {
+      // TODO this should use
+      // allocaIP.getBlock()->getFirstNonPHIOrDbgOrAlloca() so it goes before
+      // the code for fetching the thread id. Not doing this for now to avoid
+      // test churn.
----------------
tblah wrote:

Taking the example from the test:
```
/ CHECK-LABEL: @task..omp_par
// CHECK:       task.alloca:
// CHECK:         %[[VAL_11:.*]] = load ptr, ptr %[[VAL_12:.*]], align 8
// CHECK:         %[[VAL_13:.*]] = getelementptr { ptr }, ptr %[[VAL_11]], i32 0, i32 0
// CHECK:         %[[VAL_14:.*]] = load ptr, ptr %[[VAL_13]], align 8
// CHECK:         %[[VAL_15:.*]] = alloca i32, i64 1, align 4
```

The alloca was placed after a two loads and a getelementptr. It is better practice to put all allocas at the beginning of the entry block. This lets the alloca happen for free as the stack pointer adjustment can be folded into stack pointer adjustment for the function call. When we added the equivalent code for openmp reduction a reviewer felt strongly about this. In my opinion it is most important that the allocas happen in the entry block (so they cannot be in a loop) - which I have done as best as I can here. Getting them at the start of the entry block would be nicer but that might need a few other refactorings.

This code and comment was copied from the existing implementation for omp.parallel so that I could re-use it.

https://github.com/llvm/llvm-project/pull/114785