[llvm] [mlir] [OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget (PR #78818)

Tue Feb 13 12:54:39 PST 2024

================
@@ -0,0 +1,43 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+// A small condensed version of a problem requiring constant alloca raising in
+// Target Region Entries for user injected code, found in an issue in the Flang
+// compiler. Certain LLVM IR optimisation passes will perform runtime breaking 
+// transformations on allocations not found to be in the entry block, current
+// OpenMP dialect lowering of TargetOp's will inject user allocations after
+// compiler generated entry code, in a seperate block, this test checks that
+// a small function which attempts to raise some of these (specifically 
+// constant sized) allocations performs its task reasonably in these 
+// scenarios. 
+
+module attributes {omp.is_target_device = true} {
+  llvm.func @_QQmain() attributes {omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
+    %1 = llvm.mlir.constant(1 : i64) : i64
+    %2 = llvm.alloca %1 x !llvm.struct<(ptr)> : (i64) -> !llvm.ptr
+    %3 = omp.map_info var_ptr(%2 : !llvm.ptr, !llvm.struct<(ptr)>) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
+    omp.target map_entries(%3 -> %arg0 : !llvm.ptr) {
+    ^bb0(%arg0: !llvm.ptr):
+      %4 = llvm.mlir.constant(1 : i32) : i32
+      %5 = llvm.alloca %4 x !llvm.struct<(ptr)> {alignment = 8 : i64} : (i32) -> !llvm.ptr
+      %6 = llvm.mlir.constant(50 : i32) : i32
+      %7 = llvm.mlir.constant(1 : i64) : i64
+      %8 = llvm.alloca %7 x i32 : (i64) -> !llvm.ptr
+      llvm.store %6, %8 : i32, !llvm.ptr
+      %9 = llvm.mlir.undef : !llvm.struct<(ptr)>
+      %10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr)> 
+      llvm.store %10, %5 : !llvm.struct<(ptr)>, !llvm.ptr
+      %88 = llvm.call @_ExternalCall(%arg0, %5) : (!llvm.ptr, !llvm.ptr) -> !llvm.struct<()>
----------------
agozillon wrote:


Thank you for the great question @kiranchandramohan and I encourage anyone else that has any thoughts on possible scenarios that this might break to also bring things up where there may be concerns. My LLVM knowledge is still very much a WIP.

In this case it would move the alloca to the entry block whether or not it's in a loop related series of blocks. From an MLIR perspective currently I don't think it's possible to have a series of loop blocks around the initial basic block of `TargetOp `(but I may be wrong on this and would love to know a scenario where this is possible to test), although, it is possible to have blocks chained from it relating to loops as normal. I believe even in the case of `omp.parallel`/`omp.wsloop` it's internal to the initial basic block, and this code actively avoids manipulating this kind of nested region currently, the entirety of the parallel do should be outlined into a seperate function by the time we try to raise allocations, the raising aims to only happen inside of the kernel entry function (`TargetOp`) currently, the parallel do has it's own allocation raising/sinking that occurs via the `CodeExtractor `that we need to avoid tampering with (as it will break it).

>From a Fortran perspective (as someone who is still very bad at writing Fortran) I am not sure a user can inject an alloca directly into the `TargetOp`, from my understanding you can only declare a variable at the top of a function or in a module. So you can't quite do what you can in C++ where as an example you declare an integer directly in the OpenMP target region which will become an alloca. 

However, as we do still need to consider this being used in C/C++ where it is possible, if we look at the below C/C++ examples where I've tried to come up with some breaking scenarios (perhaps my logic isn't quite sound and there is better examples that would cause breakage, in which case please do bring them up!):

Version 1

```
int test = 0;
#pragma omp target map(tofrom : test)
{
  for (int i = 0; i < 10; i++) {
    int j = 10;
    j += 10;
    test += j;
  }
}
```

Version 2

```
int test = 0;
#pragma omp target map(tofrom : test)
{
  int j = 10;
  for (int i = 0; i < 10; i++) {
    j += 10;
    test += j;
  }
}
```

In both cases, when we emit the device IR the alloca for `j` is moved to the entry block alongside `i`, and the correct results are achieved, so I believe it is still semantically equivalent. Without the initialization of `j` with 10 we get the same UB in both cases as well.

The main case where I can see it being an issue (and there may be others, which I would love to hear if someone can think of them so I can investigate) is if we have an allocation that is size dependent on some loop dependent variable, making it a dynamic allocation, however, this PR doesn't tackle this case (and doesn't handle constant expressions either, it needs them folded into a constant value currently), it'd be a lot more complex to deal with and Flang hasn't presented a case where we need to handle this scenario yet (hopefully never). If we get to the point of requiring this it'd require a significantly more complex solution I think.

However, disclaimer again, I am not the most LLVM savvy person yet, so anyone that may be reading, please do feel free to mention cases that it may be an issue that you can think of or if any of the above may be incorrect.  

https://github.com/llvm/llvm-project/pull/78818