[Mlir-commits] [flang] [mlir] [mlir][OpenMP] rewrite conversion of privatisation for omp.parallel (PR #111844)

Mon Oct 14 09:18:29 PDT 2024

================
@@ -1421,12 +1450,57 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
             deferredStores, isByRef)))
       bodyGenStatus = failure();
 
+    // Apply copy region for firstprivate.
+    if (!privateBlockArgs.empty()) {
----------------
tblah wrote:

Thanks for the links.

> Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?

Clang doesn't use the separate privatisation callback either (this is equivalent to a no-op in OpenMPIRBuilder): https://github.com/llvm/llvm-project/blob/f3947aaa37f464e05c1a28ce871f5f982c5e2746/clang/lib/CodeGen/CGStmtOpenMP.cpp#L1818

TL;DR: I don't think this PR will make things worse but it would be good to take a closer look at any work needed to enable these optimizations in the future.

My educated guesses on some of the optimizations in the presentation:
- Forwarding communicated/shared values between parallel regions -> I am not sure if this will work for flang but it seems to work using the arguments to the outlined function, which will not be changed by my PR
- Interprocedural constant propagation for outlined parallel regions -> again, the arguments to the outlined function have not changed. The use of the argument should have only changed in the case of the bug I was fixing.
- Attribute deduction also the same as above
- Argument promotion (shared -> firstprivate -> private) -> I don't know how this works. It probably never worked for code generated by flang.
- Parallel region expansion and barrier elimination -> I imagine this should work. It should only need to look at the libcalls generated by OpenMPIRBuilder and maybe stitch some parallel regions together.
- Communication optimization and value transfer -> this should be unaffected by my patch -> it depends on how variables are passed into and out of parallel regions (by ref or by val)

I can't really comment on device code and cross host-device IPO as I'm very unfamiliar with that pipeline and IIUC there's a lot of code yet to be upstreamed.

> FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.

There are already a few places we diverge (e.g. the reduction init block(s)). Plus the differences that are inherent such as the much higher complexity of Fortran datatypes allowed in privatization and reduction (allocatable arrays etc). Maybe future work could unpack Fortran runtime descriptors in cases where they are not needed so that it looks more like a C array. Currently Flang reductions and privatization force arrays and allocatables to be boxed. 

The runtime call deduplication mentioned on the webpage ought to still work. We still generate runtime calls using OpenMPIRBuilder and the parallel region is still inside of the outlined function.

The webpage also mentions globalization for data shared between different threads on a GPU. I don't know much about the GPU flow.

https://github.com/llvm/llvm-project/pull/111844