[mlir] [llvm] [OpenMP][MLIR][OMPIRBuilder] Add a small optional constant alloca raise function pass to finalize, utilised in convertTarget (PR #78818)

Tue Feb 6 08:31:58 PST 2024

agozillon wrote:

> > The relevant location is inside of the TargetOp, the rest is largely irrelevant for device but included for completeness. The HLFIR AssignOp will generate an allocation as part of it's lowering from my understanding (in this particular use case at least, there's different lowering's for it). Although, I wouldn't say the problem lies with the HLFIR operation necessarily, more just a side affect of how we lower to a target region for LLVM-IR being shown I think (maybe there's a way to make it more AMDGPU runtime friendly, or a different runtime function for AMDGPU, but I'm doubtful that's the ideal solution). Even if we make sure to raise all AllocaOp's (when we lower to the LLVM Dialect this is done currently, but TargetOp doesn't generate a new Blocks, so there is no isolated Entry Block as such) to the top this would still persist, as we inevitably embed the user code into a seperate block in-between some kernel entry code for the arguments that will branch off to a fail condition or the user code block containing the allocations, and then a later pass (not sure which yet unfortunately) will try to do some magic and end up breaking the generated executable.
> 
> Is your observation that the OpenMPIRBuilder creates a separate `Alloca` block and this `Alloca` block will become the entry block in LLVM IR. So even if we move all the allocas to the entry block in FIR/HLFIR/LLVM dialect MLIR, it still won't be the entry block in LLVM IR?

That's correct unfortunately with our current lowering method (which I believe is just how we traditionally generate the kernels from what I can gather from code comments) we generate an initial entry point that does some load in/setup of kernel input and then performs a libomptarget runtime invocation for initialization (__kmpc_target_init) before branching into a fail condition or the user entry code, which is effectively the lowered MLIR code in the region. So any allocas in the region will still be emitted outside of the alloca block currently. 

> Is it possible to merge the `Alloca` block to the succeeding block (which is hopefully the entry block at the MLIR dialects)? And if we have lifted in MLIR all to the entry block this will all work out fine?

I thought about doing this, but there's a few issues (some may be simpler to overcome than I think or I maybe my thinking is just bad) I thought of that might get in the way of it:

1) I don't think we can just merge the MLIR generated block into the front of the newly generated OpenMPIRBuilder target block as we could end up with out of synch operations, if anything in the MLIR generated block depends on any of the input arguments. 
2) The inverse of merging at the bottom of the block becomes a bit of an issue due to the kmpc_target_init call into a branch, as we can't meld it directly behind the branch. 
3) We could maybe splice it into the OpenMPIRBuilder generated entry block, above the call to kmpc_target_init and branch below the argument entry code. However, we actually do not really have the concept of an strict alloca block when it comes to TargetOp (or any of the OpenMP region constructs from what I can tell), we simply prepend allocas to the top of the target region, and this primarily seems to be enforced in the FIR -> LLVM dialect lowering, the HLFIR/FIR CodeGen seems less strict on this. Enforcing a specific block at the FIR -> LLVM dialect lowering level didn't seem particularly hard when I tried it (but I didn't test it extensively). However, as I recall after doing this I found that it's not so easy to reliably find this alloca block when we're trying to merge them, as we can' t guarantee the first generated LLVM-IR block is the alloca block. So we'd need some way to identify the block for merging. 

We also need to be a tad careful when doing this as we have the nested parallel operations to consider that also do their own alloca movements and extraction, it's why I opted to defer the raising to the finalize method and restrict it solely to the Target kernel entry function. 

It's been a little bit since I looked into this in detail now, so hopefully I am recollecting the above well enough! I am happy to look into it again to recollect more details if that's something we'd be interested in doing. 

https://github.com/llvm/llvm-project/pull/78818