[PATCH] D88706: [OpenMP][MLIR] WIP : Fix for nested parallel region

Fri Oct 2 13:47:52 PDT 2020

fghanim added a comment.

> While debugging the nested parallel region issues, we saw some difference between where the allocas are placed by the OpenMP IRBuilder in the clang usage and the MLIR usage. Moving to the version where the OpenMP IRBuilder maintains allocaIP fixed the difference. In MLIR module translation there is no alloca insertion point. So what we can provide as allocaIP is the current insertionPoint.

Yes, this is the expected behavior. in clang we follow the convention of setting the insertion point for `alloca`s to proper entry block, and let clang handle all of the none OMP code generation, and we let clang pass to us where it wants us to put its `alloca` instructions. Here, you are passing the current insertion point of your IRBuilder at the time the `bodyCB` is called, and all `alloca`s are output there.

> Is that OK?

Depends on your goal. is it illegal? no it is not. you can put allocas wherever you want
Does it adversely affect performance as you point out in a later comment? absolutely
Is it possible to have an effect on correctness? possible. Please check below for why I believe it may.

> What is the allocaIP used for?

to specify where you want all `alloca`s to go.

> Why do we need a separate allocaIP, why cannot we treat it like a normal instruction?

two reasons:

1. there are allocas that go inside the outlined region, and allocas that go outside of it, and we need to be able to chose when to codegen each.
2. We try to generate `alloca`s into the entry block of the relevant function. and treating it like a normal instruction means we cannot.

> Is it because all alloca instructions should be together at the top of the function?

yes, but as I mention above, not just that.

> The langref for alloca did not have such a requirement.

It is a requirement for optimization and possible the backend, not for correctness of IR which is what the langref is concerned about.

> I have created another review without the alloca changes and that also works correctly for the nested parallel case.
> https://reviews.llvm.org/D88720

I saw that this patch resolved the problem you were seeing. I am glad it did. It worked, because,  as I and JDoerfert pointed out in the original patch, your problem had nothing to do with `alloca` locations but has everything to do with your outlined region not being completely contained within the entry and exit blocks of the `parallel` region. D88720 <https://reviews.llvm.org/D88720> seems to have fixed that.

> I belive all allocas in the LLVM dialect will also be in the entry block of the OpenMP operation. But these will added only in the bodygen call-back. So they will be added to omp.par.region (actually omp.par.region1 since a dummy branch is created). But all these are trivial branches (not conditional) and can't they be inlined into the entry block if required? See example below for,
> //{ ... code ...}//

Sure. but what happens when you generate `Copyin` for example? the allocas in the `body` are very likely going to end up after the `copyin` CFG structure, which means these are not going to be in the entry block. (to get what I mean check the `createcopyinblocks` in the `OMPBuilder`)
Furthermore, while there is the `CFGSimplify` pass which should remove all extra branches, are you guaranteed to run that everytime? what happens when you run the frontend with -O0?

Here is a question regarding allocas in the llvm dialect; by the time they are translated to llvm ir proper, are they already located in the entry block in the dialect itself, or are they located in different places but you have something like an `AllocaIP` where you move allocas to entry block during the translation to proper llvm IR?
What is FIR planning to do about their stack allocations? if they are not guaranteeing it's in the entry BB then that's an issue.
is OMP-IR meant to also work alongside a future C/C++ dialect? If yes, then clang people are very unlikely to be ok with not having `alloca`s in the entry block, and OMP-IR needs to follow the conventions of the relevant frontend when it comes variable declarations, etc., regardless of what that frontend is.

> Found that they should be in the entry block for optimizations. @fghanim is this what you are suggesting?
> http://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

`mem2reg` pass will look for all `alloca`s in a function and move them into entry block. However, when you run a frontend with -O0, we don't run any optimization passes. I remember working with llvm passes that ignored `Alloca`s not in the entry block. so yes, it is about optimizations but not only that. Many of the various backends take advantage of the fact that `alloca`s are in the entry block to reserve stack space upon entry into a function. Unless you can guarantee that everyone of those has a way to handle not having all `alloca`s in entry block, we should guarantee that for them - at which point it's a correctness issue.

---

To be clear; I am **NOT** against keeping track of `Alloca` insertion points in the OMPBuilder if there is a reason to. As can be seen in D82470 <https://reviews.llvm.org/D82470> where I suggested multiple other ways to do so. However, I have a huge problem with creating a special IRBuilder just to do so. 
I do have a small preference towards not keeping state of anything that we don't need, just out of consistency with other IRBuilders (i.e. you tell an IRBuilder what do you want it to do and where to do it, rather than it knowing that).

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88706/new/

https://reviews.llvm.org/D88706