[llvm-dev] About OpenMP dialect in MLIR

Thu Feb 13 10:18:20 PST 2020

Hi Vinay,

Thanks for taking an interest and for the detailed discussion.

Let me start by picking a few paragraphs from your email to clarify a
couple of things that led to the current design or that might otherwise
need clarification. We can talk about other points later as well.

[
  Side notes:
    1) I'm not an MLIR person.
    2) It seems unfortunate that we do not have an mlir-dev list.
]

> 1. With the current design, the number of transformations / optimizations
> that one can write on OpenMP constructs would become limited as there can
> be any custom loop structure with custom operations / types inside it.

OpenMP, as an input language, does not make many assumptions about the
code inside of constructs*. So, inside a parallel can be almost anything
the base language has to offer, both lexically and dynamically.
Assuming otherwise is not going to work. Analyzing a "generic" OpenMP
representation in order to determine if it can be represented as a more
restricted "op" seems at least plausible. You will run into various
issues, some mentioned explicitly below. For starters, you still have to
generate proper OpenMP runtime calls, e.g., from your GPU dialect, even
if it is "just" to make sure the OMPD/OMPT interfaces expose useful
information.

* I exclude the `omp loop` construct here as it is not even implemented
  anywhere as far as I know.

> 2. It would also be easier to transform the Loop nests containing OpenMP
> constructs if the body of the OpenMP operations is well defined (i.e., does
> not accept arbitrary loop structures). Having nested redundant "parallel",
> "target" and "do" regions seems unnecessary.

As mentioned above, you cannot start with the assumption OpenMP input is
structured this way. You have to analyze it first. This is the same
reason we cannot simply transform C/C++ `for loops` into `affine.for`
without proper analysis of the loop body.
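
To illustrate with a minimal sketch (the function names are made up for
this example): both loops below use the same C `for` syntax, but only
the first has a trip count and subscripts that depend on nothing but the
bounds. The second changes the induction variable in the body, so it
cannot be mapped to an `affine.for` directly, and you only find that out
by looking at the body.

```c
#include <stddef.h>

/* Bounds and subscripts depend only on i and n, and i is never written
 * in the body: a plausible candidate for an affine-style loop. */
long sum_affine(const int *a, size_t n) {
  long s = 0;
  for (size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* Same loop header, but the body advances i based on the data, so the
 * set of executed iterations is not known from the header alone. */
long sum_skipping(const int *a, size_t n) {
  long s = 0;
  for (size_t i = 0; i < n; ++i) {
    s += a[i];
    if (a[i] < 0)
      ++i; /* skip the element that follows a negative one */
  }
  return s;
}
```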

Now, more concretely. Nested parallel and target regions are not
necessarily redundant, nor can/should we require the user not to have
them. Nested parallelism can easily make sense, depending on the problem
decomposition. Nested target will make a lot of sense with reverse
offload, which is already in the standard, and it also should be allowed
for the sake of a modular (user) code base.
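
A small sketch of how nesting arises from modular code (hypothetical
helper names, not taken from any real code base): `row_sum` is written
as a self-contained parallel routine, and `matrix_sum` calls it from
inside its own parallel loop. Neither region is redundant from the
point of view of its author, and whether the inner region gets more
than one thread is a runtime decision.

```c
/* Library-style helper that is parallel on its own. */
static long row_sum(const int *row, int n) {
  long s = 0;
#pragma omp parallel for reduction(+ : s)
  for (int j = 0; j < n; ++j)
    s += row[j];
  return s;
}

/* Caller that is also parallel; each call to row_sum() creates a
 * nested parallel region inside the outer one. */
long matrix_sum(const int *m, int rows, int cols) {
  long total = 0;
#pragma omp parallel for reduction(+ : total)
  for (int i = 0; i < rows; ++i)
    total += row_sum(m + (long)i * cols, cols);
  return total;
}
```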

> 3. There would also be new sets of loop structures in new dialects when
> C/C++ is compiled to MLIR. It would complicate the number of possible
> combinations inside the OpenMP region.

Is anyone working on this? If so, what is the timeline? I personally was
not expecting Clang to switch over to MLIR any time soon but I am happy
if someone wants to correct me on this. I mention this only because it
interacts with the arguments I will make below.

> E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
> lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
> able to compile the code for heterogeneous hardware is one of the biggest
> advantages that MLIR brings to the table. That is being completely missed
> here. This also requires solving the problem of handling target information
> in MLIR. But that is a problem which needs to be solved anyway. Using GPU
> dialect also gives us an opportunity to represent offloading semantics in
> MLIR.

I'm unsure what the problem with "handling target information in MLIR" is but
whatever design we end up with, we need to know about the target
(triple) in all stages of the pipeline, even if it is just to pass it
down.

> Given the ability to represent multiple ModuleOps and the existence of GPU
> dialect, couldn't higher level optimizations on offloaded code be done at
> MLIR level? The proposed design would lead us to the same problems that we
> are currently facing in LLVM IR.
>
> Also, OpenMP codegen will automatically benefit from the GPU dialect based
> optimizations. For example, it would be way easier to hoist a memory
> reference out of GPU kernel in MLIR than in LLVM IR.

While I agree with the premise that you can potentially reuse MLIR
transformations, it might not be as simple in practice.

As mentioned above, you cannot assume much about OpenMP codes, almost
nothing for a lot of application codes I have seen. Some examples:

If you have a function call, or any synchronization event for that
matter, located between two otherwise adjacent target regions (see
below), you cannot assume the two target regions will be offloaded to
the same device.
```
  #pragma omp target
  {}
  foo();
  #pragma omp target
  {}
```
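
One way `foo()` can invalidate the assumption, as a hedged sketch: it
may change the default device between the two regions. The body of
`foo` here is made up for illustration, and the `_OPENMP` guards are
only there so the file also builds without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical callee: selects another default device if one exists. */
static void foo(void) {
#ifdef _OPENMP
  if (omp_get_num_devices() > 1)
    omp_set_default_device(1);
#endif
}

int two_target_regions(void) {
  int a = 0, b = 0;
#pragma omp target map(tofrom : a)
  { a = 1; }
  foo(); /* the default device may be different after this call */
#pragma omp target map(tofrom : b)
  { b = 2; }
  return a + b;
}
```

The result is the same either way, but a transformation that merges
the two regions would have to prove that `foo()` does nothing of this
kind.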
Similarly, you cannot assume an `omp parallel` is allowed to be executed
with more than a single thread, or that an `omp [parallel] for` does not
have loop carried data-dependences, ...
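
Both cases are easy to write in valid OpenMP. A minimal sketch (function
names are made up for this example): the first loop carries a dependence
from iteration i-1 to i, which the `ordered` construct makes legal, and
the second region is forced to run with a single thread via `if(0)`.

```c
/* In-place prefix sum. Iteration i reads what iteration i-1 wrote; the
 * ordered region executes in iteration order, so this is well defined. */
void prefix_sum(int *a, int n) {
#pragma omp parallel for ordered schedule(static, 1)
  for (int i = 1; i < n; ++i) {
#pragma omp ordered
    a[i] += a[i - 1];
  }
}

/* if(0) makes the parallel region execute with a team of one thread,
 * so the unsynchronized update below is not a race. */
int serialized_region(void) {
  int visits = 0;
#pragma omp parallel if (0)
  { visits += 1; }
  return visits;
}
```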
Data-sharing attributes are also something that has to be treated
carefully:
```
x = 5;
#pragma omp task
  x = 3;
print(x);
```
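This should print 5, not 3: unless x is shared in the enclosing context,
it is implicitly firstprivate in the task, so the assignment only
changes the task's own copy.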
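
A compilable version of the snippet, as a sketch with made-up function
names. It assumes x is a local variable and that the function is called
outside of any parallel region. It has to be built with OpenMP enabled
(e.g., -fopenmp); without it the pragmas are ignored and `task_default`
returns 3.

```c
/* No data-sharing clause: x is implicitly firstprivate in the task,
 * so the write goes to the task's copy and the caller still sees 5. */
int task_default(void) {
  int x = 5;
#pragma omp task
  { x = 3; }
#pragma omp taskwait
  return x;
}

/* With an explicit shared(x) the task writes the original variable,
 * and the caller sees 3 after the taskwait. */
int task_shared(void) {
  int x = 5;
#pragma omp task shared(x)
  { x = 3; }
#pragma omp taskwait
  return x;
}
```

A lowering that treats the task body as plain sequential code gets the
first function wrong.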

I hope I convinced you that OpenMP is not trivially mappable to existing
dialects without proper analysis. If not, please let me know why you
expect it to be.

Now when it comes to code analyses, LLVM-IR offers a variety of
interesting features, ranging from a mature set of passes to the
cross-language LTO capabilities. We are working on the missing parts,
e.g., heterogeneous llvm::Modules, as we speak. Simple OpenMP
optimizations are already present in LLVM and interesting ones have been
prototyped for a while now (let me know if you want to see more
not-yet-merged patches/optimizations). I also have papers, results, and
talks that might be interesting here. Let me know if you need pointers
to them.

Cheers,
  Johannes

On 02/13, Vinay Madhusudan via llvm-dev wrote:
> Hi,
> 
> I have a few questions / concerns regarding the design of OpenMP dialect in
> MLIR that is currently being implemented, mainly for the f18 compiler.
> Below, I summarize the current state of various efforts in clang / f18 /
> MLIR / LLVM regarding this. Feel free to add to the list in case I have
> missed something.
> 
> 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang
> frontends. Note that this proposal was before considering MLIR for FIR.
> 
> a. llvm-dev proposal :
> http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html
> 
> b. Patches in review: https://reviews.llvm.org/D70290. This also includes
> the clang codegen changes.
> 
> 2.  [July - September 2019] OpenMP dialect for MLIR was discussed /
> proposed with respect to the f18 compilation stack (keeping FIR in mind).
> 
> a. flang-dev discussion link:
> https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html
> 
> b. Design decisions captured in PPT:
> https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view
> 
> c. MLIR google groups discussion:
> https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw
> 
> d. Target constructs  design:
> http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html
> 
> e. SIMD constructs design:
> http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html
> 
> 3.  [Jan 2020] OpenMP dialect RFC in llvm discourse :
> https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397
> 
> 4.  [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR:
> 
> a. The first patch which introduces the OpenMP dialect was pushed.
> 
> b. Review of barrier construct is in progress:
> https://reviews.llvm.org/D72962
> 
> I have tried to list below different topics of interest (to different
> people) around this work. Most of these are in the design phase (or very
> new) and multiple parties are interested with different sets of goals in
> mind.
> 
> I.  Flang frontend and its integration
> 
> II. Fortran representation in MLIR / FIR development
> 
> III. OpenMP development for flang,  OpenMP builder in LLVM.
> 
> IV. Loop Transformations in MLIR / LLVM with respect to OpenMP.
> 
> It looks like the design has evolved over time and there is no one place
> which contains the latest design decisions that fits all the different
> pieces of the puzzle. I will try to deduce it from the above mentioned
> references. Please correct me if I am referring to anything which has
> changed.
> 
> A. For most OpenMP design discussions, FIR examples are used (as seen in
> (2) and (3)). The MLIR examples mentioned in the design only talk about
> the FIR dialect and the LLVM dialect.
> 
> This completely ignores the likes of standard, affine (where most loop
> transformations are supposed to happen) and loop dialects. I think it is
> critical to decouple the OpenMP dialect development in MLIR from the
> current flang / FIR effort. It would be useful if someone can mention these
> examples using existing dialects in MLIR and also how the different
> transformations / lowerings are planned.
> 
> B. In the latest RFC (3), it is mentioned that the initial OpenMP dialect
> version will be as follows,
> 
>   omp.parallel {
>     omp.do {
>        fir.do %i = 0 to %ub3 : !fir.integer {
>         ...
>        }
>     }
>   }
> 
> and then after the "LLVM conversion" it is converted as follows:
> 
>   omp.parallel {
>     %ub3 =
>     omp.do %i = 0 to %ub3 : !llvm.integer {
>     ...
>     }
>   }
> 
> 
> a. Is it the same omp.do operation which now contains the bounds and
> induction variables of the loop after the LLVM conversion? If so, will the
> same operation have two different semantics during a single compilation?
> 
> b. Will there be different lowerings for various loop operations from
> different dialects? loop.for and affine.for under omp operations would need
> different OpenMP / LLVM lowerings. Currently, both of them are lowered to
> the CFG based loops during the LLVM dialect conversion (which is much
> before the proposed OpenMP dialect lowering).
> 
> There would be no standard way to represent OpenMP operations (especially
> the ones which involve loops) in MLIR. This would drastically complicate
> lowering.
> 
> C. It is also not mentioned how clauses like firstprivate, shared, private,
> reduction, map, etc. are lowered to the OpenMP dialect. The example in the
> RFC contains FIR and LLVM types and nothing about std dialect types.
> Consider the below example:
> 
> #pragma omp parallel for reduction(+:x)
> for (int i = 0; i < N; ++i)
>   x += a[i];
> 
> How would the above be represented in the OpenMP dialect, and what type
> would "x" be in MLIR? It is not mentioned in the design as to how the
> various SSA values for various OpenMP clauses are passed around in OpenMP
> operations.
> 
> D. Because of (A), (B) and (C), it would be beneficial to have an
> omp.parallel_do operation which has semantics similar to other loop
> structures (may not be LoopLikeInterface) in MLIR. To me, it looks like
> having OpenMP operations based on standard MLIR types and operations
> (scalars and memrefs mainly) is the right way to go.
> 
> Why not have omp.parallel_do operation with AffineMap based bounds, so as
> to decouple it from Value/Type similar to affine.for?
> 
> 1. With the current design, the number of transformations / optimizations
> that one can write on OpenMP constructs would become limited as there can
> be any custom loop structure with custom operations / types inside it.
> 
> 2. It would also be easier to transform the Loop nests containing OpenMP
> constructs if the body of the OpenMP operations is well defined (i.e., does
> not accept arbitrary loop structures). Having nested redundant "parallel",
> "target" and "do" regions seems unnecessary.
> 
> 3. There would also be new sets of loop structures in new dialects when
> C/C++ is compiled to MLIR. It would complicate the number of possible
> combinations inside the OpenMP region.
> 
> E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct
> lowering to LLVM IR ignoring all the advantages that MLIR provides. Being
> able to compile the code for heterogeneous hardware is one of the biggest
> advantages that MLIR brings to the table. That is being completely missed
> here. This also requires solving the problem of handling target information
> in MLIR. But that is a problem which needs to be solved anyway. Using GPU
> dialect also gives us an opportunity to represent offloading semantics in
> MLIR.
> 
> Given the ability to represent multiple ModuleOps and the existence of GPU
> dialect, couldn't higher level optimizations on offloaded code be done at
> MLIR level? The proposed design would lead us to the same problems that we
> are currently facing in LLVM IR.
> 
> Also, OpenMP codegen will automatically benefit from the GPU dialect based
> optimizations. For example, it would be way easier to hoist a memory
> reference out of GPU kernel in MLIR than in LLVM IR.