<div><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Feb 13, 2020 at 10:18 AM Johannes Doerfert via flang-dev <<a href="mailto:flang-dev@lists.llvm.org">flang-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Vinay,<br>

<br>

Thanks for taking an interest and the detailed discussion.<br>

<br>

To start, I'll pick a few paragraphs from your email to clarify a couple<br>

of things that led to the current design or that might otherwise need<br>

clarification. We can talk about other points later as well.<br>

<br>

[<br>

  Side notes:<br>

    1) I'm not an MLIR person.<br>

    2) It seems unfortunate that we do not have an mlir-dev list.</blockquote><div dir="auto"><br></div><div dir="auto">MLIR uses Discourse: llvm.discourse.group.</div><div dir="auto"><br></div><div dir="auto"><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

]<br>

<br>

<br>

> 1. With the current design, the number of transformations / optimizations<br>

> that one can write on OpenMP constructs would become limited as there can<br>

> be any custom loop structure with custom operations / types inside it.<br>

<br>

OpenMP, as an input language, does not make many assumptions about the<br>

code inside of constructs*. So, inside a parallel can be almost anything<br>

the base language has to offer, both lexically and dynamically.<br>

Assuming otherwise is not going to work. Analyzing a "generic" OpenMP<br>

representation in order to determine if it can be represented as a more<br>

restricted "op" seems at least plausible. You will run into various<br>

issues, some mentioned explicitly below. For starters, you still have to<br>

generate proper OpenMP runtime calls, e.g., from your GPU dialect, even<br>

if it is "just" to make sure the OMPD/OMPT interfaces expose useful<br>

information.<br>

<br>

<br>

* I exclude the `omp loop` construct here as it is not even implemented<br>

  anywhere as far as I know.<br>

<br>

<br>

> 2. It would also be easier to transform the Loop nests containing OpenMP<br>

> constructs if the body of the OpenMP operations is well defined (i.e., does<br>

> not accept arbitrary loop structures). Having nested redundant "parallel",<br>

> "target" and "do" regions seems unnecessary.<br>

<br>

As mentioned above, you cannot start with the assumption OpenMP input is<br>

structured this way. You have to analyze it first. This is the same<br>

reason we cannot simply transform C/C++ `for loops` into `affine.for`<br>

without proper analysis of the loop body.<br>
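<br>
As a rough sketch of why that analysis is needed (a hypothetical example, not taken from any real code base; `count_visited` and `skip` are made-up names), the C loop below has the header of a simple counted loop, yet its body advances the induction variable, so the trip count depends on the data and the loop cannot be mapped to `affine.for` as-is:<br>
<pre>
```c
/* Hypothetical example: the header looks like a counted loop, but the
   body may advance the induction variable, so the number of iterations
   depends on the contents of skip[]. */
int count_visited(const int *skip, int n) {
    int visited = 0;
    for (int i = 0; i < n; ++i) {
        ++visited;
        /* A nonzero skip[i] makes the loop jump over element i + 1. */
        if (skip[i] != 0) {
            ++i;
        }
    }
    return visited;
}
```
</pre>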

<br>

Now, more concrete. Nested parallel and target regions are not<br>

necessarily redundant, nor can/should we require the user not to have<br>

them. Nested parallelism can easily make sense, depending on the problem<br>

decomposition. Nested target will make a lot of sense with reverse<br>

offload, which is already in the standard, and it also should be allowed<br>

for the sake of a modular (user) code base.<br>

<br>

<br>

> 3. There would also be new sets of loop structures in new dialects when<br>

> C/C++ is compiled to MLIR. It would complicate the number of possible<br>

> combinations inside the OpenMP region.<br>

<br>

Is anyone working on this? If so, what is the timeline? I personally was<br>

not expecting Clang to switch over to MLIR any time soon but I am happy<br>

if someone wants to correct me on this. I mention this only because it<br>

interacts with the arguments I will make below.<br>

<br>

<br>

> E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct<br>

> lowering to LLVM IR ignoring all the advantages that MLIR provides. Being<br>

> able to compile the code for heterogeneous hardware is one of the biggest<br>

> advantages that MLIR brings to the table. That is being completely missed<br>

> here. This also requires solving the problem of handling target information<br>

> in MLIR. But that is a problem which needs to be solved anyway. Using GPU<br>

> dialect also gives us an opportunity to represent offloading semantics in<br>

> MLIR.<br>

<br>

I'm unsure what the problem with "handling target information in MLIR" is but<br>

whatever design we end up with, we need to know about the target<br>

(triple) in all stages of the pipeline, even if it is just to pass it<br>

down.<br>

<br>

<br>

> Given the ability to represent multiple ModuleOps and the existence of GPU<br>

> dialect, couldn't higher level optimizations on offloaded code be done at<br>

> MLIR level? The proposed design would lead us to the same problems that we<br>

> are currently facing in LLVM IR.<br>

><br>

> Also, OpenMP codegen will automatically benefit from the GPU dialect based<br>

> optimizations. For example, it would be way easier to hoist a memory<br>

> reference out of GPU kernel in MLIR than in LLVM IR.<br>

<br>

While I agree with the premise that you can potentially reuse MLIR<br>

transformations, it might not be as simple in practice.<br>

<br>

As mentioned above, you cannot assume much about OpenMP codes, almost<br>

nothing for a lot of application codes I have seen. Some examples:<br>

<br>

If you have a function call, or any synchronization event for that<br>

matter, located between two otherwise adjacent target regions (see<br>

below), you cannot assume the two target regions will be offloaded to<br>

the same device.<br>

```<br>

  #pragma omp target<br>

  {}<br>

  foo();<br>

  #pragma omp target<br>

  {}<br>

```<br>

Similarly, you cannot assume an `omp parallel` is allowed to be executed<br>

with more than a single thread, or that an `omp [parallel] for` does not<br>

have loop-carried data dependences, ...<br>

Data-sharing attributes are also something that has to be treated<br>

carefully:<br>

```<br>

x = 5;<br>

#pragma omp task<br>

  x = 3;<br>

print(x);<br>

```<br>

Should print 5, not 3.<br>
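<br>
A minimal C sketch of the same point, under stated assumptions: the function name `value_after_task` is hypothetical, and the `firstprivate(x)` clause is spelled out explicitly here so the behavior does not rely on the implicit data-sharing rules. The task assigns to its own private copy of `x`, so the value seen after the task is unchanged:<br>
<pre>
```c
/* Hypothetical helper: returns the value of x that the encountering
   thread observes once the task has completed. */
int value_after_task(void) {
    int x = 5;

    /* The task receives a private copy of x, initialized to 5. */
    #pragma omp task firstprivate(x)
    {
        x = 3; /* modifies only the task's private copy */
    }

    /* Wait for the child task before reading x again. */
    #pragma omp taskwait

    return x;
}
```
</pre>
Built with OpenMP enabled (for example with `-fopenmp` in GCC or Clang), the function returns 5. Without it the pragmas are ignored and it returns 3, which is exactly the difference a sequentializing transformation would have to account for.<br>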

<br>

I hope I convinced you that OpenMP is not trivially mappable to existing<br>

dialects without proper analysis. If not, please let me know why you<br>

expect it to be.<br>

<br>

<br>

Now when it comes to code analyses, LLVM-IR offers a variety of<br>

interesting features, ranging from a mature set of passes to the<br>

cross-language LTO capabilities. We are working on the missing parts,<br>

e.g., heterogeneous llvm::Modules as we speak. Simple OpenMP<br>

optimizations are already present in LLVM and interesting ones have been<br>

prototyped for a while now (let me know if you want to see more<br>

not-yet-merged patches/optimizations). I also have papers, results, and<br>

talks that might be interesting here. Let me know if you need pointers<br>

to them.<br>

<br>

<br>

Cheers,<br>

  Johannes<br>

<br>

<br>

<br>

On 02/13, Vinay Madhusudan via llvm-dev wrote:<br>

> Hi,<br>

> <br>

> I have a few questions / concerns regarding the design of the OpenMP dialect in<br>

> MLIR that is currently being implemented, mainly for the f18 compiler.<br>

> Below, I summarize the current state of various efforts in clang / f18 /<br>

> MLIR / LLVM regarding this. Feel free to add to the list in case I have<br>

> missed something.<br>

> <br>

> 1. [May 2019] An OpenMPIRBuilder in LLVM was proposed for flang and clang<br>

> frontends. Note that this proposal was before considering MLIR for FIR.<br>

> <br>

> a. llvm-dev proposal :<br>

> <a href="http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html" rel="noreferrer" target="_blank">http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html</a><br>

> <br>

> b. Patches in review: <a href="https://reviews.llvm.org/D70290" rel="noreferrer" target="_blank">https://reviews.llvm.org/D70290</a>. This also includes<br>

> the clang codegen changes.<br>

> <br>

> 2.  [July - September 2019] OpenMP dialect for MLIR was discussed /<br>

> proposed with respect to the f18 compilation stack (keeping FIR in mind).<br>

> <br>

> a. flang-dev discussion link:<br>

> <a href="https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html" rel="noreferrer" target="_blank">https://lists.llvm.org/pipermail/flang-dev/2019-September/000020.html</a><br>

> <br>

> b. Design decisions captured in PPT:<br>

> <a href="https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view" rel="noreferrer" target="_blank">https://drive.google.com/file/d/1vU6LsblsUYGA35B_3y9PmBvtKOTXj1Fu/view</a><br>

> <br>

> c. MLIR google groups discussion:<br>

> <a href="https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw" rel="noreferrer" target="_blank">https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw</a><br>

> <br>

> d. Target constructs  design:<br>

> <a href="http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html" rel="noreferrer" target="_blank">http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000285.html</a><br>

> <br>

> e. SIMD constructs design:<br>

> <a href="http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html" rel="noreferrer" target="_blank">http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-September/000278.html</a><br>

> <br>

> 3.  [Jan 2020] OpenMP dialect RFC in llvm discourse :<br>

> <a href="https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397" rel="noreferrer" target="_blank">https://llvm.discourse.group/t/rfc-openmp-dialect-in-mlir/397</a><br>

> <br>

> 4.  [Jan- Feb 2020] Implementation of OpenMP dialect in MLIR:<br>

> <br>

> a. The first patch which introduces the OpenMP dialect was pushed.<br>

> <br>

> b. Review of barrier construct is in progress:<br>

> <a href="https://reviews.llvm.org/D72962" rel="noreferrer" target="_blank">https://reviews.llvm.org/D72962</a><br>

> <br>

> I have tried to list below different topics of interest (to different<br>

> people) around this work. Most of these are in the design phase (or very<br>

> new) and multiple parties are interested with different sets of goals in<br>

> mind.<br>

> <br>

> I.  Flang frontend and its integration<br>

> <br>

> II. Fortran representation in MLIR / FIR development<br>

> <br>

> III. OpenMP development for flang,  OpenMP builder in LLVM.<br>

> <br>

> IV. Loop Transformations in MLIR / LLVM with respect to OpenMP.<br>

> <br>

> It looks like the design has evolved over time and there is no one place<br>

> which contains the latest design decisions that fit all the different<br>

> pieces of the puzzle. I will try to deduce it from the above mentioned<br>

> references. Please correct me if I am referring to anything which has<br>

> changed.<br>

> <br>

> A. For most OpenMP design discussions, FIR examples are used (as seen in<br>

> (2) and (3)). The MLIR examples mentioned in the design only talk about<br>

> FIR dialect and LLVM dialect.<br>

> <br>

> This completely ignores the likes of standard, affine (where most loop<br>

> transformations are supposed to happen) and loop dialects. I think it is<br>

> critical to decouple the OpenMP dialect development in MLIR from the<br>

> current flang / FIR effort. It would be useful if someone can mention these<br>

> examples using existing dialects in MLIR and also how the different<br>

> transformations / lowerings are planned.<br>

> <br>

> B. In the latest RFC (3), it is mentioned that the initial OpenMP dialect<br>

> version will be as follows,<br>

> <br>

>   omp.parallel {<br>

> <br>

>     omp.do {<br>

> <br>

>        fir.do %i = 0 to %ub3 : !fir.integer {<br>

> <br>

>         ...<br>

> <br>

>        }<br>

> <br>

>     }<br>

> <br>

>   }<br>

> <br>

> and then after the "LLVM conversion" it is converted as follows:<br>

> <br>

>   omp.parallel {<br>

> <br>

>     %ub3 =<br>

> <br>

>     omp.do %i = 0 to %ub3 : !llvm.integer {<br>

> <br>

>     ...<br>

> <br>

>     }<br>

> <br>

>   }<br>

> <br>

> <br>

> a. Is it the same omp.do operation which now contains the bounds and<br>

> induction variables of the loop after the LLVM conversion? If so, will the<br>

> same operation have two different semantics during a single compilation?<br>

> <br>

> b. Will there be different lowerings for various loop operations from<br>

> different dialects? loop.for and affine.for under omp operations would need<br>

> different OpenMP / LLVM lowerings. Currently, both of them are lowered to<br>

> the CFG based loops during the LLVM dialect conversion (which is much<br>

> before the proposed OpenMP dialect lowering).<br>

> <br>

> There would be no standard way to represent OpenMP operations (especially<br>

> the ones which involve loops) in MLIR. This would drastically complicate<br>

> lowering.<br>

> <br>

> C. It is also not mentioned how clauses like firstprivate, shared, private,<br>

> reduction, map, etc. are lowered to OpenMP dialect. The example in the RFC<br>

> contains FIR and LLVM types and nothing about std dialect types. Consider<br>

> the below example:<br>

> <br>

> #pragma omp parallel for reduction(+:x)<br>

> <br>

> for (int i = 0; i < N; ++i)<br>

> <br>

>   x += a[i];<br>

> <br>

> How would the above be represented in OpenMP dialect? And what type would<br>

> "x" be in MLIR?  It is not mentioned in the design as to how the various<br>

> SSA values for various OpenMP clauses are passed around in OpenMP<br>

> operations.<br>

> <br>

> D. Because of (A), (B) and (C), it would be beneficial to have an<br>

> omp.parallel_do operation which has semantics similar to other loop structures<br>

> (may not be LoopLikeInterface) in MLIR. To me, it looks like having OpenMP<br>

> operations based on standard MLIR types and operations (scalars and memrefs<br>

> mainly) is the right way to go.<br>

> <br>

> Why not have an omp.parallel_do operation with AffineMap based bounds, so as<br>

> to decouple it from Value/Type similar to affine.for?<br>

> <br>

> 1. With the current design, the number of transformations / optimizations<br>

> that one can write on OpenMP constructs would become limited as there can<br>

> be any custom loop structure with custom operations / types inside it.<br>

> <br>

> 2. It would also be easier to transform the Loop nests containing OpenMP<br>

> constructs if the body of the OpenMP operations is well defined (i.e., does<br>

> not accept arbitrary loop structures). Having nested redundant "parallel",<br>

> "target" and "do" regions seems unnecessary.<br>

> <br>

> 3. There would also be new sets of loop structures in new dialects when<br>

> C/C++ is compiled to MLIR. It would complicate the number of possible<br>

> combinations inside the OpenMP region.<br>

> <br>

> E. Lowering of target constructs mentioned in ( 2(d) ) specifies direct<br>

> lowering to LLVM IR ignoring all the advantages that MLIR provides. Being<br>

> able to compile the code for heterogeneous hardware is one of the biggest<br>

> advantages that MLIR brings to the table. That is being completely missed<br>

> here. This also requires solving the problem of handling target information<br>

> in MLIR. But that is a problem which needs to be solved anyway. Using GPU<br>

> dialect also gives us an opportunity to represent offloading semantics in<br>

> MLIR.<br>

> <br>

> Given the ability to represent multiple ModuleOps and the existence of GPU<br>

> dialect, couldn't higher level optimizations on offloaded code be done at<br>

> MLIR level? The proposed design would lead us to the same problems that we<br>

> are currently facing in LLVM IR.<br>

> <br>

> Also, OpenMP codegen will automatically benefit from the GPU dialect based<br>

> optimizations. For example, it would be way easier to hoist a memory<br>

> reference out of GPU kernel in MLIR than in LLVM IR.<br>


</blockquote></div></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Thank you,<br>  River Riddle</div>