<font size=2 face="sans-serif">Hi Johannes,</font><br><br><font size=2 face="sans-serif">Thank you for the explanation.</font><br><br><font size=2 face="sans-serif">I think we need to clarify some details
about code generation in Clang today:</font><br><br><font size=2 face="sans-serif">1. non-SPMD mode, or generic mode, uses
the master-worker code gen scheme where the master thread and the worker
threads are <b>disjoint </b>sets of threads (when one set runs, the other set is blocked and doesn't participate in the execution):</font><br><br><font size=2 face="Courier">workers | master</font><br><font size=2 face="Courier">====================</font><br><font size=2 face="Courier">BLOCKED | <b>RUNNING</b></font><br><font size=2 face="Courier">------- sync -------</font><br><font size=2 face="Courier"><b>RUNNING </b>| BLOCKED</font><br><font size=2 face="Courier">------- sync -------</font><br><font size=2 face="Courier">BLOCKED | <b>RUNNING</b></font><br><br><br><font size=2 face="sans-serif">2. the worker threads, in their RUNNING
state above, contain a state machine which chooses the parallel region
to be executed. Today this choice happens in one of two ways: explicit
targets (where you know what outlined region you are calling and you just
call it) and indirect targets (via a function pointer set by the master thread in one of its RUNNING regions):</font><br><br><font size=2 face="Courier">workers | master</font><br><font size=2 face="Courier">====================</font><br><font size=2 face="Courier">BLOCKED | <b>RUNNING</b></font><br><font size=2 face="Courier">------- sync -------</font><br><font size=2 face="Courier"><b>RUNNING </b>|</font><br><font size=2 face="Courier"><b> state </b>| BLOCKED</font><br><font size=2 face="Courier"><b>machine </b>|</font><br><font size=2 face="Courier">------- sync -------</font><br><font size=2 face="Courier">BLOCKED | <b>RUNNING</b></font><br><br><br><font size=2 face="sans-serif">Your intended changes (only target the
RUNNING state machine of the WORKERS):</font><br><font size=2 face="sans-serif">- remove explicit targets from current
code gen. (by itself this is a major step back!!)</font><br><font size=2 face="sans-serif">- introduce a pass in LLVM which will
add back the explicit targets.</font><br><br><font size=2 face="sans-serif">Can you point out any major improvements
this will bring compared to the current state?</font><br><font size=2 face="sans-serif">From your answer below you mention a lower number of function calls. Since today we inline everything anyway, how does that help?</font><br><font size=2 face="sans-serif">If you haven't considered performance so far, what led you to propose all these changes?</font><br><br><br><font size=2 face="sans-serif">In SPMD mode all threads execute the
same code. Using the notation in the schemes above you can depict this
as:</font><br><br><font size=2 face="Courier"> all threads</font><br><font size=2 face="Courier">====================</font><br><font size=2 face="Courier"><b> RUNNING</b></font><br><br><font size=2 face="sans-serif">No state machine is used and there are no disjoint sets of threads. This is as if you're executing CUDA code.</font><br><br><font size=2 face="sans-serif">Could you explain what your proposed
changes are in this context?</font><br><font size=2 face="sans-serif">Could you also explain what you mean
by "</font><tt><font size=2>assuming SPMD wasn't achieved</font></tt><font size=2 face="sans-serif">"?</font><br><font size=2 face="sans-serif">Do you expect to write another LLVM
pass which will transform the master-worker scheme + state machine into
an SPMD scheme?</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru</font><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">"Doerfert, Johannes"
<jdoerfert@anl.gov></font><br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">Gheorghe-Teod Bercea
<Gheorghe-Teod.Bercea@ibm.com></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:
</font><font size=1 face="sans-serif">Alexey Bataev <a.bataev@outlook.com>,
Guray Ozen <gozen@nvidia.com>, "Gregory.Rodgers@amd.com"
<Gregory.Rodgers@amd.com>, "Finkel, Hal J." <hfinkel@anl.gov>,
"kli@ca.ibm.com" <kli@ca.ibm.com>, "openmp-dev@lists.llvm.org"
<openmp-dev@lists.llvm.org>, LLVM-Dev <llvm-dev@lists.llvm.org>,
"cfe-dev@lists.llvm.org" <cfe-dev@lists.llvm.org></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">01/30/2019 07:56 PM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [RFC] Late
(OpenMP) GPU code "SPMD-zation"</font><br><hr noshade><br><br><br><tt><font size=2>Hi Doru,<br><br>[+ llvm-dev and cfe-dev]<br><br>On 01/30, Gheorghe-Teod Bercea wrote:<br>> Hi Johannes,<br>> <br>> First of all thanks for looking into the matter of improving non-SPMD
mode!<br>> <br>> I have a question regarding the state machine that you said you'd
like to<br>> replace/improve. There are cases (such as target regions that span
multiple<br>> compilation units) where the switch statement is required. Is this
something<br>> that your changes will touch in any way?<br><br>There will not be a difference. Let me explain in some detail as there<br>seems to be a lot of confusion on this state machine topic:<br><br>Now:<br><br>Build a state machine in the user code (module) with all the parallel<br>regions as explicit targets of the switch statement and a fallback<br>default that does an indirect call to the requested parallel region.<br><br><br>Proposed, after Clang:<br><br>Use the runtime state machine implementation [0] which reduces the<br>switch to the default case, thus an indirect call to the requested<br>parallel region. This will always work, regardless of the translation<br>unit that contained the parallel region (pointer).<br><br>Proposed, after OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):<br><br>All reachable parallel regions in a kernel are collected and used to<br>create the switch statement in the user code (module) [1, line 111] with<br>a fallback if there are potentially [1, line 212] hidden parallel<br>regions.<br><br><br>Does that make sense?<br><br><br>[0] </font></tt><a href="https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz"><tt><font size=2>https://reviews.llvm.org/D57460#change-e9Ljd9RgdWYz</font></tt></a><tt><font size=2><br>[1] </font></tt><a href="https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B"><tt><font size=2>https://reviews.llvm.org/D57460#change-8gnnGNfJVR4B</font></tt></a><tt><font size=2><br><br><br>> My next question is, for the workloads which are in the same compilation
unit<br>> there is a trick that code gen performs (or could perform, I'm not sure if this<br>> has been upstreamed) where it can check for the specific name of an
outlined<br>> function and then just call it directly thus making that function
inline-able<br>> (thus erasing most if not all the overhead of having the state machine
in the<br>> first place). In other words the "worst" part of the switch
statement will only<br>> apply to outlined functions from other compilation units. With this
in mind<br>> what would the impact of your changes be in the end? If this part
isn't clear I<br>> can do some digging to find out how this actually works in more detail; it's<br>> been too long since I've had to look at this part.<br><br>See the answer above.<br><br><br>> Can you share some performance numbers given an example you have been
looking<br>> at? I see you have one that uses "#pragma omp atomic". I
would avoid using<br>> something like that since it may have other overheads not related
to your<br>> changes. I would put together an example with this directive structure:<br>> <br>> #pragma omp target teams distribute<br>> for(...){<br>> <code1><br>> #pragma omp parallel for<br>> for(...) {<br>> <code2><br>> }<br>> <code3><br>> }<br>> <br>> which forces the use of the master-worker scheme (non-SPMD mode) without
any<br>> other distractions.<br><br>The atomic stuff I used to determine correctness. I haven't yet looked<br>at performance. I will do so now and inform you of my results.<br><br><br>> It would then be interesting to understand how you plan to change
the LLVM code<br>> generated for this,<br><br>The examples show what the LLVM-IR is supposed to look like, right?<br><br>> what the overheads that you're targeting are (register usage,<br>> synchronization cost etc), and then what the performance gain is<br>> compared to the current scheme.<br><br>I can also compare register usage in addition to performance, but there<br>is no difference in synchronization. The number and (relative) order of<br>original runtime library calls stays the same. The number of user code<br>-> runtime library calls is even decreased.<br><br><br>Please let me know if this helps and what questions remain.<br><br>Thanks,<br> Johannes<br><br><br> <br>> From: "Doerfert, Johannes" <jdoerfert@anl.gov><br>> To: Alexey Bataev <a.bataev@outlook.com><br>> Cc: Guray Ozen <gozen@nvidia.com>,
Gheorghe-Teod Bercea<br>> <gheorghe-teod.bercea@ibm.com>, "openmp-dev@lists.llvm.org"<br>> <openmp-dev@lists.llvm.org>, "Finkel, Hal J." <hfinkel@anl.gov>,<br>> "Gregory.Rodgers@amd.com" <Gregory.Rodgers@amd.com>,
"kli@ca.ibm.com"<br>> <kli@ca.ibm.com><br>> Date: 01/30/2019 04:14 PM<br>> Subject: Re: [RFC] Late (OpenMP) GPU code
"SPMD-zation"<br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> <br>> <br>> I don't really see "many ifs and maybes", actually none.<br>> <br>> Anyway, I will now work on a patch set that adds the new functionality
under a<br>> cmd flag<br>> in order to showcase correctness and performance on real code.<br>> <br>> If you, or somebody else, have interesting examples, please feel free
to point<br>> me at them.<br>> <br>> Thanks,<br>> Johannes<br>> <br>> <br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> From: Alexey Bataev <a.bataev@outlook.com><br>> Sent: Wednesday, January 30, 2019 2:18:19 PM<br>> To: Doerfert, Johannes<br>> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org; Finkel,
Hal<br>> J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com<br>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"<br>> <br>> Currently, there are too many "if"s and "maybe"s.
If you can provide a solution<br>> that does not break anything and does not affect the performance,
does not<br>> require changes in the backend - then go ahead with the patches.<br>> <br>> -------------<br>> Best regards,<br>> Alexey Bataev<br>> 30.01.2019 14:49, Doerfert, Johannes :<br>> No, SPMD mode will not be affected at all.<br>> <br>> The "worse" part is the following:<br>> If we inline runtime library calls before the openmp-opt pass
has had a chance to<br>> look at the code,<br>> we will not have a customized state machine for the __non-SPMD__
case. That<br>> is, the if-cascade<br>> checking the work function pointer is not there.<br>> <br>> Avoiding this potential performance decline is actually very easy.
While we do<br>> not have the "inline_late" capability,<br>> run the openmp-opt pass __before__ the inliner and we will not get
"worse"<br>> code. We might however miss out on<br>> _new_ non-SPMD -> SPMD transformations.<br>> <br>> <br>> Does that make sense?<br>> <br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> From: Alexey Bataev <a.bataev@outlook.com><br>> Sent: Wednesday, January 30, 2019 1:44:10 PM<br>> To: Doerfert, Johannes<br>> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org; Finkel,
Hal<br>> J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com<br>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"<br>> <br>> Any "worse" is not a good idea. We need to avoid it. It
would be good that the<br>> new code did not affect the performance, especially for SPMD mode
(I think,<br>> this "worse" will affect exactly SPMD mode, no?)<br>> <br>> -------------<br>> Best regards,<br>> Alexey Bataev<br>> 30.01.2019 14:38, Doerfert, Johannes :<br>> The LLVM optimization (openmp-opt), which does non-SPMD -> SPMD
and custom<br>> state machine generation, will not fire if<br>> the __kernel_general_... calls are "missing". Thus if we
inline "too early", we<br>> are "stuck" with the non-SPMD choice (not worse than
than what we<br>> have right now). Does that make sense?<br>> <br>> The second option described what I want to see us do "later"
in order to avoid<br>> the above scenario and always get both,<br>> openmp-opt and inlining of the runtime and work functions.<br>> <br>> <br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> From: Alexey Bataev <a.bataev@outlook.com><br>> Sent: Wednesday, January 30, 2019 1:25:42 PM<br>> To: Doerfert, Johannes<br>> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org; Finkel,
Hal<br>> J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com<br>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"<br>> <br>> Sorry, did not understand your answer correctly. But you wrote:<br>> for now, not doing the optimization is just fine.<br>> What do you mean?<br>> <br>> -------------<br>> Best regards,<br>> Alexey Bataev<br>> 30.01.2019 14:23, Doerfert, Johannes :<br>> Alexey,<br>> <br>> I'm not sure how to interpret "Bad idea!", but I think there
is again a<br>> misunderstanding.<br>> To help me understand, could you try to elaborate a bit?<br>> <br>> To make my last email clear:<br>> I __do__ want inlining. Both answers to your earlier inlining questions
do<br>> actually assume the runtime library calls __are eventually inlined__,<br>> that is why I mentioned LTO and the runtime as bitcode.<br>> .<br>> Cheers,<br>> Johannes<br>> <br>> <br>> <br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> From: Alexey Bataev <a.bataev@outlook.com><br>> Sent: Wednesday, January 30, 2019 1:14:56 PM<br>> To: Doerfert, Johannes<br>> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org; Finkel,
Hal<br>> J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com<br>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"<br>> <br>> Bad idea!<br>> <br>> -------------<br>> Best regards,<br>> Alexey Bataev<br>> 30.01.2019 14:11, Doerfert, Johannes :<br>> Sure I do. Why do you think I don't?<br>> <br>> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━<br>> <br>> From: Alexey Bataev <a.bataev@outlook.com><br>> Sent: Wednesday, January 30, 2019 1:00:59 PM<br>> To: Doerfert, Johannes<br>> Cc: Guray Ozen; Gheorghe-Teod Bercea; openmp-dev@lists.llvm.org; Finkel,
Hal<br>> J.; Gregory.Rodgers@amd.com; kli@ca.ibm.com<br>> Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"<br>> <br>> You don't want to do the inlining?<br>> <br>> -------------<br>> Best regards,<br>> Alexey Bataev<br>> 30.01.2019 13:59, Doerfert, Johannes :<br>> - for now, not doing the optimization is just fine. The whole idea
is that code<br>> is always valid.<br>> <br>> <br><br>-- <br><br>Johannes Doerfert<br>Researcher<br><br>Argonne National Laboratory<br>Lemont, IL 60439, USA<br><br>jdoerfert@anl.gov<br>[attachment "signature.asc" deleted by Gheorghe-Teod Bercea/US/IBM]
</font></tt><br><br><BR>