<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=KOI8-R">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p><br>

    </p>

    <pre class="moz-signature" cols="72">-------------

Best regards,

Alexey Bataev</pre>

    <div class="moz-cite-prefix">13.03.2019 15:35, Doerfert, Johannes

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:DM5PR09MB37335C489F2582FF54D7049FBA4A0@DM5PR09MB3733.namprd09.prod.outlook.com">

      <meta http-equiv="Content-Type" content="text/html;

        charset=KOI8-R">

      <style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>

      <div id="divtagdefaultwrapper"

style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;"

        dir="ltr">

        <p style="margin-top:0;margin-bottom:0">Hi Alexey,</p>

        <p style="margin-top:0;margin-bottom:0"><br>

        </p>

        <p style="margin-top:0;margin-bottom:0">Thank you for your quick

          feedback. <br>

        </p>

        <p style="margin-top:0;margin-bottom:0"><br>

        </p>

        <p style="margin-top:0;margin-bottom:0"><span>&gt; There are

            tooooooo(!) many changes, I don't know who's going to review

            such a big patch.

            <br>

          </span></p>

        <p style="margin-top:0;margin-bottom:0"><span></span><br>

        </p>

        <p style="margin-top:0;margin-bottom:0">I can certainly split it

          into the three components/repositories that are touched: clang,

          llvm, and openmp.</p>

        <p style="margin-top:0;margin-bottom:0">I feared it would then be

          harder to navigate the code in order to see the connection

          points.</p>

        <p style="margin-top:0;margin-bottom:0">I am a bit amazed by

          your hyperbole, though, given that the complexity is not that

          high</p>

        <p style="margin-top:0;margin-bottom:0">due to the absence of

          modified or removed lines. Anyway, you seem to have very

          strong feelings about</p>

        <p style="margin-top:0;margin-bottom:0">this, so I am open to

          suggestions on how to split it up.</p>

        <p style="margin-top:0;margin-bottom:0"><br>

        </p>

      </div>

    </blockquote>

    <p><br>

    </p>

    <p>1. You definitely need to split it into separate patches for

      different components.</p>

    <p>2. Even within those components, this patch must be split into

      several small patches; such big patches are very hard to review.<br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:DM5PR09MB37335C489F2582FF54D7049FBA4A0@DM5PR09MB3733.namprd09.prod.outlook.com">

      <div id="divtagdefaultwrapper"

style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;"

        dir="ltr">

        <p style="margin-top:0;margin-bottom:0">

        </p>

        <p style="margin-top:0;margin-bottom:0">&gt; <span>Also, I

            don't like the idea of adding one more class for NVPTX

            codegen.

            <span>All your changes should be on top of the existing

              solution.</span></span><br>

        </p>

        <div><br>

        </div>

        <div>Could you please explain to me why? This will only make

          everything more complicated and entangled.

          <br>

        </div>

        <div>Also, the new class is supposed to be "target agnostic", so

          a new offloading target, e.g., AMD GPUs, could easily reuse</div>

        <div>the new code, while the old code is sprinkled with

          NVPTX-specific details, e.g., function calls, constants, etc.</div>

        <div><br>

        </div>

      </div>

    </blockquote>

    <p>1. As far as I know, even now the NVPTX codegen can be reused for

      AMD GPUs with some small changes. <br>

    </p>

    <p>2. Your patch is about codegen for NVPTX, so you must change the

      existing codegen rather than introduce a new one for the same

      target. There is no point in maintaining two different codegens

      for one target.<br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:DM5PR09MB37335C489F2582FF54D7049FBA4A0@DM5PR09MB3733.namprd09.prod.outlook.com">

      <div id="divtagdefaultwrapper"

style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;"

        dir="ltr">

        <div>

        </div>

        <div><br>

        </div>

        <div id="Signature">Thanks again,</div>

        <div>  Johannes<br>

        </div>

      </div>

      <hr style="display:inline-block;width:98%" tabindex="-1">

      <div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt"

          face="Calibri, sans-serif" color="#000000"><b>From:</b> Alexey

          Bataev <a class="moz-txt-link-rfc2396E" href="mailto:a.bataev@outlook.com">&lt;a.bataev@outlook.com&gt;</a><br>

          <b>Sent:</b> Wednesday, March 13, 2019 2:15:39 PM<br>

          <b>To:</b> Doerfert, Johannes; <a class="moz-txt-link-abbreviated" href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a><br>

          <b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:openmp-dev@lists.llvm.org">openmp-dev@lists.llvm.org</a>; LLVM-Dev; Finkel, Hal

          J.; Alexey Bataev; Arpith Chacko Jacob<br>

          <b>Subject:</b> Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"</font>

        <div> </div>

      </div>

      <div style="background-color:#FFFFFF">

        <p>There are tooooooo(!) many changes, I don't know who's going to

          review such a big patch. You definitely need to split it into

          several smaller patches. Also, I don't like the idea of adding

          one more class for NVPTX codegen. All your changes should be

          on top of the existing solution.<br>

        </p>

        <pre class="x_moz-signature" cols="72">-------------

Best regards,

Alexey Bataev</pre>

        <div class="x_moz-cite-prefix">13.03.2019 15:08, Doerfert,

          Johannes wrote:<br>

        </div>

        <blockquote type="cite">

          <pre class="x_moz-quote-pre">Please consider reviewing the code for the proposed approach here:

  <a class="x_moz-txt-link-freetext" href="https://reviews.llvm.org/D57460" moz-do-not-send="true">https://reviews.llvm.org/D57460</a>

Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the

Rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode was

enabled automatically. The code in nw is conceptually equivalent to the

first example in the "to_SPMD_mode.ll" test case that can be found here:

  <a class="x_moz-txt-link-freetext" href="https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid" moz-do-not-send="true">https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid</a>

The implementation is missing key features but one should be able to see

the overall design by now. Once accepted, the missing features and more

optimizations will be added.

On 01/22, Johannes Doerfert wrote:

</pre>

          <blockquote type="cite">

            <pre class="x_moz-quote-pre">Where we are

------------

Currently, when we generate OpenMP target offloading code for GPUs, we

use sufficient syntactic criteria to decide between two execution modes:

  1)      SPMD -- All target threads (in an OpenMP team) run all the code.

  2) "Guarded" -- The master thread (of an OpenMP team) runs the user

                  code. If an OpenMP distribute region is encountered, i.e.,

                  if all threads (in the OpenMP team) are supposed to

                  execute the region, the master wakes up the idling

                  worker threads and points them to the correct piece of

                  code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.

However, depending on the code, that might not be valid, or we might

just not know whether it is valid during the Clang code generation phase.

The implementation of the "guarded" execution mode follows roughly the

state machine description in [1], though the implementation is different

(more general) nowadays.
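
To make the contrast concrete, here is a toy Python model of the two
execution modes. It is only a sketch: the function names, the step
encoding, and the log format are invented for illustration and do not
correspond to Clang or OpenMP runtime code.

```python
# Toy model only; run_guarded, run_spmd, and the (kind, name) step
# encoding are hypothetical and not part of Clang or the OpenMP runtime.

MASTER = 0


def run_guarded(num_threads, steps):
    """Guarded mode: the master runs user code, workers join team-wide work."""
    log = []
    for kind, name in steps:
        if kind == "team_wide":
            # The master wakes the idle workers and points them at the region.
            for tid in range(num_threads):
                log.append((tid, name))
        else:
            # Sequential user code is executed by the master thread only.
            log.append((MASTER, name))
    return log


def run_spmd(num_threads, steps):
    """SPMD mode: every thread in the team executes every step."""
    log = []
    for _kind, name in steps:
        for tid in range(num_threads):
            log.append((tid, name))
    return log


steps = [("serial", "init"), ("team_wide", "body")]
print(run_guarded(2, steps))
print(run_spmd(2, steps))
```

In this model the two modes differ only in who executes the "serial"
steps, which suggests why SPMD mode is only valid when running those
steps on every thread is harmless.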

What we want

------------

Increase the amount of code executed in SPMD mode and the use of

lightweight "guarding" schemes where appropriate.

How we (could) get there

------------------------

We propose the following two modifications in order:

  1) Move the state machine logic into the OpenMP runtime library. That

     means in SPMD mode all device threads will start the execution of

     the user code, thus emerge from the runtime, while in guarded mode

     only the master will escape the runtime and the other threads will

     idle in their state machine code that is now just "hidden".

     Why:

     - The state machine code cannot be (reasonably) optimized anyway,

       moving it into the library shouldn't hurt runtime but might even

       improve compile time a little bit.

     - The change should also simplify the Clang code generation as we

       would generate structurally the same code for both execution modes

       but only the runtime library calls, or their arguments, would

       differ between them.

     - The reason we should not "just start in SPMD mode" and "repair"

       it later is simple: this way we always have semantically correct

       and executable code.

     - Finally, and most importantly, there is now only a small

       difference (see above) between the two modes in the code

       generated by clang. If we later analyze the code trying to decide

       if we can use SPMD mode instead of guarded mode, the analysis and

       transformation become much simpler.

 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,

    e.g., through the runtime library calls used, and that tries to

    convert it into the SPMD mode potentially by introducing lightweight

    guards in the process.

    Why:

    - After the inliner, and the canonicalizations, we have a clearer

      picture of the code that is actually executed in the target

      region and all the side effects it contains. Thus, we can make an

      educated decision on the required amount of guards that prevent

      unwanted side effects from happening after a move to SPMD mode.

    - At this point we can more easily introduce different schemes to

      avoid side effects by threads that were not supposed to run. We

      can decide if a state machine is needed, conditionals should be

      employed, masked instructions are appropriate, or "dummy" local

      storage can be used to hide the side effect from the outside

      world.
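
The guard selection described in 2) can be sketched as a toy pass over a
list of steps. This is an illustration under stated assumptions, not the
proposed LLVM pass: the function name spmdize, the step encoding, and
the side-effect flag are all hypothetical, and a real transformation
would likely also need synchronization around guarded code.

```python
# Hypothetical sketch; spmdize and the (kind, name, has_side_effect)
# encoding are invented for illustration, not taken from the patch.


def spmdize(steps):
    """Plan an SPMD execution, guarding only what must stay master-only."""
    plan = []
    for kind, name, has_side_effect in steps:
        if kind == "serial" and has_side_effect:
            # A lightweight guard: only the master performs the side effect.
            plan.append((name, "master_only"))
        else:
            # Team-wide regions and side-effect-free code run on all threads.
            plan.append((name, "all_threads"))
    return plan


steps = [
    ("serial", "compute_bounds", False),
    ("serial", "store_flag", True),
    ("team_wide", "loop_body", True),
]
print(spmdize(steps))
```

Only the step with an externally visible side effect ends up guarded;
everything else is executed redundantly by all threads, so no state
machine is needed in this toy case.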

None of this has been implemented yet, but we plan to start in the

immediate future. Any comments, ideas, or criticism are welcome!

Cheers,

  Johannes

P.S. [2-4] provide further information on implementation and features.

[1] <a class="x_moz-txt-link-freetext" href="https://ieeexplore.ieee.org/document/7069297" moz-do-not-send="true">https://ieeexplore.ieee.org/document/7069297</a>

[2] <a class="x_moz-txt-link-freetext" href="https://dl.acm.org/citation.cfm?id=2833161" moz-do-not-send="true">https://dl.acm.org/citation.cfm?id=2833161</a>

[3] <a class="x_moz-txt-link-freetext" href="https://dl.acm.org/citation.cfm?id=3018870" moz-do-not-send="true">https://dl.acm.org/citation.cfm?id=3018870</a>

[4] <a class="x_moz-txt-link-freetext" href="https://dl.acm.org/citation.cfm?id=3148189" moz-do-not-send="true">https://dl.acm.org/citation.cfm?id=3148189</a>

-- 

Johannes Doerfert

Researcher

Argonne National Laboratory

Lemont, IL 60439, USA

<a class="x_moz-txt-link-abbreviated" href="mailto:jdoerfert@anl.gov" moz-do-not-send="true">jdoerfert@anl.gov</a>

</pre>

          </blockquote>

          <pre class="x_moz-quote-pre">

</pre>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>