[Openmp-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

Wed Mar 13 12:35:59 PDT 2019

Hi Alexey,

thank you for your quick feedback.

> There are tooooooo(!) many changes, I don't know who's going to review sooooo big a patch.

I can certainly split it into the three components/repositories that are touched: clang, llvm, and openmp.
I feared it would then be harder to navigate the code and see the connection points.
I am a bit surprised by the hyperbole, though, given that the complexity is not that high
due to the absence of modified or removed lines. Anyway, you seem to have very strong feelings about
this, so I am open to suggestions on how to split it up.

> Also, I don't like the idea of adding one more class for NVPTX codegen. All your changes should be on top of the existing solution.

Could you please explain to me why? This will only make everything more complicated and entangled.
Also, the new class is supposed to be "target agnostic" so a new offloading target, e.g., AMD GPUs, could easily reuse
the new code while the old code is sprinkled with NVPTX specific details, e.g., function calls, constants, etc.

Thanks again,
  Johannes
________________________________
From: Alexey Bataev <a.bataev at outlook.com>
Sent: Wednesday, March 13, 2019 2:15:39 PM
To: Doerfert, Johannes; cfe-dev at lists.llvm.org
Cc: openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey Bataev; Arpith Chacko Jacob
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

There are tooooooo(!) many changes, I don't know who's going to review sooooo big a patch. You definitely need to split it into several smaller patches. Also, I don't like the idea of adding one more class for NVPTX codegen. All your changes should be on top of the existing solution.

-------------
Best regards,
Alexey Bataev

13.03.2019 15:08, Doerfert, Johannes wrote:

Please consider reviewing the code for the proposed approach here:
  https://reviews.llvm.org/D57460

Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the
rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode was
enabled automatically. The code in nw is conceptually equivalent to the
first example in the "to_SPMD_mode.ll" test case that can be found here:
  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid

The implementation is missing key features but one should be able to see
the overall design by now. Once accepted, the missing features and more
optimizations will be added.

On 01/22, Johannes Doerfert wrote:

Where we are
------------

Currently, when we generate OpenMP target offloading code for GPUs, we
use sufficient syntactic criteria to decide between two execution modes:
  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
                  code. If an OpenMP distribute region is encountered, i.e.,
                  if all threads (in the OpenMP team) are supposed to
                  execute the region, the master wakes up the idling
                  worker threads and points them to the correct piece of
                  code for distributed execution.

For a variety of reasons we (generally) prefer the first execution mode.
However, depending on the code, SPMD mode might not be valid, or we might
simply not know whether it is valid during the Clang code generation phase.
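
To illustrate the distinction, here is a minimal sketch (not taken from
the patch) of two source patterns that typically end up in different
modes. The function names scale_spmd, scale_guarded, and
check_modes_agree are made up for this example, and which mode a given
compiler version actually selects is an assumption here. If OpenMP is
disabled the pragmas are ignored and both functions run sequentially
with the same result.

```cpp
#include <cassert>
#include <vector>

// Combined construct: there is no sequential code inside the target region,
// so simple syntactic criteria are enough to pick SPMD mode (assumption:
// the compiler in use applies such criteria to this pattern).
std::vector<double> scale_spmd(std::vector<double> v, double f) {
  double *p = v.data();
  const int n = static_cast<int>(v.size());
#pragma omp target teams distribute parallel for map(tofrom : p[0:n])
  for (int i = 0; i < n; ++i)
    p[i] *= f;
  return v;
}

// Sequential code precedes the parallel loop inside the target region.
// Only the master thread should run that part, so without further analysis
// the "guarded" mode is the safe choice.
std::vector<double> scale_guarded(std::vector<double> v, double f) {
  double *p = v.data();
  const int n = static_cast<int>(v.size());
#pragma omp target map(tofrom : p[0:n])
  {
    const double factor = f; // sequential part of the region
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
      p[i] *= factor;
  }
  return v;
}

// Usage: both variants must compute the same values.
void check_modes_agree() {
  const std::vector<double> in = {1.0, 2.0, 3.0};
  assert(scale_spmd(in, 2.0) == scale_guarded(in, 2.0));
}
```

The second function is the kind of code the proposal below aims at: it
is correct in guarded mode, but nothing in it prevents SPMD execution
once the sequential part is known to be harmless.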

The implementation of the "guarded" execution mode follows roughly the
state machine description in [1], though the implementation is different
(more general) nowadays.

What we want
------------

Increase the amount of code executed in SPMD mode and the use of
lightweight "guarding" schemes where appropriate.

How we (could) get there
------------------------

We propose the following two modifications in order:

  1) Move the state machine logic into the OpenMP runtime library. That
     means in SPMD mode all device threads will start the execution of
     the user code, thus emerge from the runtime, while in guarded mode
     only the master will escape the runtime and the other threads will
     idle in their state machine code that is now just "hidden".

     Why:
     - The state machine code cannot be (reasonably) optimized anyway,
       so moving it into the library shouldn't hurt runtime and might
       even improve compile time a little bit.
     - The change should also simplify the Clang code generation as we
       would generate structurally the same code for both execution modes
       but only the runtime library calls, or their arguments, would
       differ between them.
     - The reason we should not "just start in SPMD mode" and "repair"
       it later is simple: this way we always have semantically correct
       and executable code.
     - Finally, and most importantly, there is now only a small
       difference (see above) between the two modes in the code
       generated by clang. If we later analyze the code trying to decide
       if we can use SPMD mode instead of guarded mode, the analysis and
       transformation become much simpler.

  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
     e.g., through the runtime library calls used, and that tries to
     convert it into the SPMD mode potentially by introducing lightweight
     guards in the process.

     Why:
     - After the inliner, and the canonicalizations, we have a clearer
       picture of the code that is actually executed in the target
       region and all the side effects it contains. Thus, we can make an
       educated decision on the required amount of guards that prevent
       unwanted side effects from happening after a move to SPMD mode.
     - At this point we can more easily introduce different schemes to
       avoid side effects by threads that were not supposed to run. We
       can decide if a state machine is needed, conditionals should be
       employed, masked instructions are appropriate, or "dummy" local
       storage can be used to hide the side effect from the outside
       world.
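
As a rough source-level analogy for the "conditionals" scheme in 2), the
sketch below shows what a region could look like after conversion: all
threads are active from the start and the former master-only side effect
sits behind a lightweight guard followed by a barrier. This is only an
illustration under stated assumptions: the proposed pass would work on
LLVM-IR, not on source code, the function name sum_after_init is
hypothetical, and a host "parallel" region is used instead of a target
region to keep the example runnable anywhere. If OpenMP is disabled the
pragmas are ignored and the code runs with a single thread.

```cpp
#include <cassert>
#include <vector>

// All threads enter the region together (SPMD style). The write to p[0] is
// a side effect that must happen exactly once, so it is guarded; the
// barrier makes the write visible before any thread starts the loop.
int sum_after_init(std::vector<int> &data, int init) {
  assert(!data.empty()); // the guarded write below needs at least one element
  int *p = data.data();
  const int n = static_cast<int>(data.size());
  int sum = 0;
#pragma omp parallel
  {
#pragma omp master
    {
      p[0] = init; // side effect, executed by one thread only
    }
#pragma omp barrier
#pragma omp for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
      sum += p[i];
  }
  return sum;
}
```

For example, calling sum_after_init on {5, 1, 2} with init = 10 first
overwrites the leading element and then sums 10 + 1 + 2. Whether such a
guard, a state machine, masked instructions, or "dummy" storage is the
better choice for a given side effect is exactly the decision the pass
would have to make.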

None of this has been implemented yet, but we plan to start in the
immediate future. Any comments, ideas, or criticism are welcome!

Cheers,
  Johannes

P.S. [2-4] provide further information on implementation and features.

[1] https://ieeexplore.ieee.org/document/7069297
[2] https://dl.acm.org/citation.cfm?id=2833161
[3] https://dl.acm.org/citation.cfm?id=3018870
[4] https://dl.acm.org/citation.cfm?id=3148189

--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov
