[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Doerfert, Johannes via llvm-dev
llvm-dev at lists.llvm.org
Mon Mar 25 10:20:01 PDT 2019
As the email thread has by now grown long and hard to follow, I wanted to
start a new branch explaining the patches that are up for review, how I
think we should proceed from here, and some feature/problem discussion.
The patches:
-----------
Currently, the initial implementation* is split across the following
seven patches shown in "dependence order":
[OpenMP][Helper] https://reviews.llvm.org/D59424
[OpenMP][Runtime] https://reviews.llvm.org/D59319
[Clang][Helper] https://reviews.llvm.org/D59418
[Clang][Helper] https://reviews.llvm.org/D59420
[Clang][Helper] https://reviews.llvm.org/D59421
[Clang][Codegen] https://reviews.llvm.org/D59328
[LLVM][Optimization] https://reviews.llvm.org/D59331
* The original, now abandoned, aggregate patch can be found here:
https://reviews.llvm.org/D57460
Next steps:
----------
I kindly ask interested parties to post questions, comments, and
reviews. It would also be good if people could look into alternative
target device library implementations, e.g., for AMD GPUs or other
non-NVIDIA targets. This would help to test the "hardware agnostic"
hypothesis.
(Missing) features & known problems:
-----------------------------------
The initial implementation discussed above includes the following
functionalities:
- Generate valid LLVM-IR for "omp target" with enclosed, potentially
nested, "omp parallel" pragmas.
- Translate non-SPMD mode regions to SPMD mode regions if that is
valid without code changes.
- Create a customized state machine for non-SPMD mode regions.
Customization for now means that all statically visible enclosed
parallel regions are checked as part of an if-cascade and called
directly before a potential fallback indirect call is reached (see
the sketch below).
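For illustration, here is a rough C sketch of such a customized state
machine; the helper and outlined-function names are made up for this
example and do not correspond to the actual runtime interface:

  /* Sketch of a customized worker state machine (illustrative names,
     not the actual libomptarget interface). */
  typedef void (*WorkFnTy)(void);

  /* Outlined parallel regions that are statically visible. */
  static void omp_outlined_region_0(void) { /* ... */ }
  static void omp_outlined_region_1(void) { /* ... */ }

  /* Placeholder hooks: block until the master publishes work; return 0
     once the kernel has finished. */
  extern int work_available(WorkFnTy *WorkFn);
  extern void work_done(void);

  static void worker_state_machine(void) {
    WorkFnTy WorkFn;
    while (work_available(&WorkFn)) {
      /* If-cascade over the visible parallel regions ... */
      if (WorkFn == omp_outlined_region_0)
        omp_outlined_region_0();
      else if (WorkFn == omp_outlined_region_1)
        omp_outlined_region_1();
      else
        WorkFn(); /* ... with the indirect-call fallback. */
      work_done();
    }
  }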
Missing features and known problems:
- Reductions are not supported yet. My plan is to use the ideas
presented by Garcia De Gonzalo et al. [1] at CGO'19 in the
runtime and let clang emit some kind of
"__kmpc_XXXX_reduction_begin(kind, loc)"
"__kmpc_XXXX_reduction_end(kind, loc)"
calls at the beginning and end of the kernel (a sketch follows this
list). The runtime or an LLVM optimization should then decide on the
reduction strategy.
- Critical regions are not supported yet. The NVPTX codegen approach
is probably fine; we just need to port it.
- When changing non-SPMD mode kernels to SPMD mode kernels we might need
to change the scheduling decisions for loops. As a consequence, we
might want to add a level of abstraction for loop schedules as well to
make that change simple.
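To make the reduction idea a bit more concrete, here is a sketch of the
structure clang could emit; the call names merely mirror the
"__kmpc_XXXX_reduction_begin/end" placeholders above and are not an
existing interface:

  /* Hypothetical bracketing calls around a kernel with a reduction.
     The runtime (or a later LLVM optimization) picks the actual
     strategy, e.g., warp-level shuffles vs. atomics. */
  typedef struct ident ident_t; /* opaque source location descriptor */

  extern void __kmpc_xxxx_reduction_begin(int kind, ident_t *loc);
  extern void __kmpc_xxxx_reduction_end(int kind, ident_t *loc);

  void target_kernel(ident_t *loc, int n, const double *in, double *sum) {
    __kmpc_xxxx_reduction_begin(/*kind=*/0, loc);
    for (int i = 0; i < n; ++i)
      *sum += in[i]; /* user code contributing to the reduction */
    __kmpc_xxxx_reduction_end(/*kind=*/0, loc);
  }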
Thanks,
Johannes
[1] Simon Garcia De Gonzalo, Sitao Huang (University of Illinois at
Urbana–Champaign), Juan Gomez-Luna (ETH Zurich), Simon Hammond (Sandia
National Laboratories), Onur Mutlu (ETH Zurich), and Wen-mei Hwu
(University of Illinois at Urbana–Champaign). "Automatic Generation of
Warp-Level Primitives and Atomic Instructions for Fast and Portable
Parallel Reduction on GPUs." CGO 2019.
On 01/22, Doerfert, Johannes Rudolf via llvm-dev wrote:
> Where we are
> ------------
>
> Currently, when we generate OpenMP target offloading code for GPUs, we
> use sufficient syntactic criteria to decide between two execution modes:
> 1) SPMD -- All target threads (in an OpenMP team) run all the code.
> 2) "Guarded" -- The master thread (of an OpenMP team) runs the user
> code. If an OpenMP distribute region is encountered, thus
> if all threads (in the OpenMP team) are supposed to
> execute the region, the master wakes up the idling
> worker threads and points them to the correct piece of
> code for distributed execution.
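> As a simplified illustration (the classification comments reflect the
> criteria above, not a guaranteed compiler decision):
>
>   #define N 1024
>   double A[N], B[N], C[N];
>
>   void spmd_friendly(void) {
>     /* Likely SPMD: every thread runs the same loop body and no
>        sequential code precedes the parallel work. */
>   #pragma omp target teams distribute parallel for map(to: B, C) map(from: A)
>     for (int i = 0; i < N; ++i)
>       A[i] = B[i] + C[i];
>   }
>
>   void needs_guarding(void) {
>     /* Likely "guarded": the statement before "omp parallel" must run
>        on the master thread only; the workers idle until then. */
>   #pragma omp target map(to: B) map(from: A)
>     {
>       double scale = B[0];
>   #pragma omp parallel for
>       for (int i = 0; i < N; ++i)
>         A[i] = scale * B[i];
>     }
>   }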
>
> For a variety of reasons we (generally) prefer the first execution mode.
> However, depending on the code, that might not be valid, or we might
> simply not be able to tell during the Clang code generation phase.
>
> The implementation of the "guarded" execution mode roughly follows the
> state machine description in [1], though the implementation has since
> become more general.
>
>
> What we want
> ------------
>
> Increase the amount of code executed in SPMD mode and the use of
> lightweight "guarding" schemes where appropriate.
>
>
> How we (could) get there
> ------------------------
>
> We propose the following two modifications in order:
>
> 1) Move the state machine logic into the OpenMP runtime library. That
> means in SPMD mode all device threads will start the execution of
> the user code, i.e., emerge from the runtime, while in guarded mode
> only the master will escape the runtime and the other threads will
> idle in their state machine code, which is now just "hidden".
>
> Why:
> - The state machine code cannot be (reasonably) optimized anyway;
> moving it into the library shouldn't hurt runtime performance but
> might even improve compile time a little bit.
> - The change should also simplify the Clang code generation, as we
> would generate structurally the same code for both execution modes;
> only the runtime library calls, or their arguments, would differ
> between them (see the sketch after this list).
> - The reason we should not "just start in SPMD mode" and "repair"
> it later is simple: this way we always have semantically correct
> and executable code.
> - Finally, and most importantly, there is now only a small
> difference (see above) between the two modes in the code
> generated by clang. If we later analyze the code to decide
> whether we can use SPMD mode instead of guarded mode, the
> analysis and transformation become much simpler.
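>
> To illustrate the intended structure (the function names below are
> placeholders for such an interface, not existing entry points), the
> kernel skeleton could look the same for both modes:
>
>   /* Placeholder runtime entry points.  In SPMD mode every thread
>      returns from the init call and runs the user code; in guarded
>      mode the workers stay inside the runtime's state machine and
>      only the master returns a non-zero value. */
>   extern int  __kmpc_target_region_kernel_init(int UseSPMDMode);
>   extern void __kmpc_target_region_kernel_deinit(int UseSPMDMode);
>
>   void kernel(int UseSPMDMode) {
>     if (!__kmpc_target_region_kernel_init(UseSPMDMode))
>       return; /* guarded-mode workers exit here when the kernel ends */
>     /* ... user code, structurally identical in both modes ... */
>     __kmpc_target_region_kernel_deinit(UseSPMDMode);
>   }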
>
> 2) Implement a middle-end LLVM-IR pass that detects guarded mode,
> e.g., through the runtime library calls used, and that tries to
> convert it into SPMD mode, potentially introducing lightweight
> guards in the process.
>
> Why:
> - After the inliner and the canonicalization passes, we have a clearer
> picture of the code that is actually executed in the target
> region and of all the side effects it contains. Thus, we can make an
> educated decision on the amount of guarding required to prevent
> unwanted side effects from happening after a move to SPMD mode.
> - At this point we can more easily introduce different schemes to
> avoid side effects by threads that were not supposed to run. We
> can decide whether a state machine is needed, conditionals should be
> employed, masked instructions are appropriate, or "dummy" local
> storage can be used to hide the side effects from the outside
> world (see the sketch below).
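>
> For example, a simple conditional guard plus broadcast for a
> master-only side effect might look as follows (a hand-written sketch
> with illustrative helper names, not what the pass emits today):
>
>   /* After a move to SPMD mode, all threads execute the region, but
>      the side effect is performed by the master only and the value is
>      made visible to everyone via team-shared storage and a barrier. */
>   extern int  gpu_thread_id_in_team(void);  /* illustrative helper */
>   extern void gpu_team_barrier(void);       /* illustrative helper */
>
>   static double team_shared_scale;          /* team-shared storage */
>
>   static double guarded_read(const double *B) {
>     if (gpu_thread_id_in_team() == 0)
>       team_shared_scale = B[0];   /* side effect, master only */
>     gpu_team_barrier();           /* everyone waits for the value */
>     return team_shared_scale;
>   }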
>
>
> None of this has been implemented yet, but we plan to start in the
> immediate future. Any comments, ideas, and criticism are welcome!
>
>
> Cheers,
> Johannes
>
>
> P.S. [2-4] provide further information on the implementation and features.
>
> [1] https://ieeexplore.ieee.org/document/7069297
> [2] https://dl.acm.org/citation.cfm?id=2833161
> [3] https://dl.acm.org/citation.cfm?id=3018870
> [4] https://dl.acm.org/citation.cfm?id=3148189
>
>
--
Johannes Doerfert
Researcher
Argonne National Laboratory
Lemont, IL 60439, USA
jdoerfert at anl.gov