[llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Alexey Bataev via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 13 12:15:39 PDT 2019
There are far too many(!) changes; I don't know who is going to review such a
big patch. You definitely need to split it into several smaller patches.
Also, I don't like the idea of adding one more class for NVPTX codegen.
All your changes should be on top of the existing solution.
13.03.2019 15:08, Doerfert, Johannes wrote:
> Please consider reviewing the code for the proposed approach here:
> Initial tests, e.g., on the nw (Needleman-Wunsch) benchmark in the
> Rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode was
> enabled automatically. The code in nw is conceptually equivalent to the
> first example in the "to_SPMD_mode.ll" test case that can be found here:
> The implementation is missing key features but one should be able to see
> the overall design by now. Once accepted, the missing features and more
> optimizations will be added.
> On 01/22, Johannes Doerfert wrote:
>> Where we are
>> Currently, when we generate OpenMP target offloading code for GPUs, we
>> use sufficient syntactic criteria to decide between two execution modes:
>> 1) SPMD -- All target threads (in an OpenMP team) run all the code.
>> 2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>> code. If an OpenMP distribute region is encountered, i.e.,
>> if all threads (in the OpenMP team) are supposed to
>> execute the region, the master wakes up the idling
>> worker threads and points them to the correct piece of
>> code for distributed execution.
>> For a variety of reasons we (generally) prefer the first execution mode.
>> However, depending on the code, that might not be valid, or we might
>> just not know if it is in the Clang code generation phase.
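To make the two modes concrete, here is a minimal sketch of two target regions. It assumes the usual syntactic rule of thumb: a combined construct with nothing but the loop lets every thread start in the loop (SPMD), while sequential code ahead of a nested parallel region forces the master-only ("guarded") scheme. The function names `scale_spmd` and `scale_guarded` are hypothetical and not taken from the patch. Compiled without OpenMP support the pragmas are ignored and both functions simply run serially on the host.

```cpp
#include <cstddef>
#include <vector>

// Combined construct: every thread of every team starts in the loop, so
// syntactic criteria alone are enough to choose SPMD mode.
void scale_spmd(std::vector<double> &v, double f) {
  double *p = v.data();
  const std::size_t n = v.size();
#pragma omp target teams distribute parallel for map(tofrom : p[0:n])
  for (std::size_t i = 0; i < n; ++i)
    p[i] *= f;
}

// Sequential code precedes the parallel loop inside the target region.
// That code must run exactly once, so this shape is expected to fall back
// to the guarded mode (an assumption of this sketch, not a guarantee).
void scale_guarded(std::vector<double> &v, double f, double *first_before) {
  double *p = v.data();
  const std::size_t n = v.size();
  double first = 0.0;
#pragma omp target map(tofrom : p[0:n]) map(from : first)
  {
    first = (n > 0) ? p[0] : 0.0; // master-only side effect
#pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
      p[i] *= f;
  }
  *first_before = first;
}
```

Both functions compute the same scaling; the only difference that matters here is the single sequential statement in the second one, which is the kind of code a later pass could guard instead of keeping all workers idle.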
>> The implementation of the "guarded" execution mode follows roughly the
>> state machine description in [1], though the implementation is different
>> (more general) nowadays.
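For readers unfamiliar with the guarded scheme, the following is a rough host-side simulation of the master/worker state machine described above. It is a sketch under stated assumptions, not the code Clang or the device runtime actually emits: `TeamState`, `worker_step`, `master_run_parallel`, `mark_slot`, and `run_guarded_kernel` are all hypothetical names, and the barrier a GPU would use is replaced by the master driving each worker step sequentially.

```cpp
#include <vector>

// Hypothetical signature for an outlined parallel region.
using WorkFn = void (*)(int tid, std::vector<int> &data);

// Hypothetical per-team state shared between the master and the workers.
struct TeamState {
  WorkFn work = nullptr;  // region the workers should run next, if any
  bool terminate = false; // set by the master once the kernel is finished
};

// One iteration of a worker's state machine. Returns false once the worker
// should leave its loop. On a GPU the worker would wait at a barrier before
// reading the state; here the caller drives the steps one after another.
bool worker_step(int tid, TeamState &team, std::vector<int> &data) {
  if (team.terminate)
    return false;
  if (team.work)
    team.work(tid, data);
  return true;
}

// Master side: publish a region, let every worker take one step, retire it.
void master_run_parallel(TeamState &team, WorkFn fn, int num_workers,
                         std::vector<int> &data) {
  team.work = fn;
  for (int tid = 1; tid <= num_workers; ++tid)
    worker_step(tid, team, data);
  team.work = nullptr;
}

// Example region: each worker marks its own slot.
void mark_slot(int tid, std::vector<int> &data) { data[tid] = 1; }

// Master-only "user code" with a single parallel region in the middle.
std::vector<int> run_guarded_kernel(int num_workers) {
  std::vector<int> data(num_workers + 1, 0);
  TeamState team;
  data[0] = 42; // sequential part, executed by the master only
  master_run_parallel(team, mark_slot, num_workers, data);
  team.terminate = true; // workers would exit on their next step
  return data;
}
```

The point of the simulation is only the control structure: workers never touch the sequential part, and they learn what to run through shared state set by the master.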
>> What we want
>> Increase the amount of code executed in SPMD mode and the use of
>> lightweight "guarding" schemes where appropriate.
>> How we (could) get there
>> We propose the following two modifications in order:
>> 1) Move the state machine logic into the OpenMP runtime library. That
>> means in SPMD mode all device threads will start the execution of
>> the user code, thus emerge from the runtime, while in guarded mode
>> only the master will escape the runtime and the other threads will
>> idle in their state machine code that is now just "hidden".
>> - The state machine code cannot be (reasonably) optimized anyway, so
>> moving it into the library shouldn't hurt runtime but might even
>> improve compile time a little bit.
>> - The change should also simplify the Clang code generation as we
>> would generate structurally the same code for both execution modes
>> but only the runtime library calls, or their arguments, would
>> differ between them.
>> - The reason we should not "just start in SPMD mode" and "repair"
>> it later is simple: this way we always have semantically correct
>> and executable code.
>> - Finally, and most importantly, there is now only little
>> difference (see above) between the two modes in the code
>> generated by clang. If we later analyze the code trying to decide
>> if we can use SPMD mode instead of guarded mode the analysis and
>> transformation becomes much simpler.
>> 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>> e.g., through the runtime library calls used, and that tries to
>> convert it into the SPMD mode potentially by introducing lightweight
>> guards in the process.
>> - After the inliner, and the canonicalizations, we have a clearer
>> picture of the code that is actually executed in the target
>> region and all the side effects it contains. Thus, we can make an
>> educated decision on the required amount of guards that prevent
>> unwanted side effects from happening after a move to SPMD mode.
>> - At this point we can more easily introduce different schemes to
>> avoid side effects by threads that were not supposed to run. We
>> can decide if a state machine is needed, conditionals should be
>> employed, masked instructions are appropriate, or "dummy" local
>> storage can be used to hide the side effect from the outside.
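One of the lightweight schemes listed above, a plain conditional guard, can be sketched as follows. This is a host-side illustration only, assuming the region body is executed once per thread with a thread index `tid`; `SharedState`, `region_spmd`, and `run_team` are hypothetical names, and a real device version would additionally need synchronization between the guarded and unguarded parts.

```cpp
#include <vector>

// Hypothetical stand-in for state that is visible outside the region.
struct SharedState {
  int counter = 0;      // side effect that must happen exactly once
  std::vector<int> out; // one slot per thread
};

// Region body as every thread would execute it after a move to SPMD mode.
void region_spmd(int tid, SharedState &s) {
  if (tid == 0)  // lightweight guard: only the former master performs it
    ++s.counter;
  s.out[tid] = tid * 2; // work intended for all threads stays unguarded
}

// Host-side simulation: run the body once per thread, sequentially.
SharedState run_team(int num_threads) {
  SharedState s;
  s.out.assign(num_threads, 0);
  for (int tid = 0; tid < num_threads; ++tid)
    region_spmd(tid, s);
  return s;
}
```

With four simulated threads the counter is incremented once while every thread still writes its own slot, which is the behavior the guards are meant to preserve.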
>> None of this has been implemented yet, but we plan to start in the immediate
>> future. Any comments, ideas, or criticism are welcome!
>> P.S. [2-4] provide further information on implementation and features.
>> [1] https://ieeexplore.ieee.org/document/7069297
>> [2] https://dl.acm.org/citation.cfm?id=2833161
>> [3] https://dl.acm.org/citation.cfm?id=3018870
>> [4] https://dl.acm.org/citation.cfm?id=3148189
>> Johannes Doerfert
>> Argonne National Laboratory
>> Lemont, IL 60439, USA
>> jdoerfert at anl.gov