[cfe-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

Wed Mar 13 12:15:39 PDT 2019

There are tooooooo(!) many changes, I don't who's going to review sooooo
big patch. You definitely need to split it into several smaller patches.
Also, I don't like the idea adding of one more class for NVPTX codegen.
All your changes should be on top of the eixisting solution.

-------------
Best regards,
Alexey Bataev

13.03.2019 15:08, Doerfert, Johannes пишет:
> Please consider reviewing the code for the proposed approach here:
>   https://reviews.llvm.org/D57460
>
> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the
> rodinia 3.1 benchmark suite, showed 30% improvement after SPMD mode was
> enabled automatically. The code in nw is conceptually equivalent to the
> first example in the "to_SPMD_mode.ll" test case that can be found here:
>   https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
>
> The implementation is missing key features but one should be able to see
> the overall design by now. Once accepted, the missing features and more
> optimizations will be added.
>
>
> On 01/22, Johannes Doerfert wrote:
>> Where we are
>> ------------
>>
>> Currently, when we generate OpenMP target offloading code for GPUs, we
>> use sufficient syntactic criteria to decide between two execution modes:
>>   1)      SPMD -- All target threads (in an OpenMP team) run all the code.
>>   2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>                   code. If an OpenMP distribute region is encountered, thus
>>                   if all threads (in the OpenMP team) are supposed to
>>                   execute the region, the master wakes up the idling
>>                   worker threads and points them to the correct piece of
>>                   code for distributed execution.
>>
>> For a variety of reasons we (generally) prefer the first execution mode.
>> However, depending on the code, that might not be valid, or we might
>> just not know if it is in the Clang code generation phase.
>>
>> The implementation of the "guarded" execution mode follows roughly the
>> state machine description in [1], though the implementation is different
>> (more general) nowadays.
>>
>>
>> What we want
>> ------------
>>
>> Increase the amount of code executed in SPMD mode and the use of
>> lightweight "guarding" schemes where appropriate.
>>
>>
>> How we get (could) there
>> ------------------------
>>
>> We propose the following two modifications in order:
>>
>>   1) Move the state machine logic into the OpenMP runtime library. That
>>      means in SPMD mode all device threads will start the execution of
>>      the user code, thus emerge from the runtime, while in guarded mode
>>      only the master will escape the runtime and the other threads will
>>      idle in their state machine code that is now just "hidden".
>>
>>      Why:
>>      - The state machine code cannot be (reasonably) optimized anyway,
>>        moving it into the library shouldn't hurt runtime but might even
>>        improve compile time a little bit.
>>      - The change should also simplify the Clang code generation as we
>>        would generate structurally the same code for both execution modes
>>        but only the runtime library calls, or their arguments, would
>>        differ between them.
>>      - The reason we should not "just start in SPMD mode" and "repair"
>>        it later is simple, this way we always have semantically correct
>>        and executable code.
>>      - Finally, and most importantly, there is now only little
>>        difference (see above) between the two modes in the code
>>        generated by clang. If we later analyze the code trying to decide
>>        if we can use SPMD mode instead of guarded mode the analysis and
>>        transformation becomes much simpler.
>>
>>  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>>     e.g., through the runtime library calls used, and that tries to
>>     convert it into the SPMD mode potentially by introducing lightweight
>>     guards in the process.
>>
>>     Why:
>>     - After the inliner, and the canonicalizations, we have a clearer
>>       picture of the code that is actually executed in the target
>>       region and all the side effects it contains. Thus, we can make an
>>       educated decision on the required amount of guards that prevent
>>       unwanted side effects from happening after a move to SPMD mode.
>>     - At this point we can more easily introduce different schemes to
>>       avoid side effects by threads that were not supposed to run. We
>>       can decide if a state machine is needed, conditionals should be
>>       employed, masked instructions are appropriate, or "dummy" local
>>       storage can be used to hide the side effect from the outside
>>       world.
>>
>>
>> None of this was implemented yet but we plan to start in the immediate
>> future. Any comments, ideas, criticism is welcome!
>>
>>
>> Cheers,
>>   Johannes
>>
>>
>> P.S. [2-4] Provide further information on implementation and features.
>>
>> [1] https://ieeexplore.ieee.org/document/7069297
>> [2] https://dl.acm.org/citation.cfm?id=2833161
>> [3] https://dl.acm.org/citation.cfm?id=3018870
>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>
>>
>> -- 
>>
>> Johannes Doerfert
>> Researcher
>>
>> Argonne National Laboratory
>> Lemont, IL 60439, USA
>>
>> jdoerfert at anl.gov
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190313/c7b21e92/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190313/c7b21e92/attachment.sig>