[cfe-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Alexey Bataev via cfe-dev
cfe-dev at lists.llvm.org
Tue Jan 22 10:52:42 PST 2019
The globalization of local variables, for example. It must be
implemented in the compiler, not in the runtime, to get good performance.
-------------
Best regards,
Alexey Bataev
22.01.2019 13:43, Doerfert, Johannes Rudolf wrote:
> Could you elaborate on what you are referring to with respect to data
> sharing? What do we currently do in the Clang code generation that we
> could not implement effectively in the runtime, potentially with
> support from an LLVM pass?
>
> Thanks,
> James
>
> ------------------------------------------------------------------------
> *From:* Alexey Bataev <a.bataev at outlook.com>
> *Sent:* Tuesday, January 22, 2019 12:34:01 PM
> *To:* Doerfert, Johannes Rudolf; cfe-dev at lists.llvm.org
> *Cc:* openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey
> Bataev; Arpith Chacko Jacob
> *Subject:* Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
>
>
> -------------
> Best regards,
> Alexey Bataev
> 22.01.2019 13:17, Doerfert, Johannes Rudolf wrote:
>> Where we are
>> ------------
>>
>> Currently, when we generate OpenMP target offloading code for GPUs, we
>> use sufficient syntactic criteria to decide between two execution modes:
>> 1) SPMD -- All target threads (in an OpenMP team) run all the code.
>> 2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>    code. If an OpenMP distribute region is encountered, that is, if
>>    all threads (in the OpenMP team) are supposed to execute the
>>    region, the master wakes up the idling worker threads and points
>>    them to the correct piece of code for distributed execution.
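
To make the two modes concrete, here is a minimal sequential model in C++. It is not taken from Clang or the OpenMP runtime; `Mode`, `Trace`, and `run_team` are made-up names, and the loop over thread ids only stands in for threads that would really run concurrently on the device. It merely counts which threads execute which part of a target region:

```cpp
#include <cassert>
#include <vector>

// Hypothetical names for illustration only; not part of Clang or libomptarget.
enum class Mode { SPMD, Guarded };

struct Trace {
  std::vector<int> sequential_runs; // times each thread ran sequential user code
  std::vector<int> parallel_runs;   // times each thread ran the parallel region
};

// Models one team executing: <sequential user code>; <one parallel region>.
Trace run_team(Mode mode, int num_threads) {
  assert(num_threads > 0);
  Trace t{std::vector<int>(num_threads, 0), std::vector<int>(num_threads, 0)};
  for (int tid = 0; tid < num_threads; ++tid) {
    const bool is_master = (tid == 0);
    // SPMD: every thread runs all the code.
    // Guarded: only the master runs the sequential part; workers idle.
    if (mode == Mode::SPMD || is_master)
      ++t.sequential_runs[tid];
    // In both modes all threads of the team execute the region itself;
    // in guarded mode the workers are woken up by the master to do so.
    ++t.parallel_runs[tid];
  }
  return t;
}
```

In this model, `run_team(Mode::Guarded, 4)` leaves `sequential_runs` at zero for threads 1 to 3, while `run_team(Mode::SPMD, 4)` has every thread execute the sequential part as well. That redundant execution is exactly why SPMD mode is only valid when the sequential part has no unwanted side effects.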
>>
>> For a variety of reasons we (generally) prefer the first execution mode.
>> However, depending on the code, that might not be valid, or we might
>> simply not be able to tell during the Clang code generation phase.
>>
>> The implementation of the "guarded" execution mode follows roughly the
>> state machine description in [1], though the implementation is different
>> (more general) nowadays.
>>
>>
>> What we want
>> ------------
>>
>> Increase the amount of code executed in SPMD mode and the use of
>> lightweight "guarding" schemes where appropriate.
>>
>>
>> How we (could) get there
>> ------------------------
>>
>> We propose the following two modifications in order:
>>
>> 1) Move the state machine logic into the OpenMP runtime library. That
>> means in SPMD mode all device threads will start the execution of
>> the user code, thus emerge from the runtime, while in guarded mode
>> only the master will escape the runtime and the other threads will
>> idle in their state machine code that is now just "hidden".
>>
>> Why:
>> - The state machine code cannot be (reasonably) optimized anyway, so
>> moving it into the library should not hurt run time and might even
>> improve compile time a little bit.
>> - The change should also simplify the Clang code generation as we
>> would generate structurally the same code for both execution modes
>> but only the runtime library calls, or their arguments, would
>> differ between them.
>> - The reason we should not "just start in SPMD mode" and "repair"
>> it later is simple: this way we always have semantically correct
>> and executable code.
>> - Finally, and most importantly, there is now only a small
>> difference (see above) between the two modes in the code
>> generated by Clang. If we later analyze the code to decide whether
>> we can use SPMD mode instead of guarded mode, the analysis and
>> transformation become much simpler.
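
A rough sketch of what step 1 could look like, written as a sequential C++ model. Everything here is an assumption made for illustration: `target_init`, `ExecMode`, `RegionFn`, and the `published` list are hypothetical and do not correspond to existing runtime entry points. The `published` vector stands in for the regions the master would announce over time, with `nullptr` as the termination signal:

```cpp
#include <cassert>
#include <vector>

// Hypothetical names; this is a model, not the actual runtime interface.
enum class ExecMode { SPMD, Guarded };
using RegionFn = void (*)(int tid, std::vector<int> &log);

// Returns true if the calling thread should continue into user code.
// In guarded mode the workers never return true: they stay inside this
// "runtime" function and run the state machine that is hidden in it.
bool target_init(ExecMode mode, int tid,
                 const std::vector<RegionFn> &published,
                 std::vector<int> &log) {
  if (mode == ExecMode::SPMD)
    return true;             // all threads emerge from the runtime
  if (tid == 0)
    return true;             // only the master escapes the runtime
  for (RegionFn fn : published) { // each iteration models one wake-up
    if (fn == nullptr)
      break;                 // termination signal from the master
    fn(tid, log);            // execute the region the master pointed to
  }
  return false;              // worker is done, skip the user code
}

// Example region body: records which thread executed it.
void record_tid(int tid, std::vector<int> &log) { log.push_back(tid); }
```

The point of the model is that the caller looks the same in both modes (call the init function, run user code if it returns true), and only the mode argument differs. On a real device the loop would wait at barriers instead of iterating over a prepared list.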
>
> The last item is wrong, unfortunately. A lot of things in the codegen
> depend on the execution mode, e.g. correct support of data sharing.
> Of course, we can try to generalize the codegen and rely completely
> on the runtime, but the performance is going to be very poor.
>
> We still need static analysis in the compiler. I agree that it is
> better to move this analysis to the backend, at least after
> inlining, but at the moment that is not possible. We need support
> for late outlining, which will allow us to implement better detection
> of the SPMD constructs and improve performance.
>
>> 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>> e.g., through the runtime library calls used, and that tries to
>> convert it into the SPMD mode potentially by introducing lightweight
>> guards in the process.
>>
>> Why:
>> - After the inliner, and the canonicalizations, we have a clearer
>> picture of the code that is actually executed in the target
>> region and all the side effects it contains. Thus, we can make an
>> educated decision on the required number of guards that prevent
>> unwanted side effects from happening after a move to SPMD mode.
>> - At this point we can more easily introduce different schemes to
>> avoid side effects by threads that were not supposed to run. We
>> can decide if a state machine is needed, conditionals should be
>> employed, masked instructions are appropriate, or "dummy" local
>> storage can be used to hide the side effect from the outside
>> world.
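
As an illustration of the simplest of these schemes, the conditional guard, here is a small sequential C++ model. The names `Result` and `run_spmd_region` are hypothetical, and the loop over thread ids again only stands in for concurrent device threads. A real transformation would likely also need a barrier after the guarded code, and a broadcast of any value the other threads read afterwards:

```cpp
#include <cassert>

// Hypothetical model of a region that was converted to SPMD mode.
struct Result {
  int side_effects; // externally visible effects performed (e.g. stores)
  int work_items;   // side-effect-free work executed
};

Result run_spmd_region(int num_threads, bool guard_side_effects) {
  assert(num_threads > 0);
  Result r{0, 0};
  for (int tid = 0; tid < num_threads; ++tid) {
    // Formerly sequential code with an externally visible side effect.
    // Lightweight guard: only the master (thread 0) performs it.
    if (!guard_side_effects || tid == 0)
      ++r.side_effects;
    // Code without visible side effects can be executed redundantly
    // by all threads without changing the observable behavior.
    ++r.work_items;
  }
  return r;
}
```

Without the guard, 8 threads perform the side effect 8 times; with it, exactly once, which matches what guarded mode would have produced without a worker state machine.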
>>
>>
>> None of this has been implemented yet, but we plan to start in the
>> immediate future. Any comments, ideas, or criticism are welcome!
>>
>>
>> Cheers,
>> Johannes
>>
>>
>> P.S. [2-4] provide further information on implementation and features.
>>
>> [1] https://ieeexplore.ieee.org/document/7069297
>> [2] https://dl.acm.org/citation.cfm?id=2833161
>> [3] https://dl.acm.org/citation.cfm?id=3018870
>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>
>>