[cfe-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
Alexey Bataev via cfe-dev
cfe-dev at lists.llvm.org
Tue Mar 19 15:02:42 PDT 2019
Hi Xinmin,
I would not suggest to consider this patch as final. It really requires a lot of work to land. Plus, I think, there is a way to improve non-SPMD mode with a new design. I cannot 100% guarantee it, but most probably there is a better design, which will allow to use all the threads (just like in SPMD), plus, possibly, to simplify the runtime library for non-SPMD mode.
Best regards,
Alexey Bataev
> 19 марта 2019 г., в 17:41, Tian, Xinmin <xinmin.tian at intel.com> написал(а):
>
> Johannes, below is a case, we are investigating how to use state-machine with library calls.
>
> #include <math.h>
>
> #define MAX 1024
>
> #pragma omp declare target
> double A[MAX], B[MAX], C[MAX];
> void __attribute__ ((noinline)) Compute()
> {
> for (int i = 0; i < MAX; i++) {
> C[i] += A[i] * B[0];
> }
>
> #pragma omp parallel for
> for (int i = 0; i < MAX; i++) {
> C[i] += A[i] * B[i];
> }
> }
> #pragma omp end declare target
>
> int main() {
> for (int i = 0; i < MAX; i++) {
> A[i] = i - 1;
> B[i] = i + 1;
> }
>
> #pragma omp target teams distribute num_teams(4) thread_limit(16)
> for (int i = 0; i < 8888; i++) {
> Compute();
> for (k=0; k < 100) {
> #pragma omp parallel for simd
> for (int i = 0; i < MAX; i++)
> Compute();
> }
> }
> //printf("PASSED\n");
> return 0;
> }
>
> Xinmin
>
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Doerfert, Johannes via llvm-dev
> Sent: Wednesday, March 13, 2019 12:09 PM
> To: cfe-dev at lists.llvm.org
> Cc: LLVM-Dev <llvm-dev at lists.llvm.org>; Alexey Bataev <a.bataev at hotmail.com>; Arpith Chacko Jacob <acjacob at us.ibm.com>; openmp-dev at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
>
> Please consider reviewing the code for the proposed approach here:
> https://reviews.llvm.org/D57460
>
> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the rodinia 3.1 benchmark suite, showed 30% improvement after SPMD mode was enabled automatically. The code in nw is conceptually equivalent to the first example in the "to_SPMD_mode.ll" test case that can be found here:
> https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
>
> The implementation is missing key features but one should be able to see the overall design by now. Once accepted, the missing features and more optimizations will be added.
>
>
>> On 01/22, Johannes Doerfert wrote:
>> Where we are
>> ------------
>>
>> Currently, when we generate OpenMP target offloading code for GPUs, we
>> use sufficient syntactic criteria to decide between two execution modes:
>> 1) SPMD -- All target threads (in an OpenMP team) run all the code.
>> 2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>> code. If an OpenMP distribute region is encountered, thus
>> if all threads (in the OpenMP team) are supposed to
>> execute the region, the master wakes up the idling
>> worker threads and points them to the correct piece of
>> code for distributed execution.
>>
>> For a variety of reasons we (generally) prefer the first execution mode.
>> However, depending on the code, that might not be valid, or we might
>> just not know if it is in the Clang code generation phase.
>>
>> The implementation of the "guarded" execution mode follows roughly the
>> state machine description in [1], though the implementation is
>> different (more general) nowadays.
>>
>>
>> What we want
>> ------------
>>
>> Increase the amount of code executed in SPMD mode and the use of
>> lightweight "guarding" schemes where appropriate.
>>
>>
>> How we get (could) there
>> ------------------------
>>
>> We propose the following two modifications in order:
>>
>> 1) Move the state machine logic into the OpenMP runtime library. That
>> means in SPMD mode all device threads will start the execution of
>> the user code, thus emerge from the runtime, while in guarded mode
>> only the master will escape the runtime and the other threads will
>> idle in their state machine code that is now just "hidden".
>>
>> Why:
>> - The state machine code cannot be (reasonably) optimized anyway,
>> moving it into the library shouldn't hurt runtime but might even
>> improve compile time a little bit.
>> - The change should also simplify the Clang code generation as we
>> would generate structurally the same code for both execution modes
>> but only the runtime library calls, or their arguments, would
>> differ between them.
>> - The reason we should not "just start in SPMD mode" and "repair"
>> it later is simple, this way we always have semantically correct
>> and executable code.
>> - Finally, and most importantly, there is now only little
>> difference (see above) between the two modes in the code
>> generated by clang. If we later analyze the code trying to decide
>> if we can use SPMD mode instead of guarded mode the analysis and
>> transformation becomes much simpler.
>>
>> 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>> e.g., through the runtime library calls used, and that tries to
>> convert it into the SPMD mode potentially by introducing lightweight
>> guards in the process.
>>
>> Why:
>> - After the inliner, and the canonicalizations, we have a clearer
>> picture of the code that is actually executed in the target
>> region and all the side effects it contains. Thus, we can make an
>> educated decision on the required amount of guards that prevent
>> unwanted side effects from happening after a move to SPMD mode.
>> - At this point we can more easily introduce different schemes to
>> avoid side effects by threads that were not supposed to run. We
>> can decide if a state machine is needed, conditionals should be
>> employed, masked instructions are appropriate, or "dummy" local
>> storage can be used to hide the side effect from the outside
>> world.
>>
>>
>> None of this was implemented yet but we plan to start in the immediate
>> future. Any comments, ideas, criticism is welcome!
>>
>>
>> Cheers,
>> Johannes
>>
>>
>> P.S. [2-4] Provide further information on implementation and features.
>>
>> [1] https://ieeexplore.ieee.org/document/7069297
>> [2] https://dl.acm.org/citation.cfm?id=2833161
>> [3] https://dl.acm.org/citation.cfm?id=3018870
>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>
>>
>> --
>>
>> Johannes Doerfert
>> Researcher
>>
>> Argonne National Laboratory
>> Lemont, IL 60439, USA
>>
>> jdoerfert at anl.gov
>
>
>
> --
>
> Johannes Doerfert
> Researcher
>
> Argonne National Laboratory
> Lemont, IL 60439, USA
>
> jdoerfert at anl.gov
More information about the cfe-dev
mailing list