[cfe-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

Tue Mar 19 15:02:42 PDT 2019

Hi Xinmin,
I would not suggest to consider this patch as final. It really requires a lot of work to land. Plus, I think, there is a way to improve non-SPMD mode with a new design. I cannot 100% guarantee it, but most probably there is a better design, which will allow to use all the threads (just like in SPMD), plus, possibly, to simplify the runtime library for non-SPMD mode.

Best regards,
Alexey Bataev

> 19 марта 2019 г., в 17:41, Tian, Xinmin <xinmin.tian at intel.com> написал(а):
> 
> Johannes, below is a case, we are investigating how to use state-machine with library calls.
> 
> #include <math.h>
> 
> #define MAX 1024
> 
> #pragma omp declare target
> double A[MAX], B[MAX], C[MAX];
> void __attribute__ ((noinline)) Compute()
> {
>      for (int i = 0; i < MAX; i++) {
>            C[i] += A[i] * B[0];
>      }
> 
>    #pragma omp parallel for
>      for (int i = 0; i < MAX; i++) {
>            C[i] += A[i] * B[i];
>      }
> }
> #pragma omp end declare target
> 
> int main() {
>  for (int i = 0; i < MAX; i++) {
>    A[i] = i - 1;
>    B[i] = i + 1;
>  }
> 
>  #pragma omp target teams distribute num_teams(4) thread_limit(16)
>      for (int i = 0; i < 8888; i++) {
>           Compute();
>           for (k=0; k < 100) { 
>               #pragma omp parallel for simd 
>               for (int i = 0; i < MAX; i++)
>                   Compute();
>           }
>      }
>  //printf("PASSED\n");
>  return 0;
> }
> 
> Xinmin 
> 
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Doerfert, Johannes via llvm-dev
> Sent: Wednesday, March 13, 2019 12:09 PM
> To: cfe-dev at lists.llvm.org
> Cc: LLVM-Dev <llvm-dev at lists.llvm.org>; Alexey Bataev <a.bataev at hotmail.com>; Arpith Chacko Jacob <acjacob at us.ibm.com>; openmp-dev at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"
> 
> Please consider reviewing the code for the proposed approach here:
>  https://reviews.llvm.org/D57460
> 
> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the rodinia 3.1 benchmark suite, showed 30% improvement after SPMD mode was enabled automatically. The code in nw is conceptually equivalent to the first example in the "to_SPMD_mode.ll" test case that can be found here:
>  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
> 
> The implementation is missing key features but one should be able to see the overall design by now. Once accepted, the missing features and more optimizations will be added.
> 
> 
>> On 01/22, Johannes Doerfert wrote:
>> Where we are
>> ------------
>> 
>> Currently, when we generate OpenMP target offloading code for GPUs, we 
>> use sufficient syntactic criteria to decide between two execution modes:
>>  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
>>  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>                  code. If an OpenMP distribute region is encountered, thus
>>                  if all threads (in the OpenMP team) are supposed to
>>                  execute the region, the master wakes up the idling
>>                  worker threads and points them to the correct piece of
>>                  code for distributed execution.
>> 
>> For a variety of reasons we (generally) prefer the first execution mode.
>> However, depending on the code, that might not be valid, or we might 
>> just not know if it is in the Clang code generation phase.
>> 
>> The implementation of the "guarded" execution mode follows roughly the 
>> state machine description in [1], though the implementation is 
>> different (more general) nowadays.
>> 
>> 
>> What we want
>> ------------
>> 
>> Increase the amount of code executed in SPMD mode and the use of 
>> lightweight "guarding" schemes where appropriate.
>> 
>> 
>> How we get (could) there
>> ------------------------
>> 
>> We propose the following two modifications in order:
>> 
>>  1) Move the state machine logic into the OpenMP runtime library. That
>>     means in SPMD mode all device threads will start the execution of
>>     the user code, thus emerge from the runtime, while in guarded mode
>>     only the master will escape the runtime and the other threads will
>>     idle in their state machine code that is now just "hidden".
>> 
>>     Why:
>>     - The state machine code cannot be (reasonably) optimized anyway,
>>       moving it into the library shouldn't hurt runtime but might even
>>       improve compile time a little bit.
>>     - The change should also simplify the Clang code generation as we
>>       would generate structurally the same code for both execution modes
>>       but only the runtime library calls, or their arguments, would
>>       differ between them.
>>     - The reason we should not "just start in SPMD mode" and "repair"
>>       it later is simple, this way we always have semantically correct
>>       and executable code.
>>     - Finally, and most importantly, there is now only little
>>       difference (see above) between the two modes in the code
>>       generated by clang. If we later analyze the code trying to decide
>>       if we can use SPMD mode instead of guarded mode the analysis and
>>       transformation becomes much simpler.
>> 
>> 2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>>    e.g., through the runtime library calls used, and that tries to
>>    convert it into the SPMD mode potentially by introducing lightweight
>>    guards in the process.
>> 
>>    Why:
>>    - After the inliner, and the canonicalizations, we have a clearer
>>      picture of the code that is actually executed in the target
>>      region and all the side effects it contains. Thus, we can make an
>>      educated decision on the required amount of guards that prevent
>>      unwanted side effects from happening after a move to SPMD mode.
>>    - At this point we can more easily introduce different schemes to
>>      avoid side effects by threads that were not supposed to run. We
>>      can decide if a state machine is needed, conditionals should be
>>      employed, masked instructions are appropriate, or "dummy" local
>>      storage can be used to hide the side effect from the outside
>>      world.
>> 
>> 
>> None of this was implemented yet but we plan to start in the immediate 
>> future. Any comments, ideas, criticism is welcome!
>> 
>> 
>> Cheers,
>>  Johannes
>> 
>> 
>> P.S. [2-4] Provide further information on implementation and features.
>> 
>> [1] https://ieeexplore.ieee.org/document/7069297
>> [2] https://dl.acm.org/citation.cfm?id=2833161
>> [3] https://dl.acm.org/citation.cfm?id=3018870
>> [4] https://dl.acm.org/citation.cfm?id=3148189
>> 
>> 
>> --
>> 
>> Johannes Doerfert
>> Researcher
>> 
>> Argonne National Laboratory
>> Lemont, IL 60439, USA
>> 
>> jdoerfert at anl.gov
> 
> 
> 
> -- 
> 
> Johannes Doerfert
> Researcher
> 
> Argonne National Laboratory
> Lemont, IL 60439, USA
> 
> jdoerfert at anl.gov