[PATCH] D51875: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD constructs with lightweight runtime.

Fri Sep 28 08:03:13 PDT 2018

Hahnfeld added a comment.

In https://reviews.llvm.org/D51875#1249092, @ABataev wrote:

> In https://reviews.llvm.org/D51875#1249088, @ABataev wrote:
>
> > 1. It is not how clang works, it is what the standard requires.
>

I've tried to describe how the current implementation works, based on the IR that is generated.

In https://reviews.llvm.org/D51875#1248997, @Hahnfeld wrote:

> Clang conceptually generates the following:
>
>   void outlined_target_fn(int *last) {
>     int *last_ds = /* get data sharing frame from runtime */;
>     for (/* distribute loop from 0 to 9999 */) {
>       outlined_parallel_fn(lb, ub, last_ds);
>     }
>     if (/* received last chunk */) {
>       *last = *last_ds;
>     }
>   }
>  
>   void outlined_parallel_fn(int lb, int ub, int *last) {
>     int last_privatized;
>     for (/* for loop from lb to ub */) {
>       last_privatized = i;
>     }
>     if (/* executed last iteration of for loop */) {
>       *last = last_privatized;
>     }
>   }
>

Please let me know if this pseudo-code conceptually doesn't match the current IR.

> > 2. Yes, it is shared between all the threads in the team, and this is how it is intended to be according to the standard.
>
> The main problem with your solution is that the distribute loop does not have information about which thread actually executed the last chunk of the loop. All the threads in the last team must execute the same check, and only one shall write its private value to the original variable. But, just like I said, the runtime does not provide this information to the compiler.

Now you are talking about the second pseudo-code:

In https://reviews.llvm.org/D51875#1248997, @Hahnfeld wrote:

> I tried to solve this problem without support from the runtime and this appears to work:
>
>   void outlined_target_fn(int *last) {
>     int last_dummy;
>     for (/* distribute loop from 0 to 9999 */) {
>       int *last_p = &last_dummy;
>       if (/* is last chunk */) {
>         last_p = last;
>       }
>       outlined_parallel_fn(lb, ub, last_p);
>     }
>   }
>  
>   void outlined_parallel_fn(int lb, int ub, int *last) {
>     int last_privatized;
>     for (/* for loop from lb to ub */) {
>       last_privatized = i;
>     }
>     if (/* executed last iteration of for loop */) {
>       *last = last_privatized;
>     }
>   }
>

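I don't see why the distribute loop should care which thread actually executes the last iteration of the `for` loop; that is only relevant in the outlined parallel region. In this scheme the distribute loop only decides whether the current chunk is the last one and passes the matching pointer.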
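
To make the second scheme concrete, here is a minimal sequential sketch in C. It keeps the names `outlined_target_fn` and `outlined_parallel_fn` from the pseudo-code, but everything else is an assumption made for illustration: the constants `N`, `NUM_TEAMS` and `NUM_THREADS`, the static chunking, and the plain loops that stand in for teams and threads. It is not the IR that clang generates and it does not model real concurrency.

```c
#include <assert.h>

/* Hypothetical sizes, chosen only for this sketch. */
enum { N = 10000, NUM_TEAMS = 4, NUM_THREADS = 8 };

/* Sequential stand-in for the outlined parallel region: the loop over
   `tid` plays the role of the threads of one team. */
void outlined_parallel_fn(int lb, int ub, int *last) {
  assert(lb <= ub);
  int len = ub - lb + 1;
  int per_thread = (len + NUM_THREADS - 1) / NUM_THREADS;
  for (int tid = 0; tid < NUM_THREADS; ++tid) {
    int t_lb = lb + tid * per_thread;
    int t_ub = t_lb + per_thread - 1;
    if (t_ub > ub)
      t_ub = ub;
    if (t_lb > t_ub)
      continue; /* this simulated thread has no iterations */
    int last_privatized = -1;
    for (int i = t_lb; i <= t_ub; ++i)
      last_privatized = i;
    /* Only the thread that executed the last iteration of the for loop
       writes through the pointer it was given. */
    if (t_ub == ub)
      *last = last_privatized;
  }
}

/* Sequential stand-in for the target region: the loop over `team` plays
   the role of the distribute loop handing one chunk to each team. */
void outlined_target_fn(int *last) {
  int chunk = (N + NUM_TEAMS - 1) / NUM_TEAMS;
  for (int team = 0; team < NUM_TEAMS; ++team) {
    int last_dummy; /* sink for teams that do not own the last chunk */
    int lb = team * chunk;
    int ub = lb + chunk - 1;
    if (ub > N - 1)
      ub = N - 1;
    if (lb > ub)
      continue;
    /* The distribute loop only checks "is this the last chunk?" and
       picks the pointer; it never asks which thread finished last. */
    int *last_p = &last_dummy;
    if (ub == N - 1)
      last_p = last;
    outlined_parallel_fn(lb, ub, last_p);
  }
}
```

Calling `outlined_target_fn(&last)` with these sizes leaves `last == 9999`: only the chunk ending at `N - 1` receives the real pointer, and within that chunk only the simulated thread that ran iteration 9999 stores through it. Every other team writes into its own `last_dummy`.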

Repository:
  rL LLVM

https://reviews.llvm.org/D51875