[Openmp-commits] [PATCH] D51875: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD constructs with lightweight runtime.

Alexey Bataev via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Fri Sep 28 06:33:26 PDT 2018


ABataev added a comment.

In https://reviews.llvm.org/D51875#1248997, @Hahnfeld wrote:

> In https://reviews.llvm.org/D51875#1229536, @ABataev wrote:
>
> > I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.
> >
> > [...]
> >
> > > Yes, Alex Eichenberger is trying to come up with something that will allow us to use something similar to ibm-devel without breaking the design of OpenMP in the compiler, but it requires some time. In the meantime, I'd like to have something working, at least.
>
>
> Just to make sure I came to the right conclusions after trying to understand the code generated since https://reviews.llvm.org/rC342738, and for documentation purposes, please confirm whether the following explanation is correct: The compiler-generated code asks the runtime for two loop schedules, one for `distribute` and the other to implement `for`. The latter iterates within the chunk returned by the `distribute` schedule.
>
> For `lastprivate`s on `teams distribute parallel for` this means that the global value needs to be updated in the last iteration of the last `distribute` chunk. However, the outlined `parallel` region only knows whether the current thread is executing the last iteration of the `for` worksharing construct. This means the `lastprivate` value of the `parallel for` is passed back to the `distribute` loop, which decides whether it has just executed the last chunk and needs to write to the global value.
>  In SPMD constructs all CUDA threads execute the `distribute` loop, but only the thread executing the last iteration of the `for` loop has seen the final `lastprivate` value. However, by the end of the `parallel` region the information about which thread this was has been lost. So data sharing is used to communicate the `lastprivate` value to all threads in the team executing the last `distribute` chunk.
>
> Assume a simple case like this:
>
>   int last;
>   #pragma omp target teams distribute parallel for map(from: last) lastprivate(last)
>   for (int i = 0; i < 10000; i++) {
>     last = i;
>   }
>
>
> Clang conceptually generates the following:
>
>   void outlined_target_fn(int *last) {
>     int *last_ds = /* get data sharing frame from runtime */
>     for (/* distribute loop from 0 to 9999 */) {
>       outlined_parallel_fn(lb, ub, last_ds);
>     }
>     if (/* received last chunk */) {
>       *last = *last_ds;
>     }
>   }
>  
>   void outlined_parallel_fn(int lb, int ub, int *last) {
>     int last_privatized;
>     for (/* for loop from lb to ub */) {
>       last_privatized = i;
>     }
>     if (/* executed last iteration of for loop */) {
>       *last = last_privatized;
>     }
>   }
>
>
> I tried to solve this problem without support from the runtime and this appears to work:
>
>   void outlined_target_fn(int *last) {
>     int last_dummy;
>     for (/* distribute loop from 0 to 9999 */) {
>       int *last_p = &last_dummy;
>       if (/* is last chunk */) {
>         last_p = last;
>       }
>       outlined_parallel_fn(lb, ub, last_p);
>     }
>   }
>  
>   void outlined_parallel_fn(int lb, int ub, int *last) {
>     int last_privatized;
>     for (/* for loop from lb to ub */) {
>       last_privatized = i;
>     }
>     if (/* executed last iteration of for loop */) {
>       *last = last_privatized;
>     }
>   }
>
>
> (Alternatively, it should also be possible to set `last_p` before entering the distribute loop. This will write to `last` multiple times, but the final value should stay in memory after the kernel.)
>
> As you can see, the outlined parallel function is unchanged (which is probably what you mean by "breaks the compatibility", @ABataev?). This should work because all invocations of `outlined_parallel_fn` write their value of `last` into a dummy location, except the one executing the last `distribute` chunk.
>  What do you think?
>
> In https://reviews.llvm.org/D51875#1241913, @grokos wrote:
>
> > @Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?
>
>
> I still think it's a waste of resources to statically allocate around 1 GB on `sm_70` / 660 MB on `sm_60`. And I think it's worrying that we are adding more and more data structures because it seems convenient to quickly solve a problem. The truth seems to be that it's incredibly hard to get rid of them later on...


No, you're not correct here.

  void outlined_target_fn(int *last) {
    int *last_ds = /* get data sharing frame from runtime */
    for (/* distribute loop from 0 to 9999 */) {
      outlined_parallel_fn(lb, ub, last_ds);
    }
    if (/* received last chunk */) {
      *last = *last_ds;
    }
  }

This code is for the `distribute` loop, and here you have a conflict without the data sharing scheme. The problem is that the check `/* received last chunk */` is `true` for all the threads executing the inner `for` directive, so `*last_ds` may come not from the last iteration of the `for` loop but from some other iteration. To solve this problem, we need to share the same `last_ds` among all the threads in the team.
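
To illustrate (this is only a sketch with my annotations, not the exact code Clang emits), the shared `last_ds` is what makes the copy-out correct:

  void outlined_target_fn(int *last) {
    /* last_ds points into the team's data sharing frame, so every thread in
       the team sees the same storage. */
    int *last_ds = /* get data sharing frame from runtime */
    for (/* distribute loop from 0 to 9999 */) {
      /* Only the thread that executes the last iteration of the inner `for`
         loop stores its lastprivate value into the shared *last_ds. */
      outlined_parallel_fn(lb, ub, last_ds);
    }
    if (/* received last chunk */) {
      /* Whichever thread performs this copy, *last_ds already holds the value
         from the last iteration of the `for` loop, because it is shared. */
      *last = *last_ds;
    }
  }

If each thread had its own private `last_ds` instead, the copy under `/* received last chunk */` could publish a value from some other iteration, which is exactly the conflict described above.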


Repository:
  rL LLVM

https://reviews.llvm.org/D51875




