[PATCH] D51875: [OPENMP][NVPTX] Add support for lastprivates/reductions handling in SPMD constructs with lightweight runtime.

Jonas Hahnfeld via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 28 06:13:34 PDT 2018


Hahnfeld added a comment.

In https://reviews.llvm.org/D51875#1229536, @ABataev wrote:

> I already described it - it breaks the compatibility with other outlined regions and breaks the whole design of the OpenMP implementation.
>
> [...]
>
> Yes, Alex Eichenberger is trying to invent something that will allow us to use something similar to ibm-devel but without breaking the design of OpenMP in the compiler. But it requires some time, and I'd like to have something working, at least.


Just to make sure I have come to the right conclusions after trying to understand the code generated since https://reviews.llvm.org/rC342738, and for documentation purposes in case the following explanation is correct: the compiler-generated code asks the runtime for two loop schedules, one for `distribute` and one to implement `for`. The latter iterates within the chunk returned by the `distribute` schedule.
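
In pseudocode, the nesting looks roughly like this (schematic only; I have not written down the exact runtime entry points, but the inner schedule is presumably requested via `__kmpc_for_static_init_4` or a variant of it):

  /* each team asks the runtime for its `distribute` chunks... */
  for (/* distribute schedule: chunk = [dist_lb, dist_ub] */) {
    /* ...and runs the `for` worksharing schedule inside each chunk */
    for (/* for schedule: [dist_lb, dist_ub] split among the team's threads */) {
      /* loop body */
    }
  }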

For `lastprivate`s on `teams distribute parallel for` this means that the global value needs to be updated in the last iteration of the last `distribute` chunk. However, the outlined `parallel` region only knows whether the current thread is executing the last iteration of the `for` worksharing construct. This means the `lastprivate` value of the `parallel for` is passed back to the `distribute` loop, which decides whether it has just executed the last chunk and needs to write to the global value.
In SPMD constructs, all CUDA threads execute the `distribute` loop, but only the thread that executed the last iteration of the `for` loop has seen the final `lastprivate` value. However, the information about which thread this is has been lost by the end of the `parallel` region, so data sharing is used to communicate the `lastprivate` value to all threads in the team that executes the last `distribute` chunk.

Assume a simple case like this:

  int last;
  #pragma omp target teams distribute parallel for map(from: last) lastprivate(last)
  for (int i = 0; i < 10000; i++) {
    last = i;
  }

Clang conceptually generates the following:

  void outlined_target_fn(int *last) {
    int *last_ds = /* get data sharing frame from runtime */
    for (/* distribute loop from 0 to 9999 */) {
      outlined_parallel_fn(lb, ub, last_ds);
    }
    if (/* received last chunk */) {
      *last = *last_ds;
    }
  }
  
  void outlined_parallel_fn(int lb, int ub, int *last) {
    int last_privatized;
    for (/* for loop from lb to ub */) {
      last_privatized = i;
    }
    if (/* executed last iteration of for loop */) {
      *last = last_privatized;
    }
  }
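
(For completeness: if I read the generated code correctly, the data sharing frame above comes from the libomptarget-nvptx stack, i.e. roughly the calls below; please correct me if these are not the right entry points.)

  int *last_ds = (int *)__kmpc_data_sharing_push_stack(sizeof(int), /*UseSharedMemory=*/0);
  /* ... distribute loop and copy-out to *last as above ... */
  __kmpc_data_sharing_pop_stack(last_ds);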

I tried to solve this problem without support from the runtime, and the following appears to work:

  void outlined_target_fn(int *last) {
    int last_dummy;
    for (/* distribute loop from 0 to 9999 */) {
      int *last_p = &last_dummy;
      if (/* is last chunk */) {
        last_p = last;
      }
      outlined_parallel_fn(lb, ub, last_p);
    }
  }
  
  void outlined_parallel_fn(int lb, int ub, int *last) {
    int last_privatized;
    for (/* for loop from lb to ub */) {
      last_privatized = i;
    }
    if (/* executed last iteration of for loop */) {
      *last = last_privatized;
    }
  }

(Alternatively, it should also be possible to set `last_p` before entering the `distribute` loop. This will write to `last` multiple times, but the final value should stay in memory after the kernel.)
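
A sketch of that variant (untested on my side), with `outlined_parallel_fn` unchanged:

  void outlined_target_fn(int *last) {
    int *last_p = last;  /* no dummy: every chunk writes through to `last` */
    for (/* distribute loop from 0 to 9999 */) {
      outlined_parallel_fn(lb, ub, last_p);
    }
  }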

As you can see, the outlined parallel function is unchanged (which is probably what you mean by "breaks the compatibility", @ABataev?). This should work because all invocations of `outlined_parallel_fn` write their value of `last` to a dummy location, except for the one executing the last `distribute` chunk.
What do you think?

In https://reviews.llvm.org/D51875#1241913, @grokos wrote:

> @Hahnfeld: Are the latest changes in line with your requirements/plans to reduce the memory footprint of the nvptx runtime?


I still think it's a waste of resources to statically allocate around 1 GB on `sm_70` / 660 MB on `sm_60`. And I find it worrying that we keep adding more and more data structures because it seems convenient for quickly solving a problem. The truth seems to be that it's incredibly hard to get rid of them later on...


Repository:
  rL LLVM

https://reviews.llvm.org/D51875




