[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

Fri Sep 28 05:27:34 PDT 2018

gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248844, @Hahnfeld wrote:

> Just tested this and got very weird results for register usage:
>
>   void func(double *a) {
>     #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
>     for (int i = 0; i < 100; i++) {
>       a[i]++;
>     }
>   }
>
>
> Compiling with current trunk for `sm_60` (Pascal): 29 registers
>  Adding `dist_schedule(static)` (the previous default): 19 registers
>  For reference: `dist_schedule(static, 128)` also uses 29 registers
>
> Any ideas? This significantly slows down STREAM...

Jonas, without an explicit dist_schedule clause the program will run with dist_schedule(static, <number of threads in block>). That appears to be working as intended, since the explicit chunked variant gives the same register count as the default case.
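
To make the mapping concrete, here is a minimal sketch of how iterations land on teams under the two distribute schedules. It is a simplified host-side model, not the actual NVPTX runtime code: the helper names are hypothetical, the unchunked case assumes one contiguous block of ceil(n / num_teams) iterations per team, and the per-thread mapping assumes the inner `parallel for` hands consecutive iterations to consecutive threads.

```cpp
#include <cstddef>

// Simplified model only; these helpers are hypothetical and do not
// exist in the OpenMP runtime.

// dist_schedule(static, chunk): chunks of `chunk` iterations are
// assigned to teams in round-robin order.
std::size_t team_chunked(std::size_t i, std::size_t chunk,
                         std::size_t num_teams) {
    return (i / chunk) % num_teams;
}

// dist_schedule(static): each team gets one contiguous block.
// Assumption: the block size is ceil(n / num_teams).
std::size_t team_unchunked(std::size_t i, std::size_t n,
                           std::size_t num_teams) {
    std::size_t block = (n + num_teams - 1) / num_teams;
    return i / block;
}

// Thread that handles iteration i inside its chunk, assuming the chunk
// size equals the number of threads in the block and consecutive
// iterations go to consecutive threads.
std::size_t thread_in_chunk(std::size_t i, std::size_t chunk) {
    return i % chunk;
}
```

In this model, with a chunk of 128, iterations 0..127 all belong to team 0 and map to threads 0..127 in order, so neighbouring threads touch neighbouring elements of `a`. With only 100 iterations, every iteration falls into the first chunk and a single team does all the work.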

The difference you see in register count is (I suspect) driven by the runtime code (fewer registers for the non-chunked schedule than for the chunked one). I am currently investigating this and looking for ways to reduce the number.

One big problem with your code is that the trip count is far too small, especially for a STREAM-style benchmark and especially on GPUs. You need a much larger loop size; otherwise the timings will be dominated by OpenMP setup costs.

Repository:
  rC Clang

https://reviews.llvm.org/D52434