[Openmp-dev] nested parallelism in libomptarget-nvptx

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Sat Sep 8 02:30:07 PDT 2018


On 2018-09-08 00:15, Hal Finkel wrote:
> On 09/07/2018 03:03 PM, Gheorghe-Teod Bercea wrote:
> 
>> Hi Hal,
>> 
>> As far as we are aware, the number of use cases where the nested
>> parallelism scheme would be used is quite small. Most uses of OpenMP
>> on GPUs have a single level of parallelism, which is typically
>> SPMD-like to achieve as much performance as possible. That said,
>> there is some merit to having a nested parallelism scheme because,
>> when it is helpful, it typically is very helpful.
>> 
>> As a departure from ykt-clang, I would suggest that whichever
>> scheme (or schemes) we decide to use should be applied only at the
>> explicit request of the user. This is because we can do a better
>> code generation job for more OpenMP patterns with the existing
>> schemes (generic and SPMD) if we know at compile time that there
>> will be no second-level parallelism in use. This is due to some
>> implementation changes in trunk compared to ykt-clang.
>> 
>> Regarding which scheme to use, two have been floated based on
>> discussions with users: (1) the current scheme in ykt-clang, which
>> enables the code in both the inner and the outer parallel loops to
>> be executed in parallel, and (2) a scheme where the outer loop code
>> is executed by one thread and the innermost loop is executed by all
>> threads (this was requested by users at one point; I assume this is
>> still the case).
>> 
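[For concreteness, a minimal sketch of the kind of user code scheme (1)
targets, assuming a simple C matrix-vector kernel; the function and
variable names are made up and not from the original discussion. Under
scheme (2), the body of the outer loop would instead run on a single
thread per team and only the inner parallel for would fan out to all
threads of the team.]

    void matvec(int n, int m, const double *a, const double *x, double *y) {
      // Outer level: iterations are spread across teams and their threads.
      #pragma omp target teams distribute parallel for \
          map(to: a[0:n*m], x[0:m]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        // Second (nested) level of parallelism inside the target region.
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < m; ++j)
          sum += a[i * m + j] * x[j];
        y[i] += sum;
      }
    }
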
>> Since ykt-clang only supports the first scheme, when we ran
>> performance tests comparing nested parallelism against no nested
>> parallelism we saw anywhere from a 4x slowdown to a 32x speedup,
>> depending on the ratio of outer to inner iterations, the amount of
>> work in the innermost loop, reductions, atomics, and memory
>> coalescing. About 80% of the cases we tried showed speed-ups, some
>> of them significant.
>> I would very much be in favour of having at least this scheme
>> supported, since it looks like it could be useful.
>> 
>> In terms of timing, we are still tied up with upstreaming at the
>> moment, so we won't be attempting a new code generation scheme until
>> we are feature-complete on the current ones.
> 
> Hi, Doru,
> 
> Thanks for explaining. I think that your suggestion of putting this
> behind a flag makes a lot of sense. It sounds as though, later, we
> might want different user-selectable schemes (although we might want
> to use pragmas instead of command-line flags at that point?).

Hi Hal,

At the extreme, this might also mean having multiple runtime
implementations. Without nested parallelism, many per-thread data
structures can be removed; see my initial motivation: there will only
be data sharing at the first level, so there is no need for buffers
for the worker threads, and so on.
Maybe (we are not there yet) this will make the per-team data
structures small enough to fit into shared memory, instead of having
queues in global memory that need atomics (see state-queue{,i}.h). For
SimpleThreadPrivateContext this seems to reduce the kernel execution
time of an empty SPMD construct with 8192 teams (read: its overhead)
from ~20us to ~14.5us. This might become noticeable for very small
kernels (for example, a single axpy / xpay in a conjugate gradient
solver with 1391349 elements takes around 60us with OpenACC, if I
interpret my old measurements correctly).
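
[For illustration, a minimal sketch of what such an "empty SPMD
construct with 8192 teams" microbenchmark could look like; this is not
the actual benchmark behind the numbers above, and the host-side
timing also includes the kernel launch itself.]

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
      // Warm up the device and plugin so initialization is not timed.
      #pragma omp target
      { }

      double t0 = omp_get_wtime();
      // Empty SPMD-style kernel: combined directive, 8192 teams, no work.
      #pragma omp target teams distribute parallel for num_teams(8192)
      for (int i = 0; i < 8192; ++i) {
        // intentionally empty
      }
      double t1 = omp_get_wtime();

      printf("empty SPMD construct: %.1f us\n", (t1 - t0) * 1e6);
      return 0;
    }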

Regards,
Jonas
