[Openmp-dev] nested parallelism in libomptarget-nvptx

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Sat Sep 8 02:10:18 PDT 2018


Hi Doru,

On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:
> Hi Hal,
> 
> At least as far as we are aware, the number of use cases where the
> nested parallel scheme would be used is quite small. Most of the use
> cases of OpenMP on GPUs have a single level of parallelism which is
> typically SPMD-like to achieve as much performance as possible. That
> said there is some merit to having a nested parallelism scheme because
> when it is helpful it typically is very helpful.
> 
> As a novelty point to ykt-clang I would suggest that whichever scheme
> (or schemes) we decide to use, they should be applied only at the
> request of the user. This is because we can do a better code gen job
> for more OpenMP patterns when using existing schemes (generic and
> SPMD) if we know at compile time that there will be no second-level
> parallelism in use. This is due to some changes in implementation in
> trunk compared to ykt-clang.

I agree. Even then, we may be able to restructure the application to be
more performant and portable without nested parallelism.

> Regarding which scheme to use there were two which were floated around
> based on discussions with users: (1) the current scheme in ykt-clang
> which enables the code in both inner and outer parallel loops to be
> executed in parallel and (2) a scheme where the outer loop code is
> executed by one thread and the innermost loop is executed by all
> threads (this was requested by users at one point, I assume this is
> still the case).
> 
> Since ykt-clang only supports the first scheme, when we ran performance
> tests comparing nested parallelism against no nested parallelism we
> got anywhere from a 4x slowdown to a 32x speedup, depending on the ratio
> of outer:inner iterations, the work size in the innermost loop,
> reductions, atomics and memory coalescing. About 80% of the
> cases we tried showed speed-ups, with some showing significant
> speed-ups.
> I would very much be in favour of having at least this scheme
> supported since it looks like it could be useful.

Interesting. Are these experiments public? I'd be interested to see the 
codes that benefit from nested parallelism.
IIRC OpenACC doesn't have this feature, so I expect these to be corner
cases.
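
To make the two schemes from the quoted message concrete, here is a rough
sketch in C. It is only my reading of the description, not code from
ykt-clang or trunk: the function names, loop bounds and loop bodies are
made up for illustration. scheme_nested() is the nested source pattern,
which scheme (1) would run with both levels in parallel. Scheme (2) is a
runtime strategy for that same source; scheme_inner_only() just writes out
its effect by hand, with the outer loop run by one thread and only the
innermost loop in a parallel region. Both compute the same result, and the
code falls back to the host (or to serial execution) if offloading or
OpenMP is not enabled.

```c
#include <stddef.h>

#define N_OUTER 4   /* illustrative outer trip count (assumption) */
#define N_INNER 8   /* illustrative inner trip count (assumption) */

/* Nested source pattern: a parallel loop inside a parallel loop.
 * Under scheme (1), code at both levels executes in parallel. */
void scheme_nested(int *out)
{
    #pragma omp target map(from: out[0:N_OUTER * N_INNER])
    #pragma omp parallel for
    for (int i = 0; i < N_OUTER; ++i) {
        int base = i * 100;              /* outer-loop code */
        #pragma omp parallel for
        for (int j = 0; j < N_INNER; ++j)
            out[(size_t)i * N_INNER + j] = base + j;
    }
}

/* Hand-written equivalent of what scheme (2) would do: the outer loop
 * code runs on one thread, the innermost loop is shared by all threads. */
void scheme_inner_only(int *out)
{
    #pragma omp target map(from: out[0:N_OUTER * N_INNER])
    for (int i = 0; i < N_OUTER; ++i) {
        int base = i * 100;              /* outer-loop code, one thread */
        #pragma omp parallel for
        for (int j = 0; j < N_INNER; ++j)
            out[(size_t)i * N_INNER + j] = base + j;
    }
}
```

Calling either function with a buffer of N_OUTER * N_INNER ints fills it
with the same values, so the difference is purely in how the work is
distributed over threads.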

Regards,
Jonas

