[Openmp-dev] nested parallelism in libomptarget-nvptx

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Sun Sep 9 01:30:05 PDT 2018


Hi Doru,

thanks for the link. However, I don't think these experiments looked at 
"nested parallelism", at least not as I understand it.
According to the OpenMP standard, the following snippet shows two 
parallel regions, the second one is "nested":
#pragma omp parallel // (1)
{
   #pragma omp parallel // (2)
   { }
}
Assuming that (1) is the outermost parallel region (i.e. it's not 
nested in yet another parallel region), it will probably be "active", 
meaning that it is executed by more than one thread. For (2), the 
implementation can decide whether it supports "nested parallelism" or 
serializes the parallel region (making it "inactive" in OpenMP's 
terminology).
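
To make the distinction observable, here is a minimal, self-contained 
sketch (my own illustration, not from the experiments discussed here) 
that prints the team size at the second level:
#include <omp.h>
#include <stdio.h>

int main(void) {
   #pragma omp parallel num_threads(2) // (1) outer, expected "active"
   {
      #pragma omp parallel num_threads(2) // (2) nested, may be serialized
      {
         #pragma omp single
         printf("level %d team has %d thread(s)\n",
                omp_get_level(), omp_get_num_threads());
      }
   }
   return 0;
}
If the implementation serializes (2), each inner team reports 1 
thread; with nested parallelism enabled (e.g. via 
omp_set_max_active_levels(2)) it should report 2.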

For comparison, the paper you linked evaluates implementations of 
something like the following:
#pragma omp target teams // (1)
{
   #pragma omp parallel // (2)
   { }
}
This is different in that (2) is the first "parallel" region on the 
device (even though it may be the second level of parallelism when 
mapped onto a GPU). From my understanding, Clang trunk already handles 
this, and I agree that this use case is important (see the performance 
comparison in the linked paper).
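
For illustration, the usual SPMD form of this pattern could look like 
the following (a sketch of my own, not taken from the paper):
#include <stdio.h>

int main(void) {
   int a[1024];
   for (int i = 0; i < 1024; ++i)
      a[i] = i;

   // teams form the outer level, the parallel threads the inner one;
   // on a GPU this typically maps to thread blocks x threads
   #pragma omp target teams distribute parallel for map(tofrom: a[0:1024])
   for (int i = 0; i < 1024; ++i)
      a[i] *= 2;

   printf("a[10] = %d\n", a[10]); // expect 20
   return 0;
}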

Back to my original question: Do we need to support "nested 
parallelism"?
#pragma omp target teams // (1)
#pragma omp parallel // (2)
#pragma omp parallel // (3)
{ }
This would be a third level of parallelism when executing on a GPU and 
would require data sharing from the worker threads of (2) (which act 
as the master threads of (3)) to the worker threads of (3).
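
To make that data-sharing requirement concrete, here is a sketch of 
the kind of code in question (again my own illustration):
#include <omp.h>

int main(void) {
   #pragma omp target teams // (1)
   {
      #pragma omp parallel // (2)
      {
         int x = omp_get_thread_num(); // private to a worker of (2)
         #pragma omp parallel // (3)
         {
            // every worker of (3) must be able to read x, which lives
            // with its master thread, i.e. a worker of (2); on NVPTX
            // this means moving x out of thread-private storage into
            // memory visible to the innermost team
            int y = x + omp_get_thread_num();
            (void)y;
         }
      }
   }
   return 0;
}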

Thanks,
Jonas

On 2018-09-09 00:51, Gheorghe-Teod Bercea wrote:
> Hi Jonas,
> 
> The experiments are in this paper:
> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767 [1]
> 
> In case you can't access the paper I'm attaching it here.
> 
> Thanks,
> 
> --Doru
> 
> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
> Cc:        Hal Finkel <hfinkel at anl.gov>, Alexey Bataev
> <alexey.bataev at ibm.com>, openmp-dev at lists.llvm.org
> Date:        09/08/2018 05:10 AM
> Subject:        Re: [Openmp-dev] nested parallelism in
> libomptarget-nvptx
> 
> -------------------------
> 
> Hi Doru,
> 
> On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:
>> Hi Hal,
>> 
>> At least as far as we are aware, the number of use cases where the
>> nested parallel scheme would be used is quite small. Most use cases
>> of OpenMP on GPUs have a single level of parallelism, which is
>> typically SPMD-like to achieve as much performance as possible. That
>> said, there is some merit to having a nested parallelism scheme,
>> because when it is helpful it typically is very helpful.
>> 
>> As a novelty point relative to ykt-clang, I would suggest that
>> whichever scheme (or schemes) we decide to use, they should be
>> applied only at the request of the user. This is because we can do a
>> better code generation job for more OpenMP patterns when using the
>> existing schemes (generic and SPMD) if we know at compile time that
>> there will be no second-level parallelism in use. This is due to
>> some changes in implementation in trunk compared to ykt-clang.
> 
> I agree: Even then we may be able to restructure the application to
> be more performant and portable without nested parallelism.
> 
>> Regarding which scheme to use, there were two that were floated
>> around based on discussions with users: (1) the current scheme in
>> ykt-clang, which enables the code in both the inner and outer
>> parallel loops to be executed in parallel, and (2) a scheme where
>> the outer loop code is executed by one thread and the innermost
>> loop is executed by all threads (this was requested by users at one
>> point; I assume this is still the case).
>> 
>> Since ykt-clang only supports the first scheme, when we ran
>> performance tests comparing nested parallelism against no nested
>> parallelism we got anywhere from a 4x slowdown to a 32x speedup,
>> depending on the ratio of outer to inner iterations, the work size
>> in the innermost loop, reductions, atomics, and memory coalescing.
>> About 80% of the cases we tried showed speed-ups, some of them
>> significant.
>> I would very much be in favour of having at least this scheme
>> supported, since it looks like it could be useful.
> 
> Interesting. Are these experiments public? I'd be interested to see
> the codes that benefit from nested parallelism.
> IIRC OpenACC doesn't have this feature, so I expect these to be
> corner cases.
> 
> Regards,
> Jonas
> 
> 
> 
> Links:
> ------
> [1] 
> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767

