[Openmp-dev] nested parallelism in libomptarget-nvptx

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Mon Sep 10 05:05:30 PDT 2018


Hi Doru,

I'm getting more and more confused by your seemingly contradictory 
answers. In your first reply you wrote:
> The second level of parallelism in clang-ykt uses a scheme where all
> the threads in each warp cooperate to execute the workload of the 1st
> thread in the warp then the 2nd and so on until the workload of each
> of the 32 threads in the warp has been completed.

Accordingly, if you compile the following with clang-ykt:
#pragma omp teams
#pragma omp parallel
#pragma omp parallel
{
   /* ... */ = omp_get_num_threads();
}
the API call will return 32 because it's using "convergent parallelism" 
(see top of parallel.cu).
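
As I understand that scheme (a rough sketch of the idea, not the
actual parallel.cu code; execute_inner_region is a hypothetical
placeholder):

for (int lane = 0; lane < 32; ++lane) {
  // All 32 threads of the warp step through the lanes together;
  // in iteration `lane` they cooperatively execute the inner
  // parallel region that logically belongs to thread `lane` of
  // the outer region.
  execute_inner_region(lane);
}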

Hal's question was whether that makes sense to implement in Clang 
trunk, and you answered:
> Since ykt-clang only supports the first scheme, when we ran
> performance tests comparing nested parallelism against no nested
> parallelism we got anywhere from a 4x slowdown to a 32x speedup [...]
quoting a paper, though you now seem to agree that its experiments 
only dealt with a single parallel region inside a teams construct.

I'm sure you know what you are writing about, but the loose ends don't 
add up for me right now.

Jonas

On 2018-09-10 00:30, Gheorghe-Teod Bercea wrote:
> Hi Jonas,
> 
> When it is safe to do so, "target teams distribute" uses all threads
> in the team, so when the "parallel for" is encountered, the second
> level of parallelism is activated. Any further nestings of "parallel
> for" or "simd" directives would be executed sequentially.
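> 
> For example (just a sketch of the pattern; the loop bounds and
> work() are placeholders):
> 
> #pragma omp target teams distribute   // 1st level: across teams
> for (int i = 0; i < N; ++i) {
>   #pragma omp parallel for            // 2nd level: threads in a team
>   for (int j = 0; j < M; ++j) {
>     #pragma omp parallel for          // further nesting: sequential
>     for (int k = 0; k < K; ++k)
>       work(i, j, k);
>   }
> }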
> 
> Thanks,
> 
> --Doru
> 
> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
> Cc:        Alexey Bataev <alexey.bataev at ibm.com>, Hal Finkel
> <hfinkel at anl.gov>, openmp-dev at lists.llvm.org
> Date:        09/09/2018 03:27 PM
> Subject:        Re: [Openmp-dev] nested parallelism in
> libomptarget-nvptx
> 
> -------------------------
> 
> On 2018-09-09 19:24, Gheorghe-Teod Bercea wrote:
>> Hi Jonas,
>> 
>> Very legitimate observations, but I think there may be a
>> misunderstanding:
>> 
>> The pattern given as an example in the paper is:
>> 
>> #teams
>> {
>>   #parallel
>> }
>> 
>> This is actually just the first level of parallelism. The code in
>> between the teams and the parallel directives is considered
>> "sequential" since only one thread per team executes it.
>> 
>> The kernels that have been evaluated actually contain either the
>> parallel+parallel or the parallel+simd patterns.
> 
> Hi Doru,
> 
> I've now carefully read through the experiments and I can't find
> this. In particular, section VI.d) says:
> "Our second version uses the target teams distribute directive to
> exploit outer parallelism across teams and the parallel for directive
> on an inner loop to exploit nested parallelism within a team."
> 
> Jonas
> 
>> The other example you give is:
>> 
>> #teams
>> {
>>   #parallel
>>   {
>>      #parallel
>>      {}
>>   }
>> }
>> 
>> The innermost parallel would use the 2nd level of parallelism (or
>> nested parallelism). You can also add this pattern to that list:
>> 
>> #teams
>> {
>>   #parallel
>>   {
>>      #simd
>>      {}
>>   }
>> }
>> 
>> More level 2 patterns:
>> 
>> # target teams parallel
>> # parallel
>> 
>> or
>> 
>> # target teams parallel
>> # simd
>> 
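>> Spelled out with full directives, my reading of the first of those
>> shorthands is roughly (sketch only):
>> 
>> #pragma omp target teams
>> #pragma omp parallel     // 1st level
>> {
>>   #pragma omp parallel   // 2nd level (nested)
>>   { /* ... */ }
>> }
>> 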
>> I hope this clarifies the experiments.
>> 
>> Thanks,
>> 
>> --Doru
>> 
>> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
>> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
>> Cc:        Alexey Bataev <alexey.bataev at ibm.com>, Hal Finkel
>> <hfinkel at anl.gov>, openmp-dev at lists.llvm.org
>> Date:        09/09/2018 04:30 AM
>> Subject:        Re: [Openmp-dev] nested parallelism in
>> libomptarget-nvptx
>> 
>> -------------------------
>> 
>> Hi Doru,
>> 
>> thanks for the link. However, I don't think these experiments looked
>> at "nested parallelism", at least not how I understand it.
>> According to the OpenMP standard, the following snippet shows two
>> parallel regions, the second one is "nested":
>> #pragma omp parallel // (1)
>> {
>>   #pragma omp parallel // (2)
>>   { }
>> }
>> Assuming that (1) is the outermost parallel region (i.e. it's not
>> nested in yet another parallel), it will probably be "active",
>> meaning that it's
>> executed by more than one thread. For (2) the implementation can
>> decide
>> whether it supports "nested parallelism" or if it serializes the
>> parallel region (making it "inactive" in OpenMP's terminology).
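>> 
>> For illustration, a minimal host-side sketch (my own example, using
>> omp_set_nested to explicitly disable nested parallelism):
>> 
>> #include <omp.h>
>> #include <stdio.h>
>> int main(void) {
>>   omp_set_nested(0);                    // serialize nested regions
>>   #pragma omp parallel num_threads(4)   // (1) active: 4 threads
>>   #pragma omp parallel num_threads(4)   // (2) inactive: serialized
>>   printf("threads=%d level=%d active=%d\n",
>>          omp_get_num_threads(),   // prints 1 inside (2)
>>          omp_get_level(),         // 2: nesting depth
>>          omp_get_active_level()); // 1: only (1) is active
>>   return 0;
>> }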
>> 
>> For comparison, the paper you linked evaluates implementations for
>> something like the following:
>> #pragma omp target teams // (1)
>> {
>>   #pragma omp parallel // (2)
>>   { }
>> }
>> This is different in that (2) is the first "parallel" region on the
>> device (even though it may be the second level of parallelism when
>> mapped onto a GPU). From my understanding, Clang trunk already
>> handles this, and I agree that this use case is important (see the
>> performance
>> comparison in the linked paper).
>> 
>> Back to my original question: Do we need to support "nested
>> parallelism"?
>> #pragma omp target teams // (1)
>> #pragma omp parallel // (2)
>> #pragma omp parallel // (3)
>> { }
>> This would be a third level of parallelism when executing on a GPU
>> and would require data sharing from worker threads of (2) (master
>> threads of (3)) to worker threads of (3).
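>> 
>> Concretely, the data sharing I mean would be needed for something
>> like this (compute and use are placeholders):
>> 
>> #pragma omp target teams
>> #pragma omp parallel        // (2)
>> {
>>   int x = compute();        // private to a worker thread of (2)
>>   #pragma omp parallel      // (3)
>>   {
>>     use(x);                 // x must be made visible to the worker
>>                             // threads of (3)
>>   }
>> }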
>> 
>> Thanks,
>> Jonas
>> 
>> On 2018-09-09 00:51, Gheorghe-Teod Bercea wrote:
>>> Hi Jonas,
>>> 
>>> The experiments are in this paper:
>>> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767
>>> 
>>> In case you can't access the paper I'm attaching it here.
>>> 
>>> Thanks,
>>> 
>>> --Doru
>>> 
>>> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
>>> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
>>> Cc:        Hal Finkel <hfinkel at anl.gov>, Alexey Bataev
>>> <alexey.bataev at ibm.com>, openmp-dev at lists.llvm.org
>>> Date:        09/08/2018 05:10 AM
>>> Subject:        Re: [Openmp-dev] nested parallelism in
>>> libomptarget-nvptx
>>> 
>>> -------------------------
>>> 
>>> Hi Doru,
>>> 
>>> On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:
>>>> Hi Hal,
>>>> 
>>>> At least as far as we are aware, the number of use cases where the
>>>> nested parallel scheme would be used is quite small. Most of the
>>>> use cases of OpenMP on GPUs have a single level of parallelism,
>>>> which is typically SPMD-like to achieve as much performance as
>>>> possible. That said, there is some merit to having a nested
>>>> parallelism scheme because, when it is helpful, it typically is
>>>> very helpful.
>>>> 
>>>> As a departure from ykt-clang, I would suggest that whichever
>>>> scheme (or schemes) we decide to use should be applied only at the
>>>> request of the user. This is because we can do a better code gen
>>>> job for more OpenMP patterns when using the existing schemes
>>>> (generic and SPMD) if we know at compile time that there will be no
>>>> second level of parallelism in use. This is due to some
>>>> implementation changes in trunk compared to ykt-clang.
>>> 
>>> I agree: even then we may be able to restructure the application
>>> to be more performant and portable without nested parallelism.
>>> 
>>>> Regarding which scheme to use, two were floated around based on
>>>> discussions with users: (1) the current scheme in ykt-clang, which
>>>> enables the code in both the inner and outer parallel loops to be
>>>> executed in parallel, and (2) a scheme where the outer loop code is
>>>> executed by one thread and the innermost loop is executed by all
>>>> threads (this was requested by users at one point; I assume this is
>>>> still the case).
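>>>> 
>>>> Roughly, scheme (2) corresponds to something like the following
>>>> (a sketch; setup and work are placeholders):
>>>> 
>>>> #pragma omp target teams distribute
>>>> for (int i = 0; i < N; ++i) {
>>>>   setup(i);                    // outer loop body: one thread
>>>>   #pragma omp parallel for     // innermost loop: all threads
>>>>   for (int j = 0; j < M; ++j)
>>>>     work(i, j);
>>>> }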
>>>> 
>>>> Since ykt-clang only supports the first scheme, when we ran
>>>> performance tests comparing nested parallelism against no nested
>>>> parallelism we got anywhere from a 4x slowdown to a 32x speedup,
>>>> depending on: the ratio of outer to inner iterations, the work size
>>>> in the innermost loop, reductions, atomics, and memory coalescing.
>>>> About 80% of the cases we tried showed speed-ups, some of them
>>>> significant.
>>>> I would very much be in favour of having at least this scheme
>>>> supported since it looks like it could be useful.
>>> 
>>> Interesting. Are these experiments public? I'd be interested to see
>>> the codes that benefit from nested parallelism.
>>> IIRC OpenACC doesn't have this feature, so I expect these to be
>>> corner cases.
>>> 
>>> Regards,
>>> Jonas
>>> 

