[Openmp-dev] nested parallelism in libomptarget-nvptx
Jonas Hahnfeld via Openmp-dev
openmp-dev at lists.llvm.org
Sun Sep 9 13:52:08 PDT 2018
P.S.: Maybe you can share the source code of a benchmark that uses
nested parallel regions / 3 levels of parallelism?
On 2018-09-09 21:27, Jonas Hahnfeld via Openmp-dev wrote:
> On 2018-09-09 19:24, Gheorghe-Teod Bercea wrote:
>> Hi Jonas,
>>
>> Very legitimate observations but I think there may be a
>> misunderstanding:
>>
>> The pattern which is given as an example in the paper:
>>
>> #teams
>> {
>>
>> #parallel
>>
>> }
>>
>> This is actually just the first level of parallelism. The code
>> between the teams and parallel directives is considered
>> "sequential" since only one thread per team executes it.
>>
>> The kernels that have been evaluated actually contain either the
>> parallel+parallel or the parallel+simd patterns.
>
> Hi Doru,
>
> I've now carefully read through the experiments and I can't find this.
> In particular, VI.d)
> "Our second version uses the target teams distribute directive to
> exploit outer parallelism across teams and the parallel for directive
> on an inner loop to exploit nested parallelism within a team."
>
> Jonas
>
>> The other example you give is:
>>
>> #teams
>> {
>> #parallel
>> {
>> #parallel
>> {}
>> }
>> }
>>
>> The innermost parallel would use the 2nd level of parallelism (or
>> nested parallelism). You can also add this pattern to that list:
>>
>> #teams
>> {
>> #parallel
>> {
>> #simd
>> {}
>> }
>> }
>>
>> More level 2 patterns:
>>
>> # target teams parallel
>> # parallel
>>
>> or
>>
>> # target teams parallel
>> # simd
>>
>> I hope this clarifies the experiments.
>>
>> Thanks,
>>
>> --Doru
>>
>> From: Jonas Hahnfeld <hahnjo at hahnjo.de>
>> To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
>> Cc: Alexey Bataev <alexey.bataev at ibm.com>, Hal Finkel
>> <hfinkel at anl.gov>, openmp-dev at lists.llvm.org
>> Date: 09/09/2018 04:30 AM
>> Subject: Re: [Openmp-dev] nested parallelism in
>> libomptarget-nvptx
>>
>> -------------------------
>>
>> Hi Doru,
>>
>> thanks for the link. However, I don't think these experiments looked
>> at "nested parallelism", at least not how I understand it.
>> According to the OpenMP standard, the following snippet shows two
>> parallel regions, the second one is "nested":
>> #pragma omp parallel // (1)
>> {
>> #pragma omp parallel // (2)
>> { }
>> }
>> Assuming that (1) is the outermost parallel region (i.e. it's not
>> nested in yet another parallel region), it will probably be "active",
>> meaning that it's
>> executed by more than one thread. For (2) the implementation can
>> decide
>> whether it supports "nested parallelism" or if it serializes the
>> parallel region (making it "inactive" in OpenMP's terminology).
>>
>> For comparison the paper you linked evaluates implementations for
>> something like the following:
>> #pragma omp target teams // (1)
>> {
>> #pragma omp parallel // (2)
>> { }
>> }
>> This is different in that (2) is the first "parallel" region on the
>> device (even though it may be the second level of parallelism when
>> mapped onto a GPU). From my understanding Clang trunk already handles
>> this and I agree that this use case is important (see performance
>> comparison in the linked paper).
>>
>> Back to my original question: Do we need to support "nested
>> parallelism"?
>> #pragma omp target teams // (1)
>> #pragma omp parallel // (2)
>> #pragma omp parallel // (3)
>> { }
>> This would be a third level of parallelism when executing on a GPU and
>> would require data sharing from worker threads of (2) (master threads
>> of
>> (3)) to worker threads of (3).
>>
>> Thanks,
>> Jonas
>>
>> On 2018-09-09 00:51, Gheorghe-Teod Bercea wrote:
>>> Hi Jonas,
>>>
>>> The experiments are in this paper:
>>> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767
>>>
>>> In case you can't access the paper I'm attaching it here.
>>>
>>> Thanks,
>>>
>>> --Doru
>>>
>>> From: Jonas Hahnfeld <hahnjo at hahnjo.de>
>>> To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
>>> Cc: Hal Finkel <hfinkel at anl.gov>, Alexey Bataev
>>> <alexey.bataev at ibm.com>, openmp-dev at lists.llvm.org
>>> Date: 09/08/2018 05:10 AM
>>> Subject: Re: [Openmp-dev] nested parallelism in
>>> libomptarget-nvptx
>>>
>>> -------------------------
>>>
>>> Hi Doru,
>>>
>>> On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:
>>>> Hi Hal,
>>>>
>>>> At least as far as we are aware, the number of use cases where the
>>>> nested parallel scheme would be used is quite small. Most of the use
>>>> cases of OpenMP on GPUs have a single level of parallelism which is
>>>> typically SPMD-like to achieve as much performance as possible. That
>>>> said there is some merit to having a nested parallelism scheme,
>>>> because when it is helpful it typically is very helpful.
>>>>
>>>> As a point of novelty compared to ykt-clang, I would suggest that
>>>> whichever scheme (or schemes) we decide to use, they should be
>>>> applied only at the request of the user. This is because we can do a
>>>> better code gen job for more OpenMP patterns when using the existing
>>>> schemes (generic and SPMD) if we know at compile time that there will
>>>> be no second-level parallelism in use. This is due to some changes in
>>>> implementation in trunk compared to ykt-clang.
>>>
>>> I agree: Even then we may be able to restructure the application to
>>> be more performant and portable without nested parallelism.
>>>
>>>> Regarding which scheme to use, there were two which were floated
>>>> around based on discussions with users: (1) the current scheme in
>>>> ykt-clang which enables the code in both inner and outer parallel
>>>> loops to be executed in parallel, and (2) a scheme where the outer
>>>> loop code is executed by one thread and the innermost loop is
>>>> executed by all threads (this was requested by users at one point, I
>>>> assume this is still the case).
>>>>
>>>> Since ykt-clang only supports the first scheme, when we ran
>>>> performance tests comparing nested parallelism against no nested
>>>> parallelism we got anywhere from a 4x slowdown to a 32x speedup
>>>> depending on: the ratio of outer to inner iterations, the work size
>>>> in the innermost loop, reductions, atomics and memory coalescing.
>>>> About 80% of the cases we tried showed speed-ups, some of them
>>>> significant.
>>>> I would very much be in favour of having at least this scheme
>>>> supported since it looks like it could be useful.
>>>
>>> Interesting. Are these experiments public? I'd be interested to see
>>> the codes that benefit from nested parallelism.
>>> IIRC OpenACC doesn't have this feature, so I expect these to be
>>> corner cases.
>>>
>>> Regards,
>>> Jonas
>>>
>>>
>>>
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev