<font size=2 face="sans-serif">Hi Jonas,</font><br><br><font size=2 face="sans-serif">These are very legitimate observations, but I think
there may be a misunderstanding:</font><br><br><font size=2 face="sans-serif">The pattern given as an example
in the paper:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> </font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">This is actually just the first level
of parallelism. The code between the parallel and the teams directives
is considered "sequential" since only one thread per team executes
it.</font><br><br><font size=2 face="sans-serif">The kernels that have been evaluated
actually contain either the parallel+parallel or the parallel+simd pattern.</font><br><br><font size=2 face="sans-serif">The other example you give is:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {}</font><br><font size=2 face="sans-serif"> }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">The innermost parallel would use the
second level of parallelism (i.e. nested parallelism). You can also add this
pattern to that list:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {</font><br><font size=2 face="sans-serif"> #simd</font><br><font size=2 face="sans-serif"> {}</font><br><font size=2 face="sans-serif"> }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">More level 2 patterns:</font><br><br><font size=2 face="sans-serif"># target teams parallel</font><br><font size=2 face="sans-serif"># parallel</font><br><br><font size=2 face="sans-serif">or</font><br><br><font size=2 face="sans-serif"># target teams parallel</font><br><font size=2 face="sans-serif"># simd</font><br><br><font size=2 face="sans-serif">I hope this clarifies the experiments.</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru</font><br><br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">Jonas Hahnfeld <hahnjo@hahnjo.de></font><br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">Gheorghe-Teod Bercea
<Gheorghe-Teod.Bercea@ibm.com></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:
</font><font size=1 face="sans-serif">Alexey Bataev <alexey.bataev@ibm.com>,
Hal Finkel <hfinkel@anl.gov>, openmp-dev@lists.llvm.org</font><br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">09/09/2018 04:30 AM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [Openmp-dev]
nested parallelism in libomptarget-nvptx</font><br><hr noshade><br><br><br><tt><font size=2>Hi Doru,<br><br>Thanks for the link. However, I don't think these experiments looked at
<br>"nested parallelism", at least not how I understand it.<br>According to the OpenMP standard, the following snippet shows two <br>parallel regions, the second of which is "nested":<br>#pragma omp parallel // (1)<br>{<br> #pragma omp parallel // (2)<br> { }<br>}<br>Assuming that (1) is the outermost parallel region (i.e. it's not nested
<br>in yet another parallel) it will probably be "active", meaning
that it's <br>executed by more than one thread. For (2) the implementation can decide
<br>whether it supports "nested parallelism" or if it serializes
the <br>parallel region (making it "inactive" in OpenMP's terminology).<br><br>For comparison the paper you linked evaluates implementations for <br>something like the following:<br>#pragma omp target teams // (1)<br>{<br> #pragma omp parallel // (2)<br> { }<br>}<br>This is different in that (2) is the first "parallel" region
on the <br>device (even though it may be the second level of parallelism when <br>mapped onto a GPU). From my understanding Clang trunk already handles <br>this and I agree that this use case is important (see performance <br>comparison in the linked paper).<br><br>Back to my original question: Do we need to support "nested <br>parallelism"?<br>#pragma omp target teams // (1)<br>#pragma omp parallel // (2)<br>#pragma omp parallel // (3)<br>{ }<br>This would be a third level of parallelism when executing on a GPU and
<br>would require data sharing from worker threads of (2) (master threads of
<br>(3)) to worker threads of (3).<br><br>Thanks,<br>Jonas<br><br>On 2018-09-09 00:51, Gheorghe-Teod Bercea wrote:<br>> Hi Jonas,<br>> <br>> The experiments are in this paper:<br>> </font></tt><a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767"><tt><font size=2>https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767</font></tt></a><tt><font size=2>[1]<br>> <br>> In case you can't access the paper I'm attaching it here.<br>> <br>> Thanks,<br>> <br>> --Doru<br>> <br>> From: Jonas Hahnfeld <hahnjo@hahnjo.de><br>> To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com><br>> Cc: Hal Finkel <hfinkel@anl.gov>,
Alexey Bataev<br>> <alexey.bataev@ibm.com>, openmp-dev@lists.llvm.org<br>> Date: 09/08/2018 05:10 AM<br>> Subject: Re: [Openmp-dev] nested parallelism
in<br>> libomptarget-nvptx<br>> <br>> -------------------------<br>> <br>> Hi Doru,<br>> <br>> On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:<br>>> Hi Hal,<br>>> <br>>> At least as far as we are aware, the number of use cases where
the<br>>> nested parallel scheme would be used is quite small. Most of the
use<br>>> cases of OpenMP on GPUs have a single level of parallelism which
is<br>>> typically SPMD-like to achieve as much performance as possible.
That<br>>> said, there is some merit to having a nested parallelism scheme<br>> because<br>>> when it is helpful it typically is very helpful.<br>>> <br>>> As a novelty point to ykt-clang I would suggest that whichever<br>> scheme<br>>> (or schemes) we decide to use, they should be applied only at
the<br>>> request of the user. This is because we can do a better code gen
job<br>>> for more OpenMP patterns when using existing schemes (generic
and<br>>> SPMD) if we know at compile time if there will be no second level<br>>> parallelism in use. This is due to some changes in implementation
in<br>>> trunk compared to ykt-clang.<br>> <br>> I agree: Even then we may be able to restructure the application to
be<br>> <br>> more performant and portable without nested parallelism.<br>> <br>>> Regarding which scheme to use there were two which were floated<br>> around<br>>> based on discussions with users: (1) the current scheme in ykt-clang<br>>> which enables the code in both inner and outer parallel loops
to be<br>>> executed in parallel and (2) a scheme where the outer loop code
is<br>>> executed by one thread and the innermost loop is executed by all<br>>> threads (this was requested by users at one point, I assume this
is<br>>> still the case).<br>>> <br>>> Since ykt-clang only supports the first scheme, when we ran<br>> performance<br>>> tests comparing nested parallelism against no nested parallelism
we<br>>> got anywhere from a 4x slowdown to a 32x speedup depending on: the
ratio<br>>> of outer:inner iterations, the work size in the innermost loop,<br>>> reductions, atomics and memory coalescing. About 80% of the<br>>> cases we tried showed speed-ups, with some showing significant<br>>> speed-ups.<br>>> I would very much be in favour of having at least this scheme<br>>> supported since it looks like it could be useful.<br>> <br>> Interesting. Are these experiments public? I'd be interested to see<br>> the<br>> codes that benefit from nested parallelism.<br>> IIRC OpenACC doesn't have this feature, so I expect these to be corner<br>> cases.<br>> <br>> Regards,<br>> Jonas<br>> <br>> <br>> <br>> Links:<br>> ------<br>> [1] <br>> </font></tt><a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767"><tt><font size=2>https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767</font></tt></a><tt><font size=2><br><br></font></tt><br><br><BR>