<font size=2 face="sans-serif">Hi Jonas,</font><br><br><font size=2 face="sans-serif">These are very legitimate observations, but I think
there may be a misunderstanding:</font><br><br><font size=2 face="sans-serif">The pattern given as an example
in the paper:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> </font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">This is actually just the first level
of parallelism. The code between the parallel and the teams directives
is considered "sequential" since only one thread per team executes
it.</font><br><br><font size=2 face="sans-serif">The kernels that have been evaluated
actually contain either the parallel+parallel or the parallel+simd pattern.</font><br><br><font size=2 face="sans-serif">The other example you give is:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {}</font><br><font size=2 face="sans-serif"> }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">The innermost parallel would use the
second level of parallelism (i.e. nested parallelism). You can also add this
pattern to that list:</font><br><br><font size=2 face="sans-serif">#teams</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif"> #parallel</font><br><font size=2 face="sans-serif"> {</font><br><font size=2 face="sans-serif"> #simd</font><br><font size=2 face="sans-serif"> {}</font><br><font size=2 face="sans-serif"> }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">More level 2 patterns:</font><br><br><font size=2 face="sans-serif"># target teams parallel</font><br><font size=2 face="sans-serif"># parallel</font><br><br><font size=2 face="sans-serif">or</font><br><br><font size=2 face="sans-serif"># target teams parallel</font><br><font size=2 face="sans-serif"># simd</font><br><br><font size=2 face="sans-serif">I hope this clarifies the experiments.</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru</font><br><br><font size=1 color=#5f5f5f face="sans-serif">From:
</font><font size=1 face="sans-serif">Jonas Hahnfeld <hahnjo@hahnjo.de></font><br><font size=1 color=#5f5f5f face="sans-serif">To:
</font><font size=1 face="sans-serif">Gheorghe-Teod Bercea
<Gheorghe-Teod.Bercea@ibm.com></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:
</font><font size=1 face="sans-serif">Alexey Bataev <alexey.bataev@ibm.com>,
Hal Finkel <hfinkel@anl.gov>, openmp-dev@lists.llvm.org</font><br><font size=1 color=#5f5f5f face="sans-serif">Date:
</font><font size=1 face="sans-serif">09/09/2018 04:30 AM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:
</font><font size=1 face="sans-serif">Re: [Openmp-dev]
nested parallelism in libomptarget-nvptx</font><br><hr noshade><br><br><br><tt><font size=2>Hi Doru,<br><br>Thanks for the link. However, I don't think these experiments looked at
<br>"nested parallelism", at least not how I understand it.<br>According to the OpenMP standard, the following snippet shows two <br>parallel regions, the second of which is "nested":<br>#pragma omp parallel // (1)<br>{<br> #pragma omp parallel // (2)<br> { }<br>}<br>Assuming that (1) is the outermost parallel region (i.e. it's not nested
<br>in yet another parallel) it will probably be "active", meaning
that it's <br>executed by more than one thread. For (2) the implementation can decide
<br>whether it supports "nested parallelism" or if it serializes
the <br>parallel region (making it "inactive" in OpenMP's terminology).<br><br>For comparison the paper you linked evaluates implementations for <br>something like the following:<br>#pragma omp target teams // (1)<br>{<br> #pragma omp parallel // (2)<br> { }<br>}<br>This is different in that (2) is the first "parallel" region
on the <br>device (even though it may be the second level of parallelism when <br>mapped onto a GPU). From my understanding Clang trunk already handles <br>this and I agree that this use case is important (see performance <br>comparison in the linked paper).<br><br>Back to my original question: Do we need to support "nested <br>parallelism"?<br>#pragma omp target teams // (1)<br>#pragma omp parallel // (2)<br>#pragma omp parallel // (3)<br>{ }<br>This would be a third level of parallelism when executing on a GPU and
<br>would require data sharing from worker threads of (2) (master threads of
<br>(3)) to worker threads of (3).<br><br>Thanks,<br>Jonas<br><br>On 2018-09-09 00:51, Gheorghe-Teod Bercea wrote:<br>> Hi Jonas,<br>> <br>> The experiments are in this paper:<br>> </font></tt><a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767"><tt><font size=2>https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767</font></tt></a><tt><font size=2>[1]<br>> <br>> In case you can't access the paper I'm attaching it here.<br>> <br>> Thanks,<br>> <br>> --Doru<br>> <br>> From: Jonas Hahnfeld <hahnjo@hahnjo.de><br>> To: Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea@ibm.com><br>> Cc: Hal Finkel <hfinkel@anl.gov>,
Alexey Bataev<br>> <alexey.bataev@ibm.com>, openmp-dev@lists.llvm.org<br>> Date: 09/08/2018 05:10 AM<br>> Subject: Re: [Openmp-dev] nested parallelism
in<br>> libomptarget-nvptx<br>> <br>> -------------------------<br>> <br>> Hi Doru,<br>> <br>> On 2018-09-07 22:03, Gheorghe-Teod Bercea wrote:<br>>> Hi Hal,<br>>> <br>>> At least as far as we are aware, the number of use cases where
the<br>>> nested parallel scheme would be used is quite small. Most of the
use<br>>> cases of OpenMP on GPUs have a single level of parallelism which
is<br>>> typically SPMD-like to achieve as much performance as possible.
That<br>>> said, there is some merit to having a nested parallelism scheme<br>> because<br>>> when it is helpful it typically is very helpful.<br>>> <br>>> As a novelty point to ykt-clang I would suggest that whichever<br>> scheme<br>>> (or schemes) we decide to use, they should be applied only at
the<br>>> request of the user. This is because we can do a better code gen
job<br>>> for more OpenMP patterns when using existing schemes (generic
and<br>>> SPMD) if we know at compile time if there will be no second level<br>>> parallelism in use. This is due to some changes in implementation
in<br>>> trunk compared to ykt-clang.<br>> <br>> I agree: Even then we may be able to restructure the application to
be<br>> <br>> more performant and portable without nested parallelism.<br>> <br>>> Regarding which scheme to use there were two which were floated<br>> around<br>>> based on discussions with users: (1) the current scheme in ykt-clang<br>>> which enables the code in both inner and outer parallel loops
to be<br>>> executed in parallel and (2) a scheme where the outer loop code
is<br>>> executed by one thread and the innermost loop is executed by all<br>>> threads (this was requested by users at one point, I assume this
is<br>>> still the case).<br>>> <br>>> Since ykt-clang only supports the first scheme, when we ran<br>> performance<br>>> tests comparing nested parallelism against no nested parallelism
we<br>>> got anywhere from a 4x slowdown to a 32x speedup depending on: the
ratio<br>>> of outer:inner iterations, the work size in the innermost loop,<br>>> reductions, atomics and memory coalescing. About 80% of the<br>>> cases we tried showed speed-ups, with some showing significant<br>>> speed-ups.<br>>> I would very much be in favour of having at least this scheme<br>>> supported since it looks like it could be useful.<br>> <br>> Interesting. Are these experiments public? I'd be interested to see<br>> the<br>> codes that benefit from nested parallelism.<br>> IIRC OpenACC doesn't have this feature, so I expect these to be corner<br>> cases.<br>> <br>> Regards,<br>> Jonas<br>> <br>> <br>> <br>> Links:<br>> ------<br>> [1] <br>> </font></tt><a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767"><tt><font size=2>https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767</font></tt></a><tt><font size=2><br><br></font></tt><br><br><BR>