<font size=2 face="sans-serif">Hi Jonas,</font><br><br><font size=2 face="sans-serif">You have to remember that clang-ykt

can decide to use more efficient code generation schemes when it deems

it safe to do so, for example SPMD mode instead of generic mode. Generic mode

uses the master-worker scheme, whereas SPMD mode has all threads

do the same thing, thus avoiding the master-worker scheme completely.</font><br><br><font size=2 face="sans-serif">The activation of all threads in those

two regions was regarded as an optimization. It is always safe to activate

all threads if the code in the teams-distribute-only region does not contain

side effects. For example, if all you're doing is declaring some local variables

you can go ahead and run fully parallel. This was kind of like an SPMD-ization

of the nested parallelism code. Doing it this way is a lot faster since

you don't have to use the master-worker scheme for the first level of parallelism,

which has an overhead that the experiments aim to avoid.</font><br><br><font size=2 face="sans-serif">In your second comment you are now circling

back to exactly the point I made at the start of the first e-mail I sent

when I was talking about the limited number of use cases for nested parallelism.

As for the pattern you're really asking for (with the separate teams distribute),

I don't have any benchmarks to suggest for it (this doesn't mean

that someone somewhere doesn't have one).</font><br><br><font size=2 face="sans-serif">Remember that you can combine directives,

so there's no need to have a separate teams distribute. These patterns

are far more common:</font><br><br><font size=2 face="sans-serif">#pragma omp target teams distribute

parallel for</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif">   // all threads active</font><br><font size=2 face="sans-serif">   #pragma omp parallel for</font><br><font size=2 face="sans-serif">   {</font><br><font size=2 face="sans-serif">       // all threads

active - second level parallelism</font><br><font size=2 face="sans-serif">   }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">or like this:</font><br><br><font size=2 face="sans-serif">#pragma omp target teams distribute

parallel for</font><br><font size=2 face="sans-serif">{</font><br><font size=2 face="sans-serif">   // all threads active</font><br><font size=2 face="sans-serif">   #pragma omp simd</font><br><font size=2 face="sans-serif">   {</font><br><font size=2 face="sans-serif">       // all threads

active - second level parallelism</font><br><font size=2 face="sans-serif">   }</font><br><font size=2 face="sans-serif">}</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru<br></font><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">Jonas Hahnfeld <hahnjo@hahnjo.de></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif">Gheorghe-Teod Bercea

<Gheorghe-Teod.Bercea@ibm.com></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:      

 </font><font size=1 face="sans-serif">Alexey Bataev <alexey.bataev@ibm.com>,

Hal Finkel <hfinkel@anl.gov>, openmp-dev@lists.llvm.org</font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">09/10/2018 10:28 AM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">Re: [Openmp-dev]

nested parallelism in libomptarget-nvptx</font><br><hr noshade><br><br><br><tt><font size=2>Hi Doru,<br><br>On 2018-09-10 15:32, Gheorghe-Teod Bercea wrote:<br>> Hi Jonas,<br>> <br>> The experiments in the paper that are under the nested parallelism<br>> section really do use the nested parallelism scheme. "teams<br>> distribute" activated all the threads in the team.<br><br>I disagree: Only the team master executes the loop body of a "teams

<br>distribute" region. CUDA activates all (CUDA) threads at kernel launch,

<br>but that's really not the point.<br><br>> Nested parallelism is activated every time you have an outer region<br>> with all threads active, calling an inner region that needs to have<br>> all threads active. No matter which directives you assign the second<br>> level parallelism to, the scheme for it will use the warp-wise<br>> execution.<br>> <br>> If  you have:<br>> <br>> #target teams distribute<br>> {<br>>     // all threads active<br><br>This looks like an error? It's the same directive as below, but exhibits

<br>a different behavior?<br><br>>     # parallel for<br>>     {<br>>         // all threads active - this uses nested

parallelism since it<br>> was called from a region where all threads were active<br>>     }<br>> }<br>> <br>> # target teams distribute<br>> {<br>>      // one thread per team active<br>>      # parallel for<br>>      {<br>>         // all threads active<br>>         # parallel for<br>>         {<br>>             // all threads active -

this uses nested parallelism since<br>> it was called from a region where all threads are active<br>>         }<br>>      }<br>> }<br><br>That's the pattern I'm looking for. Can you link me to a benchmark that

<br>uses this scheme?<br><br>Jonas<br><br></font></tt><br><br><BR>
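<font size=2 face="sans-serif">For reference, the combined-directive pattern with an inner simd loop mentioned in the reply could be written as compilable C roughly as follows. This is only a minimal sketch: the function name scale_rows, the row-major float layout, and the map clause are assumptions made for illustration and do not come from the thread or from any benchmark. Whether a particular compiler actually selects SPMD code generation for this loop nest is up to that compiler.</font><br><br>

```c
/* Hypothetical helper (not from the thread): multiplies every element of a
 * rows x cols row-major matrix by factor. */
void scale_rows(float *data, int rows, int cols, float factor)
{
    /* Combined directive: there is no separate "teams distribute" region,
     * so no part of the body is executed by the team master alone. */
    #pragma omp target teams distribute parallel for map(tofrom: data[0:rows * cols])
    for (int r = 0; r < rows; ++r) {
        /* Second level: the inner loop is marked simd instead of opening
         * another parallel region. */
        #pragma omp simd
        for (int c = 0; c < cols; ++c) {
            data[r * cols + c] *= factor;
        }
    }
}
```

<font size=2 face="sans-serif">Built without OpenMP support, the pragmas are ignored and the loops run serially with the same result, which makes it easy to check the code on the host before trying an offloading build.</font><br>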