[Openmp-dev] nested parallelism in libomptarget-nvptx

Jonas Hahnfeld via Openmp-dev openmp-dev at lists.llvm.org
Mon Sep 10 09:44:43 PDT 2018


On 2018-09-10 17:05, Gheorghe-Teod Bercea wrote:
> Hi Jonas,
> 
> You have to remember that clang-ykt can decide to use more efficient
> code generation schemes when it deems it safe to do so, for example
> SPMD mode versus generic mode. Generic mode uses the master-worker
> scheme, whereas SPMD mode has all threads execute the same code, thus
> avoiding the master-worker scheme completely.
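> 
> As a rough sketch (loop bounds and array names here are illustrative,
> not from the paper), the combined construct below can be lowered to
> SPMD mode, while the split one falls back to generic mode because of
> the sequential statements around the inner parallel region:
> 
> // SPMD mode: every thread executes the loop body directly.
> #pragma omp target teams distribute parallel for
> for (int i = 0; i < n; ++i)
>   a[i] = b[i] + c[i];
> 
> // Generic mode: the sequential statements run on the team master
> // only; the other threads wait as workers for the inner region.
> #pragma omp target teams distribute
> for (int i = 0; i < n; ++i) {
>   double sum = 0.0;                  // team master only
>   #pragma omp parallel for reduction(+ : sum)
>   for (int j = 0; j < m; ++j)        // workers activated here
>     sum += b[i * m + j];
>   a[i] = sum;                        // team master only
> }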
> 
> The activation of all threads in those two regions was regarded as an
> optimization. It is always safe to activate all threads if the code in
> the teams distribute-only region does not contain side effects. For
> example, if all you're doing is declaring some local variables, you
> can go ahead and run fully parallel. This was effectively an
> SPMD-ization of the nested parallelism code. Doing it this way is a
> lot faster since you avoid the master-worker scheme for the first
> level of parallelism, whose overhead is exactly what the experiments
> aim to avoid.
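> 
> For instance (a minimal sketch with illustrative names), the
> sequential part here only declares a local variable, so executing it
> redundantly on every thread changes nothing and the whole region can
> be SPMD-ized:
> 
> #pragma omp target teams distribute
> for (int i = 0; i < n; ++i) {
>   int base = i * m;   // local declaration only: no side effects,
>                       // safe to run on all threads redundantly
>   #pragma omp parallel for
>   for (int j = 0; j < m; ++j)
>     a[base + j] = b[base + j] + c[base + j];
> }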
> 
> In your second comment you are circling back to exactly the point I
> made at the start of my first e-mail, when I was talking about the
> limited number of use cases for nested parallelism. For the pattern
> you're really asking for (with the separate teams distribute), I don't
> have any benchmarks to suggest (which doesn't mean that someone
> somewhere doesn't have one).
> 
> Remember that you can combine directives so there's no need to have a
> separate teams distribute. These patterns are far more common:
> 
> #pragma omp target teams distribute parallel for
> for (int i = 0; i < n; ++i) {
>    // all threads active
>    #pragma omp parallel for
>    for (int j = 0; j < m; ++j) {
>        // all threads active - second level parallelism
>    }
> }
> 
> or like this:
> 
> #pragma omp target teams distribute parallel for
> for (int i = 0; i < n; ++i) {
>    // all threads active
>    #pragma omp simd
>    for (int j = 0; j < m; ++j) {
>        // all threads active - second level parallelism
>    }
> }

So if they are common, do you have benchmarks that use them? Is it 
possible to make some of the codes public, please?

Jonas

> 
> Thanks,
> 
> --Doru
> 
> From:        Jonas Hahnfeld <hahnjo at hahnjo.de>
> To:        Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>
> Cc:        Alexey Bataev <alexey.bataev at ibm.com>, Hal Finkel
> <hfinkel at anl.gov>, openmp-dev at lists.llvm.org
> Date:        09/10/2018 10:28 AM
> Subject:        Re: [Openmp-dev] nested parallelism in
> libomptarget-nvptx
> 
> -------------------------
> 
> Hi Doru,
> 
> On 2018-09-10 15:32, Gheorghe-Teod Bercea wrote:
>> Hi Jonas,
>> 
>> The experiments in the paper that are under the nested parallelism
>> section really do use the nested parallelism scheme. "teams
>> distribute" activated all the threads in the team.
> 
> I disagree: Only the team master executes the loop body of a "teams
> distribute" region. CUDA activates all (CUDA) threads at kernel
> launch, but that's really not the point.
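> 
> A quick way to see this (a sketch, not from any benchmark; "n" is
> illustrative): inside a plain "teams distribute" region,
> omp_get_thread_num() should always print 0, because each iteration
> runs on the team master:
> 
> #include <omp.h>
> #include <stdio.h>
> 
> void check(int n) {
>   #pragma omp target teams distribute
>   for (int i = 0; i < n; ++i)
>     printf("i=%d: thread %d of team %d\n",
>            i, omp_get_thread_num(), omp_get_team_num());
> }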
> 
>> Nested parallelism is activated every time an outer region with all
>> threads active calls an inner region that also needs all threads
>> active. No matter which directives you assign the second level of
>> parallelism to, the scheme for it will use warp-wise execution.
>> 
>> If you have:
>> 
>> #pragma omp target teams distribute
>> for (int i = 0; i < n; ++i) {
>>     // all threads active
> 
> This looks like an error? It's the same directive as below, but
> exhibits a different behavior?
> 
>>     #pragma omp parallel for
>>     for (int j = 0; j < m; ++j) {
>>         // all threads active - this uses nested parallelism since
>>         // it was called from a region where all threads were active
>>     }
>> }
>> 
>> #pragma omp target teams distribute
>> for (int i = 0; i < n; ++i) {
>>      // one thread per team active
>>      #pragma omp parallel for
>>      for (int j = 0; j < m; ++j) {
>>         // all threads active
>>         #pragma omp parallel for
>>         for (int k = 0; k < p; ++k) {
>>             // all threads active - this uses nested parallelism
>>             // since it was called from a region where all threads
>>             // are active
>>         }
>>      }
>> }
> 
> That's the pattern I'm looking for. Can you link me to a benchmark
> that uses this scheme?
> 
> Jonas

