<font size=2 face="sans-serif">Hi Hal,</font><br><br><font size=2 face="sans-serif">At least as far as we are aware, the

number of use cases where the nested parallel scheme would be used is quite

small. Most of the use cases of OpenMP on GPUs have a single level of parallelism

which is typically SPMD-like to achieve as much performance as possible.

That said, there is some merit to having a nested parallelism scheme because

when it helps, it typically helps a lot.</font><br><br><font size=2 face="sans-serif">As a departure from ykt-clang, I would

suggest that whichever scheme (or schemes) we decide to use, they should

be applied only at the request of the user. This is because we can do a

better code gen job for more OpenMP patterns when using existing schemes

(generic and SPMD) if we know at compile time that there will be no second

level parallelism in use. This is due to some changes in implementation

in trunk compared to ykt-clang.</font><br><br><font size=2 face="sans-serif">Regarding which scheme to use, there

were two that were floated in discussions with users: (1)

the current scheme in ykt-clang which enables the code in both inner and

outer parallel loops to be executed in parallel and (2) a scheme where

the outer loop code is executed by one thread and the innermost loop is

executed by all threads (this was requested by users at one point, I assume

this is still the case).</font><br><br><font size=2 face="sans-serif">Since ykt-clang only supports the first

scheme, when we ran performance tests comparing nested parallelism against

no nested parallelism we got anywhere from a 4x slowdown to a 32x speedup, depending

on: the ratio of outer to inner iterations, the work size in the innermost

loop, reductions, atomics and memory coalescing. About 80% of the

cases we tried showed speed-ups, some of them significant.</font><br><font size=2 face="sans-serif">I would very much be in favour of having

at least this scheme supported since it looks like it could be useful.</font><br><br><font size=2 face="sans-serif">In terms of timing, we are still tied

up with upstreaming at the moment so we won't be attempting a new code

generation scheme until we are feature complete on the current ones.</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru</font><br><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">Hal Finkel <hfinkel@anl.gov></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif">Gheorghe-Teod Bercea

<Gheorghe-Teod.Bercea@ibm.com>, Jonas Hahnfeld <hahnjo@hahnjo.de></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:      

 </font><font size=1 face="sans-serif">Alexey Bataev <alexey.bataev@ibm.com>,

<openmp-dev@lists.llvm.org></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">09/07/2018 12:35 PM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">Re: [Openmp-dev]

nested parallelism in libomptarget-nvptx</font><br><hr noshade><font size=3>Hi, Doru,</font><p><font size=3>What do you think we should do, upstream, for nested parallelism?

Would it be desirable to have a clang-ykt-like scheme? Something else?</font><p><font size=3>Thanks again,</font><p><font size=3>Hal</font><p><br><font size=3>On 09/07/2018 10:59 AM, Gheorghe-Teod Bercea via Openmp-dev

wrote:</font><br><font size=2 face="sans-serif">Hi Jonas,</font><font size=3><br></font><font size=2 face="sans-serif"><br>The second level of parallelism in clang-ykt uses a scheme where all the

threads in each warp cooperate to execute the workload of the 1st

thread in the warp, then the 2nd, and so on until the workload of each of

the 32 threads in the warp has been completed. The workload of each thread

is always executed by the full warp.<br>You are correct: in trunk, the additional memory that this scheme uses is

not required. For now we would like to keep this functionality in place

so it would be good if you could hide it behind a flag. This will allow

us to easily drop it in the future.</font><font size=3><br></font><font size=2 face="sans-serif"><br>Thanks a lot,</font><font size=3><br></font><font size=2 face="sans-serif"><br>--Doru<br></font><font size=3><br><br><br><br></font><font size=1 color=#5f5f5f face="sans-serif"><br>From:        </font><font size=1 face="sans-serif">Jonas

Hahnfeld </font><a href="mailto:hahnjo@hahnjo.de"><font size=1 color=blue face="sans-serif"><u><hahnjo@hahnjo.de></u></font></a><font size=1 color=#5f5f5f face="sans-serif"><br>To:        </font><a href="mailto:openmp-dev@lists.llvm.org"><font size=1 color=blue face="sans-serif"><u>openmp-dev@lists.llvm.org</u></font></a><font size=1 color=#5f5f5f face="sans-serif"><br>Cc:        </font><font size=1 face="sans-serif">Alexey

Bataev </font><a href="mailto:alexey.bataev@ibm.com"><font size=1 color=blue face="sans-serif"><u><alexey.bataev@ibm.com></u></font></a><font size=1 face="sans-serif">,

Doru Bercea </font><a href="mailto:gheorghe-teod.bercea@ibm.com"><font size=1 color=blue face="sans-serif"><u><gheorghe-teod.bercea@ibm.com></u></font></a><font size=1 face="sans-serif">,

Kelvin Li </font><a href="mailto:kli@ca.ibm.com"><font size=1 color=blue face="sans-serif"><u><kli@ca.ibm.com></u></font></a><font size=1 color=#5f5f5f face="sans-serif"><br>Date:        </font><font size=1 face="sans-serif">09/07/2018

11:31 AM</font><font size=1 color=#5f5f5f face="sans-serif"><br>Subject:        </font><font size=1 face="sans-serif">nested

parallelism in libomptarget-nvptx</font><font size=3><br></font><hr noshade><font size=3><br><br></font><tt><font size=2><br>Hi all,<br><br>I've started some cleanups in libomptarget-nvptx, the OpenMP runtime <br>implementation on Nvidia GPUs. The ultimate motivation is reducing the

<br>memory overhead: At the moment the runtime statically allocates ~660MiB

<br>of global memory. This amount can't be used by applications. This might

<br>not sound like much, but wasting precious memory doesn't sound wise.<br>I found that a portion of 448MiB comes from buffers for data sharing. In

<br>particular they appear to be so large because the code is prepared to <br>handle nested parallelism where every thread would be in the position to

<br>share data with its nested worker threads.<br>From what I've seen so far this doesn't seem to be necessary for Clang

<br>trunk: Nested parallel regions are serialized, so only the initial <br>thread needs to share data with one set of worker threads. That's in <br>line with comments saying that there is no support for nested <br>parallelism.<br><br>However I found that my test applications compiled with clang-ykt <br>support two levels of parallelism. My guess would be that this is <br>related to "convergent parallelism": parallel.cu explains that

this is <br>meant for a "team of threads in a warp only". And indeed, each

nested <br>parallel region seems to be executed by 32 threads.<br>I'm not really sure how this works because I seem to get one OpenMP <br>thread per CUDA thread in the outer parallel region. So where are the <br>nested worker threads coming from?<br><br>In any case: If my analysis is correct, I'd like to propose adding a <br>CMake flag which disables this (seemingly) legacy support [1]. That <br>would avoid the memory overhead for users of Clang trunk and enable <br>future optimizations (I think).<br>Thoughts, opinions?<br><br>Cheers,<br>Jonas<br><br><br>1: Provided that IBM still wants to keep the code and we can't just go

<br>ahead and drop it. I guess that this can happen at some point in time,

<br>but I'm not sure if we are in that position right now.<br></font></tt><font size=3><br><br><br><br><br></font><br><tt><font size=3>_______________________________________________<br>Openmp-dev mailing list<br></font></tt><a href="mailto:Openmp-dev@lists.llvm.org"><tt><font size=3 color=blue><u>Openmp-dev@lists.llvm.org</u></font></tt></a><tt><font size=3><br></font></tt><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev"><tt><font size=3 color=blue><u>http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</u></font></tt></a><tt><font size=3><br></font></tt><br><br><tt><font size=3>-- <br>Hal Finkel<br>Lead, Compiler Technology and Programming Languages<br>Leadership Computing Facility<br>Argonne National Laboratory</font></tt><br><br><BR>
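<br><font size=2 face="sans-serif">For reference, the two-level pattern discussed in this thread can be sketched in OpenMP C++ roughly as follows. This is only an illustrative sketch, not code from clang-ykt or trunk: the function row_sums and its parameters are hypothetical names chosen for the example. When built without OpenMP support the pragmas are ignored and both loops run serially, so the result does not depend on which nested scheme (if any) the compiler applies.</font><br>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sum each row of a rows x cols matrix stored row-major.
std::vector<double> row_sums(const std::vector<double>& a, int rows, int cols) {
    assert(a.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<double> out(rows, 0.0);
    const double* pa = a.data();
    double* po = out.data();

    // First level of parallelism: one outer iteration per row.
    #pragma omp target parallel for map(to: pa[0:rows * cols]) map(tofrom: po[0:rows])
    for (int i = 0; i < rows; ++i) {
        double sum = 0.0;
        // Second (nested) level. Whether this region really runs in parallel
        // on the device depends on the scheme the compiler and runtime use.
        #pragma omp parallel for reduction(+ : sum)
        for (int j = 0; j < cols; ++j) {
            sum += pa[i * cols + j];
        }
        po[i] = sum;
    }
    return out;
}
```

<font size=2 face="sans-serif">Under scheme (1) from Doru's message, the code in both loops would be executed in parallel. Under scheme (2), the outer loop code would be executed by one thread and only the innermost loop by all threads. Per Jonas's analysis, trunk at the time of this thread serialized the nested region.</font><br>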