<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Hi, Doru,</p>
<p>What do you think we should do, upstream, for nested parallelism?
Would it be desirable to have a clang-ykt-like scheme? Something
else?</p>
<p>Thanks again,</p>
<p>Hal<br>
</p>
<br>
<div class="moz-cite-prefix">On 09/07/2018 10:59 AM, Gheorghe-Teod
Bercea via Openmp-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:OF7699DBA4.02E1D3DB-ON00258301.0056D7AE-85258301.0057E1CB@notes.na.collabserv.com">
<font size="2" face="sans-serif">Hi Jonas,</font><br>
<br>
<font size="2" face="sans-serif">The second level of parallelism
in clang-ykt uses a scheme where all the threads in each warp
cooperate to execute the workload of the 1st thread in the warp,
then the 2nd, and so on, until the workload of each of the 32
threads in the warp has been completed.
The workload of each thread is always executed by the full warp.</font><br>
<font size="2" face="sans-serif">You are correct: in trunk, the
additional
memory that this scheme uses is not required. For now we would
like to
keep this functionality in place, so it would be good if you
could hide
it behind a flag. This will allow us to easily drop it in the
future.</font><br>
<br>
<font size="2" face="sans-serif">Thanks a lot,</font><br>
<br>
<font size="2" face="sans-serif">--Doru</font><br>
<font size="2" face="sans-serif"><br>
</font><br>
<br>
<br>
<br>
<font color="#5f5f5f" size="1" face="sans-serif">From:
</font><font size="1" face="sans-serif">Jonas Hahnfeld
<a class="moz-txt-link-rfc2396E" href="mailto:hahnjo@hahnjo.de"><hahnjo@hahnjo.de></a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">To:
</font><font size="1" face="sans-serif"><a class="moz-txt-link-abbreviated" href="mailto:openmp-dev@lists.llvm.org">openmp-dev@lists.llvm.org</a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Cc:
</font><font size="1" face="sans-serif">Alexey Bataev
<a class="moz-txt-link-rfc2396E" href="mailto:alexey.bataev@ibm.com"><alexey.bataev@ibm.com></a>,
Doru Bercea <a class="moz-txt-link-rfc2396E" href="mailto:gheorghe-teod.bercea@ibm.com"><gheorghe-teod.bercea@ibm.com></a>, Kelvin Li
<a class="moz-txt-link-rfc2396E" href="mailto:kli@ca.ibm.com"><kli@ca.ibm.com></a></font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Date:
</font><font size="1" face="sans-serif">09/07/2018 11:31 AM</font><br>
<font color="#5f5f5f" size="1" face="sans-serif">Subject:
</font><font size="1" face="sans-serif">nested parallelism
in libomptarget-nvptx</font><br>
<hr noshade="noshade"><br>
<br>
<br>
<tt><font size="2">Hi all,<br>
<br>
I've started some cleanups in libomptarget-nvptx, the OpenMP
runtime <br>
implementation on Nvidia GPUs. The ultimate motivation is
reducing the
<br>
memory overhead: at the moment the runtime statically
allocates ~660MiB
<br>
of global memory, none of which can be used by applications.
This might
<br>
not sound like much, but wasting precious memory doesn't sound
wise.<br>
I found that a portion of 448MiB comes from buffers for data
sharing. In
<br>
particular, they appear to be so large because the code is
prepared to <br>
handle nested parallelism where every thread would be in a
position to
<br>
share data with its nested worker threads.<br>
From what I've seen so far, this doesn't seem to be necessary
for Clang
<br>
trunk: nested parallel regions are serialized, so only the
initial <br>
thread needs to share data with one set of worker threads.
That's in <br>
line with comments saying that there is no support for nested
<br>
parallelism.<br>
<br>
However, I found that my test applications compiled with
clang-ykt <br>
support two levels of parallelism. My guess would be that this
is <br>
related to "convergent parallelism": parallel.cu explains that
this is <br>
meant for a "team of threads in a warp only". And indeed, each
nested <br>
parallel region seems to be executed by 32 threads.<br>
I'm not really sure how this works, because I seem to get one
OpenMP <br>
thread per CUDA thread in the outer parallel region. So where
are the <br>
nested worker threads coming from?<br>
<br>
In any case: if my analysis is correct, I'd like to propose
adding a <br>
CMake flag that disables this (seemingly) legacy support [1].
That <br>
would avoid the memory overhead for users of Clang trunk and
enable <br>
future optimizations (I think).<br>
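Such a flag could look along these lines. This is only a sketch: the option name and the macro it defines are hypothetical, not existing libomptarget options:

```cmake
# Hypothetical sketch -- the option and macro names are illustrative only.
option(LIBOMPTARGET_NVPTX_NESTED_DATA_SHARING
  "Keep the data-sharing buffers needed for clang-ykt-style nested parallelism"
  OFF)

if (LIBOMPTARGET_NVPTX_NESTED_DATA_SHARING)
  add_definitions(-DOMPTARGET_NVPTX_NESTED_DATA_SHARING)
endif()
```

Defaulting it to OFF would give trunk users the memory savings while letting clang-ykt users opt back in.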
Thoughts, opinions?<br>
<br>
Cheers,<br>
Jonas<br>
<br>
<br>
1: Provided that IBM still wants to keep the code and we can't
just go
<br>
ahead and drop it. I guess that this can happen at some point
in time,
<br>
but I'm not sure if we are in that position right now.<br>
<br>
</font></tt><br>
<br>
<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Openmp-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Openmp-dev@lists.llvm.org">Openmp-dev@lists.llvm.org</a>
<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>