<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I'm not sure about the API; most likely only some internal work is
required. It would be better to ask Alex Eichenberger, who knows more
about this.<br>
</p>
<pre class="moz-signature" cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div class="moz-cite-prefix">On 31.10.2019 4:36 PM, Finkel, Hal J.
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:6a4feb16-65dc-a337-8eda-aeca5f46d6f0@anl.gov">
<p><br>
</p>
<div class="moz-cite-prefix">On 10/31/19 3:06 PM, Alexey Bataev
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:VI1PR09MB39821ED7EB17C852556C67E296630@VI1PR09MB3982.eurprd09.prod.outlook.com">
<p>Hope to send this message from the main dev e-mail this time
:)</p>
<p><br>
</p>
<p>As for the memory: it depends on the number of kernels
you have. All the memory in the kernels that must be
globalized is squashed into a union, so the kernels share one
buffer. With streams, we need to use a separate structure for
each kernel. In addition, we can no longer use shared memory
for this buffer, again because of possible conflicts. <br>
</p>
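<p>A minimal sketch of the size trade-off described above. This is
not the layout Clang actually emits; the type names KernelAGlobals,
KernelBGlobals, SharedBuffer and PerKernelBuffers are hypothetical
and only illustrate why a union is enough when kernels run one at a
time, while concurrent kernels need storage that does not
overlap.</p>

```cpp
// Hypothetical records of the variables each kernel had to globalize.
struct KernelAGlobals { double tmp[32]; };
struct KernelBGlobals { int counters[16]; };

// Serialized kernels: only one record is live at a time, so a single
// buffer sized for the largest record can be reused by every kernel.
union SharedBuffer {
  KernelAGlobals a;
  KernelBGlobals b;
};

// Concurrent kernels (streams): records may be live simultaneously, so
// each kernel needs its own non-overlapping storage.
struct PerKernelBuffers {
  KernelAGlobals a;
  KernelBGlobals b;
};

const unsigned long shared_bytes = sizeof(SharedBuffer);
const unsigned long separate_bytes = sizeof(PerKernelBuffers);
```

<p>In this sketch the union costs as much as the largest record, while
the per-kernel layout grows with the sum of all records, which is why
the extra memory depends on the number of kernels.</p>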
<p><br>
</p>
<p>We could add a new compiler option so that only selected files
are compiled with stream support and use a unique memory buffer
for their globalized variables. Some work in libomptarget is
also required, of course.<br>
</p>
</blockquote>
<p><br>
</p>
<p>Do we also need some kind of libomptarget API change in order
to communicate the fact that it's allowed to run multiple target
regions concurrently?</p>
<p><br>
</p>
<p>Thanks again,</p>
<p>Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:VI1PR09MB39821ED7EB17C852556C67E296630@VI1PR09MB3982.eurprd09.prod.outlook.com">
<p><br>
</p>
<div class="moz-cite-prefix">On 31.10.2019 3:58 PM, Finkel, Hal J.
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:dbb670c9-376e-aae6-17af-c363afa52960@anl.gov">
<p><br>
</p>
<div class="moz-cite-prefix">On 10/31/19 10:54 AM, Luo, Ye
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:DM6PR09MB3548BF5277EBEC67300B8F36A3630@DM6PR09MB3548.namprd09.prod.outlook.com">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Hal,</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
My experience with LLVM/Clang so far shows:</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
1. All target offload is blocking and synchronous and uses
the default stream. nowait is not supported.</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
2. All memory transfers invoke cudaMemcpy. There
are no async calls.</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
3. In the past I experimented with turning on <samp>CUDA_API_PER_THREAD_DEFAULT_STREAM</samp>
in libomptarget.</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I then used multiple host threads, each doing its own blocking
synchronous offload. I got it sort of running and saw
multiple streams, but the code crashed with memory
corruption, probably due to a data race in libomptarget.</div>
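<p>The experiment in item 3 can be sketched roughly as follows. This
is only an illustration of the pattern, not the code that was actually
run: the helper name offload_chunks and the chunking scheme are
assumptions. Each host thread issues an ordinary blocking target
region for its own chunk; whether those regions really land on
separate CUDA streams depends on how libomptarget was built, and per
the report above this was not reliable at the time.</p>

```cpp
// Hypothetical helper: splits data into chunks and lets each host thread
// offload one chunk with a plain (blocking) target region.
void offload_chunks(float *data, int n, int num_chunks) {
  if (num_chunks <= 0) return;
  const int chunk = (n + num_chunks - 1) / num_chunks;  // ceiling division

  #pragma omp parallel for
  for (int c = 0; c < num_chunks; ++c) {
    const int begin = c * chunk;
    const int end = (begin + chunk < n) ? begin + chunk : n;
    const int len = end - begin;
    if (len <= 0) continue;  // trailing chunks may be empty
    float *q = data + begin;

    // Blocking offload issued by this host thread for its own chunk.
    #pragma omp target teams distribute parallel for map(tofrom: q[0:len])
    for (int i = 0; i < len; ++i) q[i] += 1.0f;
  }
}
```

<p>Compiled without OpenMP, the pragmas are ignored and the function
simply runs on the host, which makes it easy to check the result
before looking at stream behavior in the profiler.</p>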
</blockquote>
<p><br>
</p>
<p>Thanks, Ye. That's consistent with Alexey's comments.</p>
<p><br>
</p>
<p>Is there already a bug open on this? If not, we should open
one.</p>
<p><br>
</p>
<p>Alexey, the buffer-reuse optimizations in Clang that you
mentioned, how much memory/overhead do they save? Is it
worth keeping them in some mode?</p>
<p><br>
</p>
<p> -Hal<br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:DM6PR09MB3548BF5277EBEC67300B8F36A3630@DM6PR09MB3548.namprd09.prod.outlook.com">
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Best,</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Ye<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica,
sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font
style="font-size:11pt" face="Calibri, sans-serif"
color="#000000"><b>From:</b> Finkel, Hal J.
<a class="moz-txt-link-rfc2396E"
href="mailto:hfinkel@anl.gov" moz-do-not-send="true">
&lt;hfinkel@anl.gov&gt;</a><br>
<b>Sent:</b> Wednesday, October 30, 2019 1:40 PM<br>
<b>To:</b> Alessandro Gabbana <a
class="moz-txt-link-rfc2396E"
href="mailto:gbblsn@unife.it" moz-do-not-send="true">
&lt;gbblsn@unife.it&gt;</a>; <a
class="moz-txt-link-abbreviated"
href="mailto:cfe-dev@lists.llvm.org"
moz-do-not-send="true">
cfe-dev@lists.llvm.org</a> <a
class="moz-txt-link-rfc2396E"
href="mailto:cfe-dev@lists.llvm.org"
moz-do-not-send="true">
&lt;cfe-dev@lists.llvm.org&gt;</a>; Luo, Ye <a
class="moz-txt-link-rfc2396E"
href="mailto:yeluo@anl.gov" moz-do-not-send="true">
&lt;yeluo@anl.gov&gt;</a>; Doerfert, Johannes <a
class="moz-txt-link-rfc2396E"
href="mailto:jdoerfert@anl.gov" moz-do-not-send="true">
&lt;jdoerfert@anl.gov&gt;</a><br>
<b>Subject:</b> Re: [cfe-dev] openmp 4.5 and cuda
streams</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span
style="font-size:11pt;">
<div class="PlainText">[+Ye, Johannes]<br>
<br>
I recall that we've also observed this behavior. Ye,
Johannes, we had a <br>
work-around and a patch, correct?<br>
<br>
-Hal<br>
<br>
On 10/30/19 12:28 PM, Alessandro Gabbana via cfe-dev
wrote:<br>
> Dear All,<br>
><br>
> I'm using clang 9.0.0 to compile a code which
offloads sections of a <br>
> code on a GPU using the openmp target
construct.<br>
> I also use the nowait clause to overlap the
execution of certain <br>
> kernels and/or host&lt;-&gt;device memory
transfers.<br>
> However, using the nvidia profiler I've noticed
that when I compile <br>
> the code with clang only one cuda stream is
active,<br>
> and therefore the execution gets serialized. On
the other hand, when <br>
> compiling with XLC I see that kernels are
executed<br>
> on different streams. I could not understand if
this is the expected <br>
> behavior (e.g. the nowait clause is currently
not supported),<br>
> or if I'm missing something. I'm using a NVIDIA
Tesla P100 GPU and <br>
> compiling with the following options:<br>
><br>
> -target x86_64-pc-linux-gnu -fopenmp <br>
> -fopenmp-targets=nvptx64-nvidia-cuda <br>
> -Xopenmp-target=nvptx64-nvidia-cuda
-march=sm_60<br>
><br>
> best wishes<br>
><br>
> Alessandro<br>
><br>
> _______________________________________________<br>
> cfe-dev mailing list<br>
> <a class="moz-txt-link-abbreviated"
href="mailto:cfe-dev@lists.llvm.org"
moz-do-not-send="true">
cfe-dev@lists.llvm.org</a><br>
> <a
href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev"
moz-do-not-send="true">
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
<br>
-- <br>
Hal Finkel<br>
Lead, Compiler Technology and Programming Languages<br>
Leadership Computing Facility<br>
Argonne National Laboratory<br>
<br>
</div>
</span></font></div>
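<p>For reference, the pattern described in the original question looks
roughly like the sketch below. It is an assumption about the shape of
the code, not Alessandro's actual source; the function name scale_both
and the arrays are hypothetical. Two independent target regions are
marked nowait, so an implementation is allowed to overlap them, and a
taskwait collects both before the results are used. According to this
thread, clang 9.0.0 still runs them one after another on a single
stream.</p>

```cpp
// Hypothetical example: two independent offloaded loops that an OpenMP
// implementation may overlap because both target regions are nowait.
void scale_both(double *a, int na, double *b, int nb, double factor) {
  #pragma omp target teams distribute parallel for nowait map(tofrom: a[0:na])
  for (int i = 0; i < na; ++i) a[i] *= factor;

  #pragma omp target teams distribute parallel for nowait map(tofrom: b[0:nb])
  for (int i = 0; i < nb; ++i) b[i] *= factor;

  // Wait for both deferred target tasks before the caller reads a and b.
  #pragma omp taskwait
}
```

<p>Built with the options quoted above, the profiler timeline shows
whether the two kernels and their transfers overlap or are
serialized.</p>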
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</body>
</html>