<div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10.5pt" ><div dir="ltr" >Hal,</div>
<div dir="ltr" > </div>
<div dir="ltr" >Supporting async for targets is not trivial. Right now, since everything is "blocking" a target can be easily separated into 5 tasks:</div>
<div dir="ltr" > </div>
<ol dir="ltr" > <li>wait for dependences to resolve</li> <li>perform all the copy to the device</li> <li>execute the target </li> <li>perform all the copy from device to host</li> <li>resolve all dependences for other tasks</li></ol>
<div dir="ltr" >Going async ala LOMP has many advantages: all operations are asynchronous, and dependences from target to target tasks are directly enforced on the device. To us, this was the only way that users could effectively hide the high overhead of generating targets, by enqueuing many dependent target tasks on the host, to prime the pipe of targets on the devices.</div>
<div dir="ltr" > </div>
<div dir="ltr" >To do so, I believe you have to do the following.</div>
<ol dir="ltr" > <li>wait for all the host dependences; dependences from other device targets are tabulated but not waited for</li> <li>select a stream (of one stream from a dependent target, if any) and enqueue wait for event for all tabulated dependences</li> <li>enqueue all copy to device on stream (or enqueue sync event for data currently being copied over by other targets)</li> <li>enqueue computation on stream</li> <li>enqueue all copy from device on stream (this is speculative, as ref count may increase by another target executed before the data is actually copied back, but it's legal)</li> <li>cleanup
<ol> <li>blocking: wait for stream to be finished</li> <li>non-blocking: have a callback from CUDA (which involve a separate thread) or have active polling by OpenMP threads when doing nothing and/or before doing a subsequent target task to determine when stream is finished</li> <li>when 1 or 2 above are finished, cleanup the map data structures, resolve dependences for dependent tasks.</li> </ol> </li></ol>
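<div dir="ltr" >Putting Steps 1-6 together, a sketch in CUDA runtime terms could look as follows; TargetTask and the helpers are again hypothetical, and the map-table and memory-pool handling are elided:</div>
<pre>
// Sketch of the async scheme; step numbers match the list above.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct MapEntry { void *hst, *dev; size_t size; };
struct TargetTask {
  std::vector<MapEntry> to_device, from_device;
  std::vector<cudaEvent_t> dep_events;  // from predecessor targets (step 1)
  cudaStream_t stream;                  // selected in step 2
  cudaEvent_t done;                     // recorded after step 5
};

void wait_for_host_dependences(TargetTask &);           // hypothetical
cudaStream_t select_stream(TargetTask &);               // helpers provided
void launch_target_kernel(TargetTask &, cudaStream_t);  // by the tasking
void finalize_target(TargetTask &);                     // runtime/plugin
void defer_finalize(TargetTask &);                      // (steps 6.2/6.3)

void enqueue_target_async(TargetTask &t, bool blocking) {
  wait_for_host_dependences(t);                              // step 1
  t.stream = select_stream(t);                               // step 2
  for (cudaEvent_t e : t.dep_events)
    cudaStreamWaitEvent(t.stream, e, 0);    // chain device-side deps
  for (auto &m : t.to_device)                                // step 3
    cudaMemcpyAsync(m.dev, m.hst, m.size, cudaMemcpyHostToDevice, t.stream);
  launch_target_kernel(t, t.stream);                         // step 4
  for (auto &m : t.from_device)                              // step 5
    cudaMemcpyAsync(m.hst, m.dev, m.size, cudaMemcpyDeviceToHost, t.stream);
  cudaEventCreateWithFlags(&t.done, cudaEventDisableTiming);
  cudaEventRecord(t.done, t.stream);
  if (blocking) {                                            // step 6.1
    cudaStreamSynchronize(t.stream);
    finalize_target(t);
  } else {                                                   // steps 6.2/6.3
    defer_finalize(t);  // completion detected later via cudaEventQuery
  }
}
</pre>
<div dir="ltr" > </div>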
<div dir="ltr" >This is compounded by the fact that async data movements are only performed with pinned memory, and any CUDA memory cannot be allocated directly as it is a synchronizing event. So runtime must handle it's own pool of device and pinned memory, which requires additional work in Steps 3, 5, and 6.3 above.</div>
<div dir="ltr" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div dir="ltr" > </div>
<div dir="ltr" >To perform the cleanup in Step 6, you also need to cache all info associated with a target in a dedicated data structure.</div>
<div dir="ltr" > </div>
<div dir="ltr" >As you may have noticed, if you want some async to work, you have basically to treat all target as async; the synchronous ones differ only by having an explicit wait in Step 6.1. So all this handling is in the critical path.</div>
<div dir="ltr" > </div>
<div dir="ltr" >You will also need to carefully managed CUDA events associated with any explicit data movements, as subsequent target operations may be dependent on an actual memory operation to complete (in either directions). </div>
<div dir="ltr" > </div>
<div dir="ltr" >This has been done in LOMP, was it fun, maybe not, but it's all feasible. </div>
<div dir="ltr" > </div>
<div dir="ltr" >There is a possible saving grace, namely that you could implement async only under unified memory, which would simplify greatly the whole thing: eliminate Steps 3 & 5 above and associated bookkeeping.</div>
<div dir="ltr" > </div>
<div dir="ltr" >However, most application writers that have optimized their code will tell you that unified-only program tend not to work too well, and that hybrid models (copy predictable data, use unified for unstructured data) is likely to deliver better performance. So you could simplify your implementation at the cost of precluding async for the most optimized programs.</div>
<div dir="ltr" > </div>
<div dir="ltr" >Happy to discuss it further, and explore with you alternative implementations.</div>
<div dir="ltr" ><br>Alexandre<br><br>-----------------------------------------------------------------------------------------------------<br><span style="color:#0000CD;" >Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies</span><br><span style="color:#0000CD;" >- research</span>: compiler optimization (OpenMP, GPU, SIMD)<br><span style="color:#0000CD;" >- info:</span> alexe@us.ibm.com http://www.research.ibm.com/people/a/alexe<br><span style="color:#0000CD;" >- phone</span>: 914-945-1812 (work), 914-312-3618 (cell)</div></div></div></div></div></div></div></div>
<div dir="ltr" > </div>
<div dir="ltr" > </div>
<blockquote data-history-content-modified="1" dir="ltr" style="border-left:solid #aaaaaa 2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" >----- Original message -----<br>From: Gheorghe-Teod Bercea/US/IBM<br>To: "Finkel, Hal J." <hfinkel@anl.gov><br>Cc: Alexey Bataev <a.bataev@hotmail.com>, "Doerfert, Johannes" <jdoerfert@anl.gov>, "openmp-dev@lists.llvm.org" <openmp-dev@lists.llvm.org>, Ye Luo <xw111luoye@gmail.com>, Alexandre Eichenberger/Watson/IBM@IBMUS<br>Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams<br>Date: Wed, Mar 20, 2019 1:49 PM<br> <br><font size="2" >I'm adding Alex to this thread. He should be able to shed some light on this issue.</font><br><br><font size="2" >Thanks,</font><br><br><font size="2" >--Doru</font><br><br><br><font color="#5F5F5F" size="2" >From: </font><font size="2" >"Finkel, Hal J." <hfinkel@anl.gov></font><br><font color="#5F5F5F" size="2" >To: </font><font size="2" >Ye Luo <xw111luoye@gmail.com></font><br><font color="#5F5F5F" size="2" >Cc: </font><font size="2" >"openmp-dev@lists.llvm.org" <openmp-dev@lists.llvm.org>, Alexey Bataev <a.bataev@hotmail.com>, Gheorghe-Teod Bercea <gheorghe-teod.bercea@ibm.com>, "Doerfert, Johannes" <jdoerfert@anl.gov></font><br><font color="#5F5F5F" size="2" >Date: </font><font size="2" >03/20/2019 01:13 PM</font><br><font color="#5F5F5F" size="2" >Subject: </font><font size="2" >Re: [Openmp-dev] OpenMP offload implicitly using streams</font>
<hr align="left" size="2" style="color:#8091A5; " width="100%" ><br><br><br><font size="3" >Thanks, Ye. I suppose that I thought it always worked that way :-)</font>
<p><font size="3" >Alexey, Doru, do you know if there's any semantic problem or other concerns with enabling this option and/or making it the default?</font></p>
<p><font size="3" > -Hal</font></p>
<p><font size="3" >On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:</font></p>
<ul style="padding-left: 36pt; margin-left: 0px; list-style-type: none;" > <li><font size="3" >Hi all,</font><br> <font size="3" >After going through the source, I didn't find CUDA stream support.</font><br> <font size="3" >Luckily, I only need to add</font><br> <font size="3" >#define CUDA_API_PER_THREAD_DEFAULT_STREAM</font><br> <font size="3" >before</font><br> <font size="3" >#include <cuda.h></font><br> <font size="3" >in libomptarget/plugins/cuda/src/rtl.cpp</font><br> <font size="3" >Then the multiple target goes to different streams and may execute concurrently.</font><br> <font size="3" >#pragma omp parallel</font><br> <font size="3" >{</font><br> <font size="3" > #pragma omp target</font><br> <font size="3" > {</font><br> <font size="3" > //offload computation</font><br> <font size="3" > }</font><br> <font size="3" >}</font><br> <font size="3" >This is exactly I want.</font><br> <br> <font size="3" >I know the XL compiler uses streams in a different way but achieves similar effects.</font><br> <font size="3" >Is there anyone working on using streams with openmp target in llvm?</font><br> <font size="3" >Will clang-ykt get something similar to XL and upstream to the mainline?</font><br> <br> <font size="3" >If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the cuda rtl, will it be a trouble?</font><br> <font size="3" >As a compiler user, I'd like to have a better solution rather than having a patch just for myself.</font><br> <br> <font size="3" >Best,</font><br> <font size="3" >Ye</font><br> <font size="3" >===================<br> Ye Luo, Ph.D.<br> Computational Science Division & Leadership Computing Facility<br> Argonne National Laboratory</font><br> <br> <br> <font size="3" >Ye Luo <</font><a href="mailto:xw111luoye@gmail.com" target="_blank"><u><font color="#0000FF" size="3" >xw111luoye@gmail.com</font></u></a><font size="3" >> 于2019年3月17日周日 下午2:26写道:</font>
<ul style="padding-left: 9pt; margin-left: 0px; list-style-type: none;" > <li><font size="3" >Hi,</font><br> <font size="3" >How to turn on streams when using OpenMP offload?</font><br> <font size="3" >When different host threads individually start target regions (even not using nowait). The offloaded computation goes to different CUDA streams and may execute concurrently. This is currently available in XL.</font><br> <font size="3" >With Clang, nvprof shows only the run only uses the default stream.</font><br> <font size="3" >Is there a way to do that with Clang?</font><br> <font size="3" >On the other hand,</font><br> <font size="3" >nvcc has option --</font><i><font size="3" >default</font></i><font size="3" >-</font><i><font size="3" >stream per</font></i><font size="3" >-</font><i><font size="3" >thread</font></i><br> <font size="3" >I'm not familar with clang CUDA, is there a similar option?</font><br> <font size="3" >Best,</font><br> <font size="3" >Ye</font><br> <font size="3" >===================<br> Ye Luo, Ph.D.<br> Computational Science Division & Leadership Computing Facility<br> Argonne National Laboratory</font></li> </ul> <br> <tt><font face="" size="3" >_______________________________________________<br> Openmp-dev mailing list</font></tt><br> <a href="mailto:Openmp-dev@lists.llvm.org" target="_blank"><tt><u><font color="#0000FF" face="" size="3" >Openmp-dev@lists.llvm.org</font></u></tt></a><br> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank"><tt><u><font color="#0000FF" face="" size="3" >https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</font></u></tt></a></li></ul><tt><font face="" size="3" >--<br>Hal Finkel<br>Lead, Compiler Technology and Programming Languages<br>Leadership Computing Facility<br>Argonne National Laboratory</font></tt></blockquote>
<div dir="ltr" > </div></div><BR>