[Openmp-dev] OpenMP offload implicitly using streams

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Wed Mar 20 12:28:39 PDT 2019

Thank you Alex for the explanation. It is a very nice read to understand
how to fully async offload working. I believe this is the eventual goal.
But it takes time to get in.

At the moment, I'm looking for a way which just needs only blocking offload
but gets concurrent execution.
My code pattern is
#pragma omp parallel
{ // parallel is at the very high level of the code hierarchy and contains
loops over target regions
  #pragma omp target
  { // target is very local
    //offload computation
Using 1 stream per host thread easily achieves concurrent execution.

I noticed some presentations showing a pipelining offload example using
slide 30.
If we enable one stream per thread. The pipelining can work as well even
with blocking offload.
#pragma omp parallel for
for(int i=0; i<nblocks; i++)
  #pragma omp target
    //offload the computation of one block

So I'm thinking of turning on one stream per thread to gain something but
I'm not aware of any negative side.

Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

Alexandre Eichenberger <alexe at us.ibm.com> 于2019年3月20日周三 下午1:26写道:

> Hal,
> Supporting async for targets is not trivial. Right now, since everything
> is "blocking" a target can be easily separated into 5 tasks:
>    1. wait for dependences to resolve
>    2. perform all the copy to the device
>    3. execute the target
>    4. perform all the copy from device to host
>    5. resolve all dependences for other tasks
> Going async ala LOMP has many advantages: all operations are asynchronous,
> and dependences from target to target tasks are directly enforced on the
> device. To us, this was the only way that users could effectively hide the
> high overhead of generating targets, by enqueuing many dependent target
> tasks on the host, to prime the pipe of targets on the devices.
> To do so, I believe you have to do the following.
>    1. wait for all the host dependences; dependences from other device
>    targets are tabulated but not waited for
>    2. select a stream (of one stream from a dependent target, if any)
>    and enqueue wait for event for all tabulated dependences
>    3. enqueue all copy to device on stream (or enqueue sync event for
>    data currently being copied over by other targets)
>    4. enqueue computation on stream
>    5. enqueue all copy from device on stream (this is speculative, as ref
>    count may increase by another target executed before the data is actually
>    copied back, but it's legal)
>    6. cleanup
>       1. blocking: wait for stream to be finished
>       2. non-blocking: have a callback from CUDA (which involve a
>       separate thread) or have active polling by OpenMP threads when doing
>       nothing and/or before doing a subsequent target task to determine when
>       stream is finished
>       3. when 1 or 2 above are finished, cleanup the map data structures,
>       resolve dependences for dependent tasks.
> This is compounded by the fact that async data movements are only
> performed with pinned memory, and any CUDA memory cannot be allocated
> directly as it is a synchronizing event. So runtime must handle it's own
> pool of device and pinned memory, which requires additional work in Steps
> 3, 5, and 6.3 above.
> To perform the cleanup in Step 6, you also need to cache all info
> associated with a target in a dedicated data structure.
> As you may have noticed, if you want some async to work, you have
> basically to treat all target as async;  the synchronous ones differ only
> by having an explicit wait in Step 6.1. So all this handling is in the
> critical path.
> You will also need to carefully managed CUDA events associated with any
> explicit data movements, as subsequent target operations may be dependent
> on an actual memory operation to complete (in either directions).
> This has been done in LOMP, was it fun, maybe not, but it's all feasible.
> There is a possible saving grace, namely that you could implement async
> only under unified memory, which would simplify greatly the whole thing:
> eliminate Steps 3 & 5 above and associated bookkeeping.
> However, most application writers that have optimized their code will tell
> you that unified-only program tend not to work too well, and that hybrid
> models (copy predictable data, use unified for unstructured data) is likely
> to deliver better performance. So you could simplify your implementation at
> the cost of precluding async for the most optimized programs.
> Happy to discuss it further, and explore with you alternative
> implementations.
> Alexandre
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
> ----- Original message -----
> From: Gheorghe-Teod Bercea/US/IBM
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexey Bataev <a.bataev at hotmail.com>, "Doerfert, Johannes" <
> jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>,
> Ye Luo <xw111luoye at gmail.com>, Alexandre Eichenberger/Watson/IBM at IBMUS
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 1:49 PM
> I'm adding Alex to this thread. He should be able to shed some light on
> this issue.
> Thanks,
> --Doru
> [image: Inactive hide details for "Finkel, Hal J." ---03/20/2019 01:13:33
> PM---Thanks, Ye. I suppose that I thought it always worked th]"Finkel,
> Hal J." ---03/20/2019 01:13:33 PM---Thanks, Ye. I suppose that I thought it
> always worked that way :-) Alexey, Doru, do you know if ther
> From: "Finkel, Hal J." <hfinkel at anl.gov>
> To: Ye Luo <xw111luoye at gmail.com>
> Cc: "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>, Alexey
> Bataev <a.bataev at hotmail.com>, Gheorghe-Teod Bercea <
> gheorghe-teod.bercea at ibm.com>, "Doerfert, Johannes" <jdoerfert at anl.gov>
> Date: 03/20/2019 01:13 PM
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> ------------------------------
> Thanks, Ye. I suppose that I thought it always worked that way :-)
> Alexey, Doru, do you know if there's any semantic problem or other
> concerns with enabling this option and/or making it the default?
>  -Hal
> On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:
>    - Hi all,
>    After going through the source, I didn't find CUDA stream support.
>    Luckily, I only need to add
>    before
>    #include <cuda.h>
>    in libomptarget/plugins/cuda/src/rtl.cpp
>    Then the multiple target goes to different streams and may execute
>    concurrently.
>    #pragma omp parallel
>    {
>      #pragma omp target
>      {
>        //offload computation
>      }
>    }
>    This is exactly I want.
>    I know the XL compiler uses streams in a different way but achieves
>    similar effects.
>    Is there anyone working on using streams with openmp target in llvm?
>    Will clang-ykt get something similar to XL and upstream to the
>    mainline?
>    If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the cuda
>    rtl, will it be a trouble?
>    As a compiler user, I'd like to have a better solution rather than
>    having a patch just for myself.
>    Best,
>    Ye
>    ===================
>    Ye Luo, Ph.D.
>    Computational Science Division & Leadership Computing Facility
>    Argonne National Laboratory
>    Ye Luo <*xw111luoye at gmail.com* <xw111luoye at gmail.com>> 于2019年3月17日周日
>    下午2:26写道:
>       - Hi,
>       How to turn on streams when using OpenMP offload?
>       When different host threads individually start target regions (even
>       not using nowait). The offloaded computation goes to different CUDA streams
>       and may execute concurrently. This is currently available in XL.
>       With Clang, nvprof shows only the run only uses the default stream.
>       Is there a way to do that with Clang?
>       On the other hand,
>       nvcc has option --*default*-*stream per*-*thread*
>       I'm not familar with clang CUDA, is there a similar option?
>       Best,
>       Ye
>       ===================
>       Ye Luo, Ph.D.
>       Computational Science Division & Leadership Computing Facility
>       Argonne National Laboratory
>    _______________________________________________
>    Openmp-dev mailing list
>    *Openmp-dev at lists.llvm.org* <Openmp-dev at lists.llvm.org>
>    *https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev*
>    <https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/openmp-dev/attachments/20190320/960e8f72/attachment.html>

More information about the Openmp-dev mailing list