[Openmp-dev] OpenMP offload implicitly using streams

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Wed Mar 20 12:28:39 PDT 2019


Thank you, Alex, for the explanation. It is a very nice read for understanding
how fully asynchronous offload would work. I believe this is the eventual goal,
but it will take time to get it in.

At the moment, I'm looking for a way that needs only blocking offload
but still gets concurrent execution.
My code pattern is:
#pragma omp parallel
{ // parallel is at a very high level of the code hierarchy and
  // contains loops over target regions
  #pragma omp target
  { // target is very local
    // offload computation
  }
}
Using 1 stream per host thread easily achieves concurrent execution.

I noticed some presentations showing a pipelining offload example using
nowait, e.g. slide 30 of
http://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf
If we enable one stream per thread, the pipelining can work as well, even
with blocking offload:
#pragma omp parallel for
for(int i=0; i<nblocks; i++)
{
  #pragma omp target
  {
    //offload the computation of one block
  }
}
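
For my blocked case, a fuller sketch of that pattern could look like the
following (NBLOCKS, BS, and the doubling loop are made up purely for
illustration); with one stream per host thread, the blocking target regions
issued by different threads can overlap their transfers and kernels:

#include <stdlib.h>

#define NBLOCKS 8
#define BS (1 << 20)

int main(void)
{
  double *x = (double *)malloc(sizeof(double) * NBLOCKS * BS);
  for (int i = 0; i < NBLOCKS * BS; i++)
    x[i] = (double)i;

  #pragma omp parallel for
  for (int i = 0; i < NBLOCKS; i++)
  {
    double *blk = x + (long)i * BS;
    // one blocking target per host thread, mapping only its own block
    #pragma omp target teams distribute parallel for map(tofrom: blk[0:BS])
    for (int j = 0; j < BS; j++)
      blk[j] *= 2.0;
  }

  free(x);
  return 0;
}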

So I'm thinking of turning on one stream per thread to gain this
concurrency, and I'm not aware of any negative side effects.

Best,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


Alexandre Eichenberger <alexe at us.ibm.com> wrote on Wed, Mar 20, 2019 at 1:26 PM:

> Hal,
>
> Supporting async for targets is not trivial. Right now, since everything
> is "blocking", a target can easily be separated into 5 tasks:
>
>
>    1. wait for dependences to resolve
>    2. perform all the copies to the device
>    3. execute the target
>    4. perform all the copies from the device to the host
>    5. resolve all dependences for other tasks
>
> Going async à la LOMP has many advantages: all operations are asynchronous,
> and dependences between target tasks are enforced directly on the
> device. To us, this was the only way that users could effectively hide the
> high overhead of generating targets: by enqueuing many dependent target
> tasks on the host, they prime the pipe of targets on the devices.
>
> To do so, I believe you have to do the following.
>
>    1. wait for all the host dependences; dependences from other device
>    targets are tabulated but not waited for
>    2. select a stream (or reuse one stream from a dependent target, if any)
>    and enqueue waits on the events of all tabulated dependences
>    3. enqueue all copies to the device on the stream (or enqueue a sync
>    event for data currently being copied over by other targets)
>    4. enqueue the computation on the stream
>    5. enqueue all copies from the device on the stream (this is speculative,
>    as the ref count may be increased by another target executed before the
>    data is actually copied back, but it's legal)
>    6. cleanup
>       1. blocking: wait for the stream to finish
>       2. non-blocking: have a callback from CUDA (which involves a
>       separate thread), or have OpenMP threads actively poll when doing
>       nothing and/or before doing a subsequent target task, to determine
>       when the stream is finished
>       3. when 6.1 or 6.2 above has finished, clean up the map data
>       structures and resolve dependences for dependent tasks.
>
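> A rough sketch of how Steps 2-5 above map onto CUDA stream calls (using
> the runtime API, with made-up names like dep_event, h_in/h_out, d_in/d_out,
> and my_kernel purely for illustration; the actual LOMP/libomptarget code is
> driver-API based and does far more bookkeeping):
>
> #include <cuda_runtime.h>
>
> __global__ void my_kernel(const float *in, float *out, int n) {
>   int i = blockIdx.x * blockDim.x + threadIdx.x;
>   if (i < n) out[i] = 2.0f * in[i];
> }
>
> void run_one_target(cudaStream_t s, cudaEvent_t dep_event,
>                     const float *h_in, float *h_out,
>                     float *d_in, float *d_out, int n) {
>   size_t bytes = n * sizeof(float);
>   // Step 2: make this stream wait on an event recorded by a dependent target
>   cudaStreamWaitEvent(s, dep_event, 0);
>   // Step 3: asynchronous copy to the device (h_in must be pinned memory)
>   cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s);
>   // Step 4: enqueue the computation on the same stream
>   my_kernel<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);
>   // Step 5: asynchronous copy back (speculative in the real runtime)
>   cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s);
>   // Step 6.1 (blocking): wait for the stream; a non-blocking 6.2 would
>   // record an event here and poll it with cudaEventQuery() instead
>   cudaStreamSynchronize(s);
> }
>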
> This is compounded by the fact that async data movements are only
> performed with pinned memory, and CUDA memory cannot be allocated on the
> fly because allocation is a synchronizing event. So the runtime must handle
> its own pools of device and pinned memory, which requires additional work
> in Steps 3, 5, and 6.3 above.
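>
> A minimal sketch of that constraint (hypothetical pool_init / pool_buffers
> helpers, not the actual runtime code): the synchronizing allocations are
> paid once up front, and the critical path only reuses the pre-allocated
> pinned and device buffers.
>
> #include <cuda_runtime.h>
> #include <stddef.h>
>
> static void *g_pinned = NULL;   // pinned host staging buffer
> static void *g_device = NULL;   // device buffer
>
> void pool_init(size_t bytes) {
>   cudaMallocHost(&g_pinned, bytes);  // synchronizing, done once at startup
>   cudaMalloc(&g_device, bytes);      // likewise
> }
>
> // critical path: no allocation, only reuse of the pooled buffers
> void pool_buffers(void **pinned, void **device) {
>   *pinned = g_pinned;
>   *device = g_device;
> }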
>
> To perform the cleanup in Step 6, you also need to cache all info
> associated with a target in a dedicated data structure.
>
> As you may have noticed, if you want some async to work, you basically
> have to treat all targets as async; the synchronous ones differ only
> by having an explicit wait in Step 6.1. So all this handling is on the
> critical path.
>
> You will also need to carefully manage the CUDA events associated with any
> explicit data movements, as subsequent target operations may depend
> on an actual memory operation completing (in either direction).
>
> This has been done in LOMP. Was it fun? Maybe not, but it's all feasible.
>
> There is a possible saving grace, namely that you could implement async
> only under unified memory, which would greatly simplify the whole thing:
> it eliminates Steps 3 & 5 above and the associated bookkeeping.
>
> However, most application writers who have optimized their code will tell
> you that unified-only programs tend not to work too well, and that hybrid
> models (copy predictable data, use unified memory for unstructured data) are
> likely to deliver better performance. So you could simplify your
> implementation at the cost of precluding async for the most optimized
> programs.
>
> Happy to discuss it further, and explore with you alternative
> implementations.
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Gheorghe-Teod Bercea/US/IBM
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexey Bataev <a.bataev at hotmail.com>, "Doerfert, Johannes" <
> jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>,
> Ye Luo <xw111luoye at gmail.com>, Alexandre Eichenberger/Watson/IBM at IBMUS
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 1:49 PM
>
> I'm adding Alex to this thread. He should be able to shed some light on
> this issue.
>
> Thanks,
>
> --Doru
>
>
>
> From: "Finkel, Hal J." <hfinkel at anl.gov>
> To: Ye Luo <xw111luoye at gmail.com>
> Cc: "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>, Alexey
> Bataev <a.bataev at hotmail.com>, Gheorghe-Teod Bercea <
> gheorghe-teod.bercea at ibm.com>, "Doerfert, Johannes" <jdoerfert at anl.gov>
> Date: 03/20/2019 01:13 PM
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> ------------------------------
>
>
>
> Thanks, Ye. I suppose that I thought it always worked that way :-)
>
> Alexey, Doru, do you know if there's any semantic problem or other
> concerns with enabling this option and/or making it the default?
>
>  -Hal
>
> On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:
>
>    - Hi all,
>    After going through the source, I didn't find CUDA stream support.
>    Luckily, I only need to add
>    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
>    before
>    #include <cuda.h>
>    in libomptarget/plugins/cuda/src/rtl.cpp
>    Then multiple target regions go to different streams and may execute
>    concurrently.
>    #pragma omp parallel
>    {
>      #pragma omp target
>      {
>        //offload computation
>      }
>    }
>    This is exactly what I want.
>
>    I know the XL compiler uses streams in a different way but achieves
>    similar effects.
>    Is anyone working on using streams with OpenMP target in LLVM?
>    Will clang-ykt get something similar to XL and upstream it to the
>    mainline?
>
>    If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the CUDA
>    rtl, will it cause any trouble?
>    As a compiler user, I'd like to have a better solution rather than
>    carrying a patch just for myself.
>
>    Best,
>    Ye
>    ===================
>    Ye Luo, Ph.D.
>    Computational Science Division & Leadership Computing Facility
>    Argonne National Laboratory
>
>
>    Ye Luo <xw111luoye at gmail.com> wrote on Sun, Mar 17, 2019 at 2:26 PM:
>       - Hi,
>       How do I turn on streams when using OpenMP offload?
>       When different host threads individually start target regions (even
>       without using nowait), the offloaded computation goes to different CUDA
>       streams and may execute concurrently. This is currently available in XL.
>       With Clang, nvprof shows that the run only uses the default stream.
>       Is there a way to do that with Clang?
>       On the other hand,
>       nvcc has the option --default-stream per-thread.
>       I'm not familiar with clang CUDA; is there a similar option?
>       Best,
>       Ye
>       ===================
>       Ye Luo, Ph.D.
>       Computational Science Division & Leadership Computing Facility
>       Argonne National Laboratory
>
>    _______________________________________________
>    Openmp-dev mailing list
>    Openmp-dev at lists.llvm.org
>    https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>

