[Openmp-dev] OpenMP offload implicitly using streams

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Wed Mar 20 17:14:06 PDT 2019


Hi Alex,
Could you explain why step 2.3 is inside the lock region? I feel that
maintaining memory consistency is the responsibility of the code developers
anyway.
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


Alexandre Eichenberger <alexe at us.ibm.com> wrote on Wed, Mar 20, 2019 at 6:37 PM:

> Hal,
>
> I don't have results from our initial experiments; in general, I
> implemented the scheme with the most asynchrony. You could probably build a
> CUDA example of the scheme you are looking into and compare it with a similar
> scheme using LOMP, or with direct CUDA calls.
>
>
>
> Ye, Hal,
>
> It is true that you have one pipe in each direction, but serializing on the
> data transfer is one thing; serializing on a lock is another. When mapping,
> the pattern in the LLVM runtime is probably like this (on entry to a
> target region):
>
>
>    1. get map lock
>    2. for each map
>       1. test if already there, if not
>       2. alloc
>       3. cuda mem copy if needed by map operation
>    3. release map lock
>
> And under the covers, the synchronous memory operation in step 2.3 will also
> allocate a pinned memory location, copy the data into the pinned memory, and
> then transfer it over to the device. So what you are describing as serialized
> by the hardware is only one part of step 2.3... all the other steps are still
> serialized by the external lock.
>
> And most importantly, you have to ensure that the alloc does not use
> cudaMalloc (or the equivalent pinned-memory operations), because that would
> prevent any asynchronous behavior:
>
>
> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#implicit-synchronization
>
> Along with using distinct streams for each thread, that is probably the
> most important change to make.
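>
> As a concrete (hypothetical) sketch of the staging pattern described above,
> assuming the pinned staging buffer and the device buffer come from
> pre-allocated pools, so that no cudaMalloc/cudaMallocHost appears on the
> critical path:
>
> #include <cuda_runtime.h>
> #include <cstring>
>
> // Async-friendly host-to-device copy on a per-thread stream.
> void async_h2d(const void *host_src, void *dev_dst, size_t bytes,
>                void *pinned_staging, cudaStream_t stream) {
>   // Stage through pinned memory so the transfer can be truly asynchronous.
>   std::memcpy(pinned_staging, host_src, bytes);
>   // Enqueue the transfer on this thread's stream; this call returns
>   // immediately instead of blocking the way cudaMemcpy would.
>   cudaMemcpyAsync(dev_dst, pinned_staging, bytes,
>                   cudaMemcpyHostToDevice, stream);
> }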
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Ye Luo <xw111luoye at gmail.com>
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexandre Eichenberger <alexe at us.ibm.com>, "a.bataev at hotmail.com" <
> a.bataev at hotmail.com>, "Kevin K O'Brien" <caomhin at us.ibm.com>,
> Gheorghe-Teod Bercea <Gheorghe-Teod.Bercea at ibm.com>, "Doerfert, Johannes"
> <jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <
> openmp-dev at lists.llvm.org>
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 5:55 PM
>
> Hal,
> There is no concurrent data transfer because the hardware has only one
> pipe in each direction, D2H and H2D. That is the bottleneck anyway; I don't
> think we can work around it, and this is expected.
> I do want data transfers overlapping with computation from different
> host threads (streams), and also computations overlapping with each other.
> Here is what I got from nvprof. Both kinds of overlap are happening.
> Ye
> [image: Screenshot from 2019-03-20 16:48:31.png (nvprof timeline; archived as
> http://lists.llvm.org/pipermail/openmp-dev/attachments/20190320/a90a1c74/attachment-0001.png)]
>
>
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> Finkel, Hal J. <hfinkel at anl.gov> wrote on Wed, Mar 20, 2019 at 4:37 PM:
>
>
> On 3/20/19 2:55 PM, Alexandre Eichenberger wrote:
>
> Ye, Hal,
>
> Just using a distinct stream per thread does not necessarily give you
> concurrency for transfers. E.g., is the runtime holding a single map lock
> during the entire mapping (including the data transfer)? I don't recall
> precisely what LLVM does, but I think that was the case. If so, you don't
> have any asynchrony during transfers, but you have a simple implementation.
> You could relax this by having one lock per mapped memory location, so that
> a second target mapping the same data would wait until the data is moved. Or
> you could use a CUDA event per data transfer, and have a second thread's
> target computation wait for that event.
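>
> A minimal sketch of the event-per-transfer variant, with illustrative names
> (producer/consumer are not LLVM runtime functions):
>
> #include <cuda_runtime.h>
>
> // Thread A: enqueue the transfer, then record an event that marks its
> // completion on thread A's stream.
> void producer(void *dev, const void *pinned, size_t bytes,
>               cudaStream_t streamA, cudaEvent_t xfer_done) {
>   cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, streamA);
>   cudaEventRecord(xfer_done, streamA);
> }
>
> // Thread B: make its own stream wait on the event. This is a device-side
> // wait, so thread B's host code does not block.
> void consumer(cudaStream_t streamB, cudaEvent_t xfer_done) {
>   cudaStreamWaitEvent(streamB, xfer_done, 0);
>   // Kernels enqueued on streamB after this point will not start until
>   // the transfer recorded in xfer_done has completed.
> }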
>
>
>
> Ye, do you know if the benefit you're seeing is from concurrency in data
> transfers, or in compute, or both?
>
>
>
>
> Also, early reports from our users were that multiple threads making CUDA
> calls performed badly because the driver's locking scheme was poor.
> Hopefully this has been addressed since. I would run a simple experiment to
> test this before recommending that your users go that way.
>
>
>
> My understanding from talking to folks at NVIDIA is that this has been an
> area they've actively worked on improving. It may be better, although
> testing is certainly required.
>
> Thanks again,
>
> Hal
>
>
>
>
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Ye Luo <xw111luoye at gmail.com>
> To: Alexandre Eichenberger <alexe at us.ibm.com>
> Cc: hfinkel at anl.gov, a.bataev at hotmail.com, jdoerfert at anl.gov,
> openmp-dev at lists.llvm.org, Gheorghe-Teod Bercea
> <Gheorghe-Teod.Bercea at ibm.com>, "Kevin K O'Brien" <caomhin at us.ibm.com>
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 3:28 PM
>
> Thank you Alex for the explanation. It is a very nice read for understanding
> how fully asynchronous offload works. I believe this is the eventual goal,
> but it will take time to get it in.
>
> At the moment, I'm looking for a way that needs only blocking offload but
> still gets concurrent execution.
> My code pattern is
> #pragma omp parallel
> { // parallel is at the very high level of the code hierarchy and contains
> loops over target regions
>   #pragma omp target
>   { // target is very local
>     //offload computation
>   }
> }
> Using 1 stream per host thread easily achieves concurrent execution.
>
> I noticed some presentations showing a pipelining offload example using
> nowait
>
> http://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf
> slide 30.
> If we enable one stream per thread, the pipelining can work as well, even
> with blocking offload.
> #pragma omp parallel for
> for(int i=0; i<nblocks; i++)
> {
>   #pragma omp target
>   {
>     //offload the computation of one block
>   }
> }
>
>
> So I'm thinking of turning on one stream per thread to gain something, but
> I'm not aware of any downside.
>
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> Alexandre Eichenberger <alexe at us.ibm.com> wrote on Wed, Mar 20, 2019 at 1:26 PM:
>
> Hal,
>
> Supporting async for targets is not trivial. Right now, since everything
> is "blocking", a target can easily be separated into 5 tasks (a CUDA sketch
> of steps 2-4 follows the list):
>
>
>    1. wait for dependences to resolve
>    2. perform all the copies to the device
>    3. execute the target
>    4. perform all the copies from the device to the host
>    5. resolve all dependences for other tasks
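>
> In CUDA terms, steps 2-4 reduce to something like the following blocking
> sequence (target_kernel is a hypothetical stand-in for the outlined target
> region):
>
> #include <cuda_runtime.h>
>
> extern __global__ void target_kernel(double *d, int n);
>
> void blocking_target(double *h, double *d, int n) {
>   cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);  // step 2
>   target_kernel<<<(n + 255) / 256, 256>>>(d, n);                 // step 3
>   cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);  // step 4
>   // The D2H copy on the legacy default stream waits for the kernel, so
>   // everything is complete when this function returns.
> }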
>
> Going async à la LOMP has many advantages: all operations are asynchronous,
> and dependences from target task to target task are enforced directly on the
> device. To us, this was the only way that users could effectively hide the
> high overhead of generating targets: by enqueuing many dependent target
> tasks on the host, to prime the pipe of targets on the devices.
>
> To do so, I believe you have to do the following (a CUDA sketch follows
> the list).
>
>    1. wait for all the host dependences; dependences from other device
>    targets are tabulated but not waited for
>    2. select a stream (or one stream from a dependent target, if any)
>    and enqueue an event wait for all tabulated dependences
>    3. enqueue all copies to the device on the stream (or enqueue a sync
>    event for data currently being copied over by other targets)
>    4. enqueue the computation on the stream
>    5. enqueue all copies from the device on the stream (this is speculative,
>    as the ref count may be increased by another target executed before the
>    data is actually copied back, but it's legal)
>    6. cleanup
>       1. blocking: wait for the stream to finish
>       2. non-blocking: have a callback from CUDA (which involves a
>       separate thread), or have active polling by OpenMP threads when doing
>       nothing and/or before doing a subsequent target task, to determine
>       when the stream is finished
>       3. when 6.1 or 6.2 above has finished, clean up the map data
>       structures and resolve dependences for dependent tasks.
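>
> A sketch of what steps 2-6 might look like in CUDA terms; names are
> illustrative, and the real bookkeeping (ref counts, map tables) is omitted:
>
> #include <cuda_runtime.h>
>
> extern __global__ void target_kernel(double *d, int n);
>
> void enqueue_target(double *h_pinned, double *d, int n,
>                     cudaStream_t stream,
>                     cudaEvent_t *deps, int ndeps, cudaEvent_t done) {
>   for (int i = 0; i < ndeps; ++i)                   // step 2: device-side
>     cudaStreamWaitEvent(stream, deps[i], 0);        // waits on tabulated deps
>   cudaMemcpyAsync(d, h_pinned, n * sizeof(double),
>                   cudaMemcpyHostToDevice, stream);  // step 3: copy-in
>   target_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);  // step 4
>   cudaMemcpyAsync(h_pinned, d, n * sizeof(double),
>                   cudaMemcpyDeviceToHost, stream);  // step 5: copy-out
>   cudaEventRecord(done, stream);                    // marks the whole chain
> }
>
> // Step 6.1 (blocking):  cudaEventSynchronize(done);
> // Step 6.2 (polling):   if (cudaEventQuery(done) == cudaSuccess) /* cleanup */;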
>
> This is compounded by the fact that async data movements are performed only
> with pinned memory, and CUDA memory cannot be allocated on the fly, as
> allocation is a synchronizing event. So the runtime must manage its own
> pools of device and pinned memory, which requires additional work in Steps
> 3, 5, and 6.3 above (a pool sketch follows).
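>
> A trivial sketch of such a pool, assuming fixed-size blocks (a real runtime
> would need size classes, growth, and error handling):
>
> #include <cuda_runtime.h>
> #include <mutex>
> #include <vector>
>
> class BlockPool {
>   std::vector<void *> free_;
>   std::mutex m_;
> public:
>   // All cudaMalloc/cudaMallocHost calls happen up front, so none of the
>   // implicitly synchronizing allocations appears on the enqueue path.
>   BlockPool(size_t block_bytes, int nblocks, bool pinned) {
>     for (int i = 0; i < nblocks; ++i) {
>       void *p = nullptr;
>       if (pinned) cudaMallocHost(&p, block_bytes);  // pinned host block
>       else        cudaMalloc(&p, block_bytes);      // device block
>       free_.push_back(p);
>     }
>   }
>   void *get() {  // O(1), no CUDA call, no implicit synchronization
>     std::lock_guard<std::mutex> g(m_);
>     if (free_.empty()) return nullptr;  // caller must handle exhaustion
>     void *p = free_.back();
>     free_.pop_back();
>     return p;
>   }
>   void put(void *p) {
>     std::lock_guard<std::mutex> g(m_);
>     free_.push_back(p);
>   }
> };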
>
> To perform the cleanup in Step 6, you also need to cache all info
> associated with a target in a dedicated data structure.
>
> As you may have noticed, if you want any async to work, you basically have
> to treat every target as async; the synchronous ones differ only by having
> an explicit wait in Step 6.1. So all of this handling is on the critical
> path.
>
> You will also need to carefully manage the CUDA events associated with any
> explicit data movements, as subsequent target operations may depend on an
> actual memory operation completing (in either direction).
>
> This has all been done in LOMP. Was it fun? Maybe not, but it's all feasible.
>
> There is a possible saving grace, namely that you could implement async
> only under unified memory, which would greatly simplify the whole thing:
> it eliminates Steps 3 & 5 above and the associated bookkeeping.
>
> However, most application writers who have optimized their code will tell
> you that unified-only programs tend not to work too well, and that hybrid
> models (copy predictable data, use unified for unstructured data) are likely
> to deliver better performance. So you could simplify your implementation at
> the cost of precluding async for the most optimized programs.
>
> Happy to discuss it further, and explore with you alternative
> implementations.
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Gheorghe-Teod Bercea/US/IBM
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexey Bataev <a.bataev at hotmail.com>, "Doerfert, Johannes" <
> jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>,
> Ye Luo <xw111luoye at gmail.com>, Alexandre Eichenberger/Watson/IBM at IBMUS
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 1:49 PM
>
> I'm adding Alex to this thread. He should be able to shed some light on
> this issue.
>
> Thanks,
>
> --Doru
>
>
> "Finkel, Hal J." ---03/20/2019 01:13:33 PM---Thanks, Ye. I suppose that I
> thought it always worked that way :-) Alexey, Doru, do you know if ther
>
> From: "Finkel, Hal J." <hfinkel at anl.gov>
> To: Ye Luo <xw111luoye at gmail.com>
> Cc: "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>, Alexey
> Bataev <a.bataev at hotmail.com>, Gheorghe-Teod Bercea <
> gheorghe-teod.bercea at ibm.com>, "Doerfert, Johannes" <jdoerfert at anl.gov>
> Date: 03/20/2019 01:13 PM
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> ------------------------------
>
>
>
> Thanks, Ye. I suppose that I thought it always worked that way :-)
>
> Alexey, Doru, do you know if there's any semantic problem or other
> concerns with enabling this option and/or making it the default?
>
>  -Hal
>
> On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:
>
>    Hi all,
>    After going through the source, I didn't find CUDA stream support.
>    Luckily, I only need to add
>    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
>    before
>    #include <cuda.h>
>    in libomptarget/plugins/cuda/src/rtl.cpp
>    Then multiple target regions go to different streams and may execute
>    concurrently.
>    #pragma omp parallel
>    {
>      #pragma omp target
>      {
>        //offload computation
>      }
>    }
>    This is exactly what I want.
>
>    I know the XL compiler uses streams in a different way but achieves
>    similar effects.
>    Is there anyone working on using streams with OpenMP target in LLVM?
>    Will clang-ykt get something similar to XL and upstream it to the
>    mainline?
>
>    If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the CUDA
>    rtl, will it cause any trouble?
>    As a compiler user, I'd like to have a better solution rather than
>    having a patch just for myself.
>
>    Best,
>    Ye
>    ===================
>    Ye Luo, Ph.D.
>    Computational Science Division & Leadership Computing Facility
>    Argonne National Laboratory
>
>
>    Ye Luo <xw111luoye at gmail.com> wrote on Sun, Mar 17, 2019 at
>    2:26 PM:
>       Hi,
>       How do I turn on streams when using OpenMP offload?
>       When different host threads individually start target regions (even
>       without using nowait), the offloaded computation goes to different
>       CUDA streams and may execute concurrently. This is currently available
>       in XL.
>       With Clang, nvprof shows that the run uses only the default stream.
>       Is there a way to do that with Clang?
>       On the other hand, nvcc has the option --default-stream per-thread.
>       I'm not familiar with Clang CUDA; is there a similar option?
>       Best,
>       Ye
>       ===================
>       Ye Luo, Ph.D.
>       Computational Science Division & Leadership Computing Facility
>       Argonne National Laboratory
>
>    _______________________________________________
>    Openmp-dev mailing list
>    Openmp-dev at lists.llvm.org
>    https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>