[Openmp-dev] OpenMP offload implicitly using streams
Ye Luo via Openmp-dev
openmp-dev at lists.llvm.org
Wed Mar 20 14:55:39 PDT 2019
Hal,
There is no concurrent data transfer because the hardware only has one pipe
for each direction, D2H and H2D. That is the bottleneck anyway; I don't think
we can work around it, and this is expected.
I do want data transfers overlapping with computation from different host
threads (streams), and also computations overlapping with each other.
Here is what I got from nvprof. Both kinds of overlap are happening.
Ye
[image: Screenshot from 2019-03-20 16:48:31.png]
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
Finkel, Hal J. <hfinkel at anl.gov> wrote on Wed, Mar 20, 2019 at 4:37 PM:
>
> On 3/20/19 2:55 PM, Alexandre Eichenberger wrote:
>
> Ye, Hal,
>
> Just using a distinct stream per thread does not necessarily give you
> concurrency for transfers. E.g., is the runtime holding a single map lock
> during the entire mapping (including the data transfer)? I don't recall
> precisely what LLVM does, but I think that was the case. If so, you don't
> have any asynchrony during transfers, but you have a simple implementation.
> You could relax this by having one lock per mapped memory location, so that
> a second target mapping the same data would wait until the data is moved.
> Or you can use CUDA events per data transfer and have a second thread's
> target computation wait for that event.
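>
> A minimal sketch of the "CUDA event per data transfer" idea in driver-API
> terms (illustrative only; the function names and arguments are hypothetical,
> and CopyDone is assumed to have been created with cuEventCreate):
>
> #include <cstddef>
> #include <cuda.h>
>
> // Thread A: issue the H2D copy on its own stream and mark its completion.
> void producer_copy(CUdeviceptr DevPtr, const void *HostPtr, size_t Bytes,
>                    CUstream StreamA, CUevent CopyDone) {
>   cuMemcpyHtoDAsync(DevPtr, HostPtr, Bytes, StreamA);
>   cuEventRecord(CopyDone, StreamA);   // "the data is now on the device"
> }
>
> // Thread B: wait only for that transfer, not for everything on stream A.
> void consumer_compute(CUstream StreamB, CUevent CopyDone,
>                       CUfunction Kernel, void **KernelArgs) {
>   cuStreamWaitEvent(StreamB, CopyDone, 0);
>   cuLaunchKernel(Kernel, 1, 1, 1, 128, 1, 1, 0, StreamB, KernelArgs, nullptr);
> }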
>
>
> Ye, do you know if the benefit you're seeing is from concurrency in data
> transfers, or in compute, or both?
>
>
>
> Also, early reports from our users were that multiple threads making CUDA
> calls performed badly because the driver's locking scheme was poor.
> Hopefully, this has been addressed since. I would conduct a simple
> experiment to test this before recommending that your users go that way.
>
>
> My understanding from talking to folks at NVIDIA is that this has been an
> area they've actively worked on improving. It may be better, although
> testing is certainly required.
>
> Thanks again,
>
> Hal
>
>
>
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Ye Luo <xw111luoye at gmail.com>
> To: Alexandre Eichenberger <alexe at us.ibm.com>
> Cc: hfinkel at anl.gov, a.bataev at hotmail.com, jdoerfert at anl.gov,
> openmp-dev at lists.llvm.org, Gheorghe-Teod Bercea
> <Gheorghe-Teod.Bercea at ibm.com>, "Kevin K O'Brien" <caomhin at us.ibm.com>
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 3:28 PM
>
> Thank you, Alex, for the explanation. It is a very nice read for
> understanding how fully asynchronous offload would work. I believe this is
> the eventual goal, but it will take time to get in.
>
> At the moment, I'm looking for a way that needs only blocking offload but
> still gets concurrent execution.
> My code pattern is:
> #pragma omp parallel
> { // parallel is at a very high level of the code hierarchy and contains
>   // loops over target regions
>   #pragma omp target
>   { // target is very local
>     // offload computation
>   }
> }
> Using 1 stream per host thread easily achieves concurrent execution.
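>
> For reference, a minimal, self-contained variant of this pattern (purely
> illustrative; the sizes and the kernel body are made up):
>
> #include <cstdio>
> #include <vector>
>
> int main() {
>   const int nthreads = 4, n = 1 << 20, iters = 100;
>   std::vector<std::vector<double>> blocks(nthreads,
>                                           std::vector<double>(n, 1.0));
>   // parallel at a high level; each host thread repeatedly offloads targets
>   #pragma omp parallel for num_threads(nthreads)
>   for (int t = 0; t < nthreads; ++t) {
>     double *a = blocks[t].data();
>     for (int it = 0; it < iters; ++it) {
>       // blocking offload; device-side concurrency comes from each host
>       // thread issuing its target regions on its own stream
>       #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
>       for (int i = 0; i < n; ++i)
>         a[i] = a[i] * 1.000001 + 1.0;
>     }
>   }
>   std::printf("blocks[0][0] = %f\n", blocks[0][0]);
>   return 0;
> }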
>
> I noticed some presentations showing a pipelining offload example using
> nowait
>
> http://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf
> slide 30.
> If we enable one stream per thread, the pipelining can work as well, even
> with blocking offload.
> #pragma omp parallel for
> for (int i = 0; i < nblocks; i++)
> {
>   #pragma omp target
>   {
>     // offload the computation of one block
>   }
> }
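>
> For comparison, a rough sketch of the nowait-style pipelining shown in those
> slides (not the slides' exact code; the names are illustrative): a single
> thread enqueues one asynchronous target task per block and waits for all of
> them at the end.
>
> void pipeline(double *data, int nblocks, int block_size) {
>   for (int i = 0; i < nblocks; i++) {
>     double *blk = data + (long)i * block_size;
>     // each block becomes a deferred target task
>     #pragma omp target teams distribute parallel for nowait \
>         map(tofrom: blk[0:block_size])
>     for (int j = 0; j < block_size; j++)
>       blk[j] += 1.0;
>   }
>   #pragma omp taskwait   // wait for all the deferred target tasks
> }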
>
>
> So I'm thinking of turning on one stream per thread to gain something,
> and I'm not aware of any negative side effects.
>
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> Alexandre Eichenberger <alexe at us.ibm.com> wrote on Wed, Mar 20, 2019 at 1:26 PM:
>
> Hal,
>
> Supporting async for targets is not trivial. Right now, since everything
> is "blocking", a target can easily be separated into five tasks (a rough
> sketch follows the list):
>
>
> 1. wait for dependences to resolve
> 2. perform all the copies to the device
> 3. execute the target
> 4. perform all the copies from the device to the host
> 5. resolve all dependences for other tasks
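>
> A minimal sketch of this blocking sequence in CUDA driver-API terms
> (illustrative only, not the actual libomptarget code; the function name and
> arguments are hypothetical):
>
> #include <cuda.h>
>
> // Every step completes before the next one starts, so the host thread that
> // encountered the target region is blocked throughout.
> void blocking_target(CUdeviceptr DevBuf, void *HostBuf, size_t Bytes,
>                      CUfunction Kernel, void **KernelArgs) {
>   // 1. host dependences were already resolved by the OpenMP task machinery
>   cuMemcpyHtoD(DevBuf, HostBuf, Bytes);            // 2. copy to the device
>   cuLaunchKernel(Kernel, 1, 1, 1, 128, 1, 1, 0,    // 3. execute the target
>                  /*stream=*/nullptr, KernelArgs, nullptr);
>   cuCtxSynchronize();                              //    wait for the kernel
>   cuMemcpyDtoH(HostBuf, DevBuf, Bytes);            // 4. copy back to host
>   // 5. dependences for other tasks are released after this returns
> }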
>
> Going async à la LOMP has many advantages: all operations are
> asynchronous, and dependences from target to target tasks are directly
> enforced on the device. To us, this was the only way that users could
> effectively hide the high overhead of generating targets: by enqueuing many
> dependent target tasks on the host, they prime the pipe of targets on the
> device.
>
> To do so, I believe you have to do the following (a driver-API sketch of
> these steps follows the list):
>
> 1. wait for all the host dependences; dependences from other device
> targets are tabulated but not waited for
> 2. select a stream (or reuse one from a dependent target, if any) and
> enqueue a wait-for-event for all tabulated dependences
> 3. enqueue all copies to the device on the stream (or enqueue a sync event
> for data currently being copied over by other targets)
> 4. enqueue the computation on the stream
> 5. enqueue all copies from the device on the stream (this is speculative,
> as the ref count may be increased by another target executed before the
> data is actually copied back, but it's legal)
> 6. cleanup
>    1. blocking: wait for the stream to finish
>    2. non-blocking: have a callback from CUDA (which involves a separate
>    thread) or have active polling by OpenMP threads when they are idle
>    and/or before starting a subsequent target task, to determine when the
>    stream is finished
>    3. when 6.1 or 6.2 above is finished, clean up the map data structures
>    and resolve dependences for dependent tasks
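>
> A rough sketch of this asynchronous sequence in CUDA driver-API terms
> (illustrative only; the real LOMP and libomptarget code differ, and the
> names below are hypothetical). Step 1, waiting for host dependences, is
> assumed to have happened already, and PinnedHostBuf must be pinned memory
> for the copies to be truly asynchronous.
>
> #include <cstddef>
> #include <cuda.h>
>
> struct TargetRecord {          // bookkeeping needed for the cleanup in step 6
>   CUstream Stream;
>   CUevent Done;
>   // ... map entries to release, dependent tasks to resolve, etc.
> };
>
> void async_target(CUstream Stream, CUevent *DepEvents, int NumDeps,
>                   CUdeviceptr DevBuf, void *PinnedHostBuf, size_t Bytes,
>                   CUfunction Kernel, void **KernelArgs, TargetRecord *Rec) {
>   // 2. make the stream wait for all tabulated device-side dependences
>   for (int i = 0; i < NumDeps; ++i)
>     cuStreamWaitEvent(Stream, DepEvents[i], 0);
>   // 3. enqueue the copies to the device
>   cuMemcpyHtoDAsync(DevBuf, PinnedHostBuf, Bytes, Stream);
>   // 4. enqueue the computation
>   cuLaunchKernel(Kernel, 1, 1, 1, 128, 1, 1, 0, Stream, KernelArgs, nullptr);
>   // 5. enqueue the (speculative) copy back to the host
>   cuMemcpyDtoHAsync(PinnedHostBuf, DevBuf, Bytes, Stream);
>   // record an event so step 6 can poll it and later targets can wait on it
>   cuEventRecord(Rec->Done, Stream);
>   Rec->Stream = Stream;
> }
>
> // 6.1 blocking variant: wait for the stream, then clean up.
> void finish_blocking(TargetRecord *Rec) {
>   cuStreamSynchronize(Rec->Stream);
>   // 6.3 release map entries and resolve dependences of dependent tasks
> }
>
> // 6.2 non-blocking variant: poll from an otherwise idle OpenMP thread.
> bool try_finish(TargetRecord *Rec) {
>   if (cuEventQuery(Rec->Done) != CUDA_SUCCESS)
>     return false;              // not finished yet; come back later
>   // 6.3 same cleanup as above
>   return true;
> }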
>
> This is compounded by the fact that async data movements are only
> performed with pinned host memory, and CUDA memory cannot be allocated
> on the fly because allocation is a synchronizing event. So the runtime must
> handle its own pools of device and pinned memory, which requires additional
> work in Steps 3, 5, and 6.3 above.
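>
> As a concrete illustration of that pooling, a tiny sketch under the
> assumption of fixed-size buffers (not the actual runtime code, and not
> thread-safe as written):
>
> #include <cstddef>
> #include <cuda.h>
> #include <vector>
>
> class PinnedPool {
>   std::vector<void *> Free;    // idle pinned buffers, all of size BufBytes
>   size_t BufBytes;
> public:
>   explicit PinnedPool(size_t Bytes) : BufBytes(Bytes) {}
>   void *acquire() {
>     if (!Free.empty()) {       // fast path: reuse, no synchronizing CUDA call
>       void *P = Free.back();
>       Free.pop_back();
>       return P;
>     }
>     void *P = nullptr;         // slow path: a real, synchronizing allocation
>     cuMemHostAlloc(&P, BufBytes, CU_MEMHOSTALLOC_PORTABLE);
>     return P;
>   }
>   void release(void *P) { Free.push_back(P); }  // keep the buffer for reuse
> };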
>
> To perform the cleanup in Step 6, you also need to cache all info
> associated with a target in a dedicated data structure.
>
> As you may have noticed, if you want some async to work, you basically
> have to treat all targets as async; the synchronous ones differ only
> by having an explicit wait in Step 6.1. So all this handling is on the
> critical path.
>
> You will also need to carefully manage the CUDA events associated with any
> explicit data movements, as subsequent target operations may depend
> on an actual memory operation completing (in either direction).
>
> This has been done in LOMP. Was it fun? Maybe not, but it's all feasible.
>
> There is a possible saving grace, namely that you could implement async
> only under unified memory, which would greatly simplify the whole thing:
> it eliminates Steps 3 and 5 above and the associated bookkeeping.
>
> However, most application writers who have optimized their code will tell
> you that unified-memory-only programs tend not to work too well, and that
> hybrid models (copy predictable data, use unified memory for unstructured
> data) are likely to deliver better performance. So you could simplify your
> implementation at the cost of precluding async for the most optimized
> programs.
>
> Happy to discuss it further and to explore alternative implementations
> with you.
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Gheorghe-Teod Bercea/US/IBM
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexey Bataev <a.bataev at hotmail.com>, "Doerfert, Johannes" <
> jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>,
> Ye Luo <xw111luoye at gmail.com>, Alexandre Eichenberger/Watson/IBM at IBMUS
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 1:49 PM
>
> I'm adding Alex to this thread. He should be able to shed some light on
> this issue.
>
> Thanks,
>
> --Doru
>
>
>
> From: "Finkel, Hal J." <hfinkel at anl.gov>
> To: Ye Luo <xw111luoye at gmail.com>
> Cc: "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>, Alexey
> Bataev <a.bataev at hotmail.com>, Gheorghe-Teod Bercea <
> gheorghe-teod.bercea at ibm.com>, "Doerfert, Johannes" <jdoerfert at anl.gov>
> Date: 03/20/2019 01:13 PM
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> ------------------------------
>
>
>
> Thanks, Ye. I suppose that I thought it always worked that way :-)
>
> Alexey, Doru, do you know if there's any semantic problem or other
> concerns with enabling this option and/or making it the default?
>
> -Hal
>
> On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:
>
> - Hi all,
> After going through the source, I didn't find CUDA stream support.
> Luckily, I only need to add
> #define CUDA_API_PER_THREAD_DEFAULT_STREAM
> before
> #include <cuda.h>
> in libomptarget/plugins/cuda/src/rtl.cpp
> Then multiple target regions go to different streams and may execute
> concurrently.
> #pragma omp parallel
> {
>   #pragma omp target
>   {
>     // offload computation
>   }
> }
> This is exactly what I want.
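>
> For reference, the change described above would sit at the very top of
> libomptarget/plugins/cuda/src/rtl.cpp, roughly like this (a sketch of the
> edit, not an actual patch):
>
> // With this macro defined before cuda.h is included, driver-API calls that
> // use the default stream use a per-thread default stream instead of the
> // single legacy default stream shared by all host threads.
> #define CUDA_API_PER_THREAD_DEFAULT_STREAM
> #include <cuda.h>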
>
> I know the XL compiler uses streams in a different way but achieves
> similar effects.
> Is there anyone working on using streams with OpenMP target in LLVM?
> Will clang-ykt get something similar to XL and upstream it to
> mainline?
>
> If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the CUDA
> rtl, will it cause any trouble?
> As a compiler user, I'd prefer a proper solution rather than
> a patch just for myself.
>
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> Ye Luo <xw111luoye at gmail.com> wrote on Sun, Mar 17, 2019 at 2:26 PM:
> - Hi,
> How do I turn on streams when using OpenMP offload?
> When different host threads individually start target regions (even
> without using nowait), the offloaded computation goes to different CUDA
> streams and may execute concurrently. This is currently available in XL.
> With Clang, nvprof shows that the run only uses the default stream.
> Is there a way to do that with Clang?
> On the other hand,
> nvcc has the option --default-stream per-thread.
> I'm not familiar with Clang CUDA; is there a similar option?
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2019-03-20 16:48:31.png
Type: image/png
Size: 73225 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-dev/attachments/20190320/80f2d52e/attachment-0001.png>