[Openmp-dev] OpenMP offload implicitly using streams
Ye Luo via Openmp-dev
openmp-dev at lists.llvm.org
Wed Mar 20 12:28:39 PDT 2019
Thank you Alex for the explanation. It is a very nice read for understanding
how fully asynchronous offload works. I believe this is the eventual goal,
but it will take time to get in.
At the moment, I'm looking for a way that needs only blocking offload
but still gets concurrent execution.
My code pattern is:

#pragma omp parallel
{ // parallel is at a very high level of the code hierarchy and contains
  // loops over target regions
  #pragma omp target
  { // target is very local
    //offload computation
  }
}
Using one stream per host thread easily achieves concurrent execution.
I noticed some presentations showing a pipelining offload example using
nowait, e.g. slide 30 of
http://on-demand.gputechconf.com/gtc/2018/presentation/s8344-openmp-on-gpus-first-experiences-and-best-practices.pdf
If we enable one stream per thread, the pipelining can work as well, even
with blocking offload:
#pragma omp parallel for
for(int i=0; i<nblocks; i++)
{
  #pragma omp target
  {
    //offload the computation of one block
  }
}
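
For comparison, the nowait-based pipelining on slide 30 would look roughly
like the sketch below; the buffer a, the block size bs, and the map/depend
clauses are illustrative placeholders rather than code copied from the
slides.

#pragma omp parallel
#pragma omp single
{
  for(int i=0; i<nblocks; i++)
  {
    // each block becomes a deferred target task; independent blocks may overlap
    #pragma omp target map(tofrom: a[i*bs:bs]) nowait depend(inout: a[i*bs:bs])
    {
      //offload the computation of one block
    }
  }
  #pragma omp taskwait // wait for all deferred target tasks
}

With blocking offload plus one stream per host thread, the simpler
parallel-for version above gives similar overlap without any nowait.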
So I'm thinking of turning on one stream per thread as a way to gain
concurrency now; I'm not aware of any negative side effects.
Best,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
Alexandre Eichenberger <alexe at us.ibm.com> wrote on Wed, Mar 20, 2019 at 1:26 PM:
> Hal,
>
> Supporting async for targets is not trivial. Right now, since everything
> is "blocking", a target can easily be separated into 5 tasks:
>
>
> 1. wait for dependences to resolve
> 2. perform all the copy to the device
> 3. execute the target
> 4. perform all the copy from device to host
> 5. resolve all dependences for other tasks
>
> Going async à la LOMP has many advantages: all operations are asynchronous,
> and dependences from target to target tasks are directly enforced on the
> device. To us, this was the only way that users could effectively hide the
> high overhead of generating targets, by enqueuing many dependent target
> tasks on the host, to prime the pipe of targets on the devices.
>
> To do so, I believe you have to do the following (a rough CUDA driver-API
> sketch of Steps 2-5 follows the list).
>
> 1. wait for all the host dependences; dependences from other device
> targets are tabulated but not waited for
> 2. select a stream (or one stream from a dependent target, if any)
> and enqueue an event wait for all tabulated dependences
> 3. enqueue all copy to device on stream (or enqueue sync event for
> data currently being copied over by other targets)
> 4. enqueue computation on stream
> 5. enqueue all copy from device on stream (this is speculative, as ref
> count may increase by another target executed before the data is actually
> copied back, but it's legal)
> 6. cleanup
>    1. blocking: wait for the stream to finish
>    2. non-blocking: have a callback from CUDA (which involves a
>    separate thread) or have active polling by OpenMP threads when doing
>    nothing and/or before doing a subsequent target task, to determine
>    when the stream is finished
>    3. once 1 or 2 above has finished, clean up the map data structures
>    and resolve dependences for dependent tasks.
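>
> As a rough illustration of Steps 2-5, the enqueue side on a single stream
> might look like the sketch below at the CUDA driver-API level; the function
> name, buffers, sizes, and events are placeholders, not the actual LOMP or
> libomptarget code.
>
>   #include <cuda.h>
>
>   // Hedged sketch: enqueue one target region's work on its own stream.
>   void enqueue_target(CUstream stream, CUevent dep_event, CUevent done_event,
>                       CUfunction kernel, CUdeviceptr d_buf, void *h_buf,
>                       size_t bytes, unsigned grid, unsigned block) {
>     cuStreamWaitEvent(stream, dep_event, 0);        // Step 2: wait on tabulated deps
>     cuMemcpyHtoDAsync(d_buf, h_buf, bytes, stream); // Step 3: copy to device
>     void *args[] = {&d_buf};
>     cuLaunchKernel(kernel, grid, 1, 1, block, 1, 1,
>                    0, stream, args, nullptr);       // Step 4: enqueue computation
>     cuMemcpyDtoHAsync(h_buf, d_buf, bytes, stream); // Step 5: copy back (speculative)
>     cuEventRecord(done_event, stream);              // Step 6 waits or polls on this
>   }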
>
> This is compounded by the fact that async data movements are only
> performed with pinned memory, and CUDA memory cannot be allocated
> directly on the fly, as allocation is a synchronizing event. So the
> runtime must manage its own pool of device and pinned memory, which
> requires additional work in Steps 3, 5, and 6.3 above.
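>
> As a minimal illustration of that last point (placeholder names, not what
> LOMP or libomptarget actually do), the pinned staging memory would be
> allocated once up front and recycled, so the per-target path never hits a
> synchronizing allocation:
>
>   #include <cuda.h>
>
>   // Toy pinned-memory pool: pay the synchronizing allocation cost once,
>   // then hand the pinned buffer out to target tasks for async copies.
>   struct PinnedPool {
>     void *buf = nullptr;
>     size_t bytes = 0;
>     void init(size_t n) { cuMemAllocHost(&buf, n); bytes = n; } // blocking, done at startup
>     void fini() { cuMemFreeHost(buf); }                         // blocking, done at shutdown
>   };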
>
> To perform the cleanup in Step 6, you also need to cache all info
> associated with a target in a dedicated data structure.
>
> As you may have noticed, if you want any async to work, you basically
> have to treat all targets as async; the synchronous ones differ only
> by having an explicit wait in Step 6.1. So all this handling is on the
> critical path.
>
> You will also need to carefully manage the CUDA events associated with any
> explicit data movements, as subsequent target operations may depend
> on an actual memory operation completing (in either direction).
>
> This has been done in LOMP; was it fun? Maybe not, but it's all feasible.
>
> There is a possible saving grace, namely that you could implement async
> only under unified memory, which would greatly simplify the whole thing:
> it would eliminate Steps 3 & 5 above and the associated bookkeeping.
>
> However, most application writers who have optimized their code will tell
> you that unified-only programs tend not to work too well, and that hybrid
> models (copy predictable data, use unified for unstructured data) are likely
> to deliver better performance. So you could simplify your implementation at
> the cost of precluding async for the most optimized programs.
>
> Happy to discuss it further, and explore with you alternative
> implementations.
>
> Alexandre
>
>
> -----------------------------------------------------------------------------------------------------
> Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies
> - research: compiler optimization (OpenMP, GPU, SIMD)
> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
> - phone: 914-945-1812 (work), 914-312-3618 (cell)
>
>
>
> ----- Original message -----
> From: Gheorghe-Teod Bercea/US/IBM
> To: "Finkel, Hal J." <hfinkel at anl.gov>
> Cc: Alexey Bataev <a.bataev at hotmail.com>, "Doerfert, Johannes" <
> jdoerfert at anl.gov>, "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>,
> Ye Luo <xw111luoye at gmail.com>, Alexandre Eichenberger/Watson/IBM at IBMUS
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> Date: Wed, Mar 20, 2019 1:49 PM
>
> I'm adding Alex to this thread. He should be able to shed some light on
> this issue.
>
> Thanks,
>
> --Doru
>
>
>
> From: "Finkel, Hal J." <hfinkel at anl.gov>
> To: Ye Luo <xw111luoye at gmail.com>
> Cc: "openmp-dev at lists.llvm.org" <openmp-dev at lists.llvm.org>, Alexey
> Bataev <a.bataev at hotmail.com>, Gheorghe-Teod Bercea <
> gheorghe-teod.bercea at ibm.com>, "Doerfert, Johannes" <jdoerfert at anl.gov>
> Date: 03/20/2019 01:13 PM
> Subject: Re: [Openmp-dev] OpenMP offload implicitly using streams
> ------------------------------
>
>
>
> Thanks, Ye. I suppose that I thought it always worked that way :-)
>
> Alexey, Doru, do you know if there's any semantic problem or other
> concerns with enabling this option and/or making it the default?
>
> -Hal
>
> On 3/20/19 11:32 AM, Ye Luo via Openmp-dev wrote:
>
> Hi all,
> After going through the source, I didn't find CUDA stream support.
> Luckily, I only need to add
> #define CUDA_API_PER_THREAD_DEFAULT_STREAM
> before
> #include <cuda.h>
> in libomptarget/plugins/cuda/src/rtl.cpp
> Then multiple target regions go to different streams and may execute
> concurrently:
> #pragma omp parallel
> {
>   #pragma omp target
>   {
>     //offload computation
>   }
> }
> This is exactly what I want.
>
> I know the XL compiler uses streams in a different way but achieves
> similar effects.
> Is there anyone working on using streams with OpenMP target in LLVM?
> Will clang-ykt get something similar to XL and upstream it to the
> mainline?
>
> If we just add #define CUDA_API_PER_THREAD_DEFAULT_STREAM in the CUDA
> RTL, will it cause any trouble?
> As a compiler user, I'd like to have a better solution rather than
> having a patch just for myself.
>
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> Ye Luo <xw111luoye at gmail.com> wrote on Sun, Mar 17, 2019 at 2:26 PM:
> Hi,
> How do I turn on streams when using OpenMP offload?
> When different host threads individually start target regions (even
> without using nowait), the offloaded computation goes to different CUDA
> streams and may execute concurrently. This is currently available in XL.
> With Clang, nvprof shows that the run only uses the default stream.
> Is there a way to do that with Clang?
> On the other hand, nvcc has the option --default-stream per-thread.
> I'm not familiar with Clang CUDA; is there a similar option?
> Best,
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>