[cfe-dev] openmp 4.5 and cuda streams

Thu Oct 31 12:58:28 PDT 2019

On 10/31/19 10:54 AM, Luo, Ye wrote:
Hi Hal,
My experience of llvm/clang so far shows:
1. All target offload is blocking and synchronous and uses the default stream. nowait is not supported.
2. All memory transfers invoke cudaMemcpy. There are no asynchronous calls.
3. In the past I experimented with turning on CUDA_API_PER_THREAD_DEFAULT_STREAM in libomptarget.
I then used multiple host threads, each doing its own blocking synchronous offload. I got it sort of running and saw multiple streams, but the code crashed with memory corruption, probably caused by a data race in libomptarget.
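
The host-thread pattern described in item 3 can be sketched roughly as follows. This is an illustrative sketch only, not the code from the experiment: the function name `scale_chunks` and its parameters are hypothetical, and whether each host thread actually gets its own CUDA stream depends on how libomptarget was built (for example with CUDA_API_PER_THREAD_DEFAULT_STREAM defined). As noted above, that configuration was observed to crash.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split `data` into `num_chunks` pieces and let one
// host thread per piece issue its own blocking target region.
void scale_chunks(std::vector<double>& data, int num_chunks, double factor) {
    assert(num_chunks > 0);
    const std::size_t n = data.size();
    double* base = data.data();

    // One host thread per chunk; each iteration is handled by one thread.
    #pragma omp parallel for num_threads(num_chunks) schedule(static, 1)
    for (int c = 0; c < num_chunks; ++c) {
        const std::size_t begin = n * c / num_chunks;
        const std::size_t end = n * (c + 1) / num_chunks;
        double* chunk = base + begin;
        const std::size_t len = end - begin;

        // Blocking offload: no nowait clause, so the issuing host thread
        // waits here until the transfers and the kernel have finished.
        #pragma omp target teams distribute parallel for \
            map(tofrom: chunk[0:len])
        for (std::size_t i = 0; i < len; ++i) {
            chunk[i] *= factor;
        }
    }
}
```

Each chunk maps a disjoint array section, so the host threads do not share mapped data. Any overlap between chunks on the device has to come from the runtime placing the threads' work on different streams.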

Thanks, Ye. That's consistent with Alexey's comments.

Is there already a bug open on this? If not, we should open one.

Alexey, the buffer-reuse optimizations in Clang that you mentioned, how much memory/overhead do they save? Is it worth keeping them in some mode?

 -Hal

Best,
Ye

________________________________
From: Finkel, Hal J. <hfinkel at anl.gov>
Sent: Wednesday, October 30, 2019 1:40 PM
To: Alessandro Gabbana <gbblsn at unife.it>; cfe-dev at lists.llvm.org; Luo, Ye <yeluo at anl.gov>; Doerfert, Johannes <jdoerfert at anl.gov>
Subject: Re: [cfe-dev] openmp 4.5 and cuda streams

[+Ye, Johannes]

I recall that we've also observed this behavior. Ye, Johannes, we had a
work-around and a patch, correct?

  -Hal

On 10/30/19 12:28 PM, Alessandro Gabbana via cfe-dev wrote:
> Dear All,
>
> I'm using clang 9.0.0 to compile a code that offloads sections to a
> GPU using the OpenMP target construct.
> I also use the nowait clause to overlap the execution of certain
> kernels and/or host<->device memory transfers.
> However, using the NVIDIA profiler I've noticed that when I compile
> the code with clang only one CUDA stream is active,
> and therefore the execution gets serialized. On the other hand, when
> compiling with XLC I see that kernels are executed
> on different streams. I could not determine whether this is the expected
> behavior (e.g. the nowait clause is currently not supported),
> or if I'm missing something. I'm using an NVIDIA Tesla P100 GPU and
> compiling with the following options:
>
> -target x86_64-pc-linux-gnu -fopenmp
> -fopenmp-targets=nvptx64-nvidia-cuda
> -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_60
>
> best wishes
>
> Alessandro
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
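
For reference, the nowait pattern described in the quoted message might look like the sketch below. It is not Alessandro's code: the function `axpy_pair` and its arguments are hypothetical. The two target regions touch disjoint data, so OpenMP 4.5 permits the runtime to run them concurrently, but it does not require it. Per the observations earlier in this thread, clang 9 serializes them on a single stream.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: two independent kernels issued as deferred target
// tasks, followed by a taskwait that joins both.
void axpy_pair(float a, const float* x, float* y,
               const float* u, float* v, std::size_t n) {
    assert(x != u && y != v);  // the two kernels must not share buffers

    // First kernel: y += a * x. nowait makes this a deferred target task.
    #pragma omp target teams distribute parallel for nowait \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }

    // Second kernel: v += a * u. It is independent of the first, so it may overlap.
    #pragma omp target teams distribute parallel for nowait \
        map(to: u[0:n]) map(tofrom: v[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        v[i] += a * u[i];
    }

    // Wait for both deferred target tasks before the host reads y and v.
    #pragma omp taskwait
}
```

Built with the options quoted above, a profiler trace of a call to this function would show whether the two kernels land on one stream or two.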

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
