[cfe-dev] openmp 4.5 and cuda streams

Thu Oct 31 13:38:27 PDT 2019

I'm not sure about the API; most probably only some internal work is
required. Better to ask Alex Eichenberger, he knows more about this.

-------------
Best regards,
Alexey Bataev

31.10.2019 4:36 PM, Finkel, Hal J. wrote:
>
>
> On 10/31/19 3:06 PM, Alexey Bataev wrote:
>>
>> Hope to send this message from the main dev e-mail this time :)
>>
>>
>> Well, about the memory: it depends on the number of kernels you have.
>> All the memory in the kernels that must be globalized is squashed
>> into a union. With streams, we need to use a separate structure for
>> each kernel. Plus, we can no longer use shared memory for this
>> buffer, again because of possible conflicts.
>>
>>
>> We can add a new compiler option to compile only some files with
>> stream support and use a unique memory buffer for the globalized
>> variables. Plus, some work in libomptarget is required, of course.
>>
>
> Do we also need some kind of libomptarget API change in order to
> communicate the fact that it's allowed to run multiple target regions
> concurrently?
>
>
> Thanks again,
>
> Hal
>
>
>>
>> -------------
>> Best regards,
>> Alexey Bataev
>> 31.10.2019 3:58 PM, Finkel, Hal J. wrote:
>>>
>>>
>>> On 10/31/19 10:54 AM, Luo, Ye wrote:
>>>> Hi Hal,
>>>> My experience with llvm/clang so far shows:
>>>> 1. All target offload is blocking and synchronous, using the default
>>>> stream. nowait is not supported.
>>>> 2. All memory transfers invoke cudaMemcpy. There are no
>>>> async calls.
>>>> 3. In the past I experimented with turning on
>>>> CUDA_API_PER_THREAD_DEFAULT_STREAM in libomptarget.
>>>> I then used multiple host threads to do individual blocking
>>>> synchronous offloads. I got it sort of running and saw multiple
>>>> streams, but the code crashed due to memory corruption, probably
>>>> caused by a data race in libomptarget.
>>>
>>>
>>> Thanks, Ye. That's consistent with Alexey's comments.
>>>
>>>
>>> Is there already a bug open on this? If not, we should open one.
>>>
>>>
>>> Alexey, the buffer-reuse optimizations in Clang that you mentioned,
>>> how much memory/overhead do they save? Is it worth keeping them in
>>> some mode?
>>>
>>>
>>>  -Hal
>>>
>>>
>>>> Best,
>>>> Ye
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Finkel, Hal J. <hfinkel at anl.gov>
>>>> *Sent:* Wednesday, October 30, 2019 1:40 PM
>>>> *To:* Alessandro Gabbana <gbblsn at unife.it>; cfe-dev at lists.llvm.org
>>>> <cfe-dev at lists.llvm.org>; Luo, Ye <yeluo at anl.gov>; Doerfert,
>>>> Johannes <jdoerfert at anl.gov>
>>>> *Subject:* Re: [cfe-dev] openmp 4.5 and cuda streams
>>>>  
>>>> [+Ye, Johannes]
>>>>
>>>> I recall that we've also observed this behavior. Ye, Johannes, we
>>>> had a workaround and a patch, correct?
>>>>
>>>>   -Hal
>>>>
>>>> On 10/30/19 12:28 PM, Alessandro Gabbana via cfe-dev wrote:
>>>> > Dear All,
>>>> >
>>>> > I'm using clang 9.0.0 to compile code that offloads some of its
>>>> > sections to a GPU using the OpenMP target construct.
>>>> > I also use the nowait clause to overlap the execution of certain
>>>> > kernels and/or host<->device memory transfers.
>>>> > However, using the NVIDIA profiler, I've noticed that when I compile
>>>> > the code with clang only one CUDA stream is active,
>>>> > and therefore the execution gets serialized. On the other hand, when
>>>> > compiling with XLC I see that kernels are executed
>>>> > on different streams. I could not work out whether this is the expected
>>>> > behavior (e.g. the nowait clause is currently not supported)
>>>> > or whether I'm missing something. I'm using an NVIDIA Tesla P100 GPU and
>>>> > compiling with the following options:
>>>> >
>>>> > -target x86_64-pc-linux-gnu -fopenmp
>>>> > -fopenmp-targets=nvptx64-nvidia-cuda
>>>> > -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_60
>>>> >
>>>> > best wishes
>>>> >
>>>> > Alessandro
>>>> >
>>>>
>>>> -- 
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>