[Openmp-dev] Potential offload kernel data race
Ye Luo via Openmp-dev
openmp-dev at lists.llvm.org
Thu Jun 25 10:45:09 PDT 2020
Could you add documentation for this option?
fopenmp-cuda-parallel-target-region
The name doesn't reflect the actual effect.
The differential review mentions
```
Added support for dynamic memory allocation for globalized variables in
case if execution of target regions in parallel is required.
```
So what are the globalized variables, and when do they occur? Since this
impacts performance, application developers would like to know more details.
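To make sure I am asking about the right thing, here is a minimal sketch of
the pattern I assume forces globalization (my own reading, not taken from the
review; all names are made up): a local variable declared inside a target
region that is shared by an inner parallel region, so it cannot stay in a
per-thread register or stack slot.
```cpp
#include <cstdio>
#include <omp.h>

int main() {
  int result = 0;
#pragma omp target map(tofrom : result)
  {
    // `scratch` is declared by the initial thread of the target region but is
    // written by every thread of the inner parallel region, so I assume the
    // compiler has to "globalize" it, i.e. place it in memory visible to the
    // whole team instead of a per-thread register or stack slot.
    int scratch[32] = {0};
#pragma omp parallel num_threads(32)
    {
      int tid = omp_get_thread_num();
      scratch[tid] = tid; // all threads write the escaping local array
    }
    for (int i = 0; i < 32; ++i) // initial thread reads the results back
      result += scratch[i];
  }
  printf("result = %d\n", result); // expect 0 + 1 + ... + 31 = 496
  return 0;
}
```
Is this the kind of variable the patch is about?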
```
It might use "slow" allocation functions, in general, since it may use
malloc on the device side. Being disabled, it uses statically preallocated
memory, which might be faster, if parallel target regions are not required.
```
In the slow code path, is malloc always called on the device, or only when a
size limit is exceeded? Is the call generated explicitly by clang, or does
the CUDA driver handle it?
I saw some UseSharedMemory flags in the review. Does the slow code path use
CUDA shared memory and fall back to global memory when the size limit is
exceeded?
In the fast code path, what is the scope of this statically preallocated
memory? Is it owned by the CUDA plugin?
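For context, the situation I care about is the one from my report quoted
below: several host threads each offloading a target region that contains an
inner parallel region, which is how I read "execution of target regions in
parallel". A minimal sketch of that usage pattern (hypothetical names, not
the miniapp code):
```cpp
#include <cstdio>
#include <omp.h>

int main() {
  constexpr int nwalkers = 8;
  int results[nwalkers] = {};

  // Host-level parallelism: each host thread launches its own target region,
  // so several device kernels may be in flight at the same time.
#pragma omp parallel for num_threads(nwalkers)
  for (int w = 0; w < nwalkers; ++w) {
    int sum = 0;
#pragma omp target map(tofrom : sum) firstprivate(w)
    {
      int scratch = w; // lives in the target region, shared by the threads below
#pragma omp parallel num_threads(64) reduction(+ : sum)
      sum += scratch; // inner parallel region on the device
    }
    results[w] = sum; // expect 64 * w if 64 threads are granted
  }

  for (int w = 0; w < nwalkers; ++w)
    printf("walker %d -> %d\n", w, results[w]);
  return 0;
}
```
If the statically preallocated memory is shared across kernels, I wonder how
it behaves when several such kernels run concurrently.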
Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <a.bataev at outlook.com> wrote:
> Hi Ye, the patch is committed, you can check if the compiler works for you.
>
> -------------
> Best regards,
> Alexey Bataev
>
> 22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:
>
> It is working well now. I tested both the miniapp and the full application.
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Mon, Jun 22, 2020 at 11:56 AM Alexey.Bataev <a.bataev at outlook.com>
> wrote:
>
>> Hi Ye, could you try your code with the attached patch for clang?
>>
>> -------------
>> Best regards,
>> Alexey Bataev
>>
>> 19.06.2020 6:45 PM, Ye Luo via Openmp-dev wrote:
>>
>> Yes. It makes the code work.
>> Do you have any idea what is wrong with the code pattern?
>> Before the compiler fix gets in, I can put the workaround into the real
>> code, which has multiple places similar to the miniapp.
>> Best,
>> Ye
>> ===================
>> Ye Luo, Ph.D.
>> Computational Science Division & Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>> On Fri, Jun 19, 2020 at 4:35 PM Alexey.Bataev <a.bataev at outlook.com>
>> wrote:
>>
>>> Try to use `omp parallel firstprivate(first, last,
>>> offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.
>>>
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>> 16.06.2020 9:26 PM, Ye Luo via Openmp-dev wrote:
>>>
>>> Hi all,
>>> I got a libomptarget failure when offloading the same kernel function from
>>> multiple host threads. I reported the bug as
>>> https://bugs.llvm.org/show_bug.cgi?id=46257.
>>>
>>> After a bit of investigation, I was able to narrow down the scope of the
>>> issue a bit.
>>> The error from the CUDA runtime is "an illegal memory access was
>>> encountered".
>>>
>>> 1) On the libomptarget side
>>> In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization calls
>>> before and after cuLaunchKernel and a mutex to protect the kernel execution.
>>>
>>>     Err = cuCtxSynchronize();
>>>     if (!checkResult(Err, "Error before cuLaunchKernel\n"))
>>>       return OFFLOAD_FAIL;
>>>
>>>     //kernelmtx.lock();
>>>     Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid,
>>>                          /* gridDimY */ 1, /* gridDimZ */ 1,
>>>                          CudaThreadsPerBlock,
>>>                          /* blockDimY */ 1, /* blockDimZ */ 1,
>>>                          /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
>>>     if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
>>>       return OFFLOAD_FAIL;
>>>
>>>     Err = cuCtxSynchronize();
>>>     //kernelmtx.unlock();
>>>     if (!checkResult(Err, "Error after cuLaunchKernel\n"))
>>>       return OFFLOAD_FAIL;
>>>
>>> Without the mutex, the first error from all the threads is "Error after
>>> cuLaunchKernel". After enabling the mutex, the code runs well. If I move
>>> the mutex unlock() before the cuCtxSynchronize, the code still runs into
>>> the error. So I think the error comes from the kernel execution, not from
>>> something else.
>>>
>>> 2) On my application side, I tried to see what triggers the error. Once I
>>> commented out "omp parallel" in the hierarchical parallelism at
>>> https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258
>>> offloading from multiple host threads started to run without error.
>>>
>>> I'd like to ask the wider community about potential causes of this bug.
>>> Is there a thread-unsafe state machine behind a kernel generated from a
>>> target region, or is there any other potential reason for the error?
>>>
>>> Best,
>>> Ye
>>> ===================
>>> Ye Luo, Ph.D.
>>> Computational Science Division & Leadership Computing Facility
>>> Argonne National Laboratory
>>>