[Openmp-dev] Potential offload kernel data race

Ye Luo via Openmp-dev openmp-dev at lists.llvm.org
Tue Jun 16 18:26:21 PDT 2020


Hi all,
I get a libomptarget failure when offloading the same kernel function from
multiple host threads. I reported the bug as
https://bugs.llvm.org/show_bug.cgi?id=46257.

After a bit of investigation, I was able to narrow down the scope of the
issue.
The error from the CUDA runtime is "an illegal memory access was encountered".

A) On the libomptarget side
In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization calls
before and after cuLaunchKernel, plus a mutex to protect the kernel execution.

    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid,
                         /* gridDimY */ 1, /* gridDimZ */ 1,
                         CudaThreadsPerBlock, /* blockDimY */ 1,
                         /* blockDimZ */ 1, /* sharedMemBytes */ 0, Stream,
                         &Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

Without the mutex, the first error reported by every thread is "Error after
cuLaunchKernel". With the mutex enabled, the code runs well. If I move
the unlock() call before the second cuCtxSynchronize, the code fails again.
So I think the error comes from the kernel execution itself, not from
something else.
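
The locking variant that ran cleanly can be sketched in isolation as below.
This is a minimal host-only sketch, not the actual rtl.cpp change:
launchSerialized, KernelMutex, and the two callables are hypothetical names
that stand in for kernelmtx, the cuLaunchKernel call, and the trailing
cuCtxSynchronize call. The point it illustrates is that the lock must cover
both the launch and the wait for completion.

```cpp
#include <functional>
#include <mutex>

// Hypothetical stand-in for the `kernelmtx` used in the experiment above.
static std::mutex KernelMutex;

// Runs Launch and then Sync while holding KernelMutex, so no other host
// thread can launch a kernel until this one has finished executing.
// Launch stands in for cuLaunchKernel and Sync for cuCtxSynchronize
// (assumption: each callable returns true on success).
// Returns true only if both steps succeed. The lock_guard releases the
// mutex on every return path, including the early return on a failed launch.
bool launchSerialized(const std::function<bool()> &Launch,
                      const std::function<bool()> &Sync) {
  std::lock_guard<std::mutex> Guard(KernelMutex);
  if (!Launch())
    return false;
  return Sync();
}
```

Releasing the lock between the two calls corresponds to the variant that
still failed, since a second thread's kernel can then start while the first
one is still running.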

B) On my application side, I tried to find what triggers the error. Once I
commented out the "omp parallel" in the hierarchical parallelism at
https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258
offloading from multiple host threads started to run without error.
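
The general shape of the failing pattern can be sketched as below. This is
an assumed, simplified illustration and not the miniqmc kernel: offloadSum
and countCorrectResults are hypothetical names, and whether a case this
small reproduces the failure is untested. It only shows the combination
involved: several host threads each launching a target region that contains
a nested "omp parallel". Without OpenMP enabled the pragmas are ignored and
the code runs serially on the host.

```cpp
#include <vector>

// Hypothetical kernel with two levels of parallelism: teams across the outer
// loop and a nested "omp parallel for" inside each team.
// Expects In.size() >= NumTeams * ChunkSize.
double offloadSum(const std::vector<double> &In, int NumTeams, int ChunkSize) {
  const double *Data = In.data();
  const int N = NumTeams * ChunkSize;
  double Sum = 0.0;
  #pragma omp target teams distribute map(to: Data[0:N]) map(tofrom: Sum) \
      reduction(+: Sum)
  for (int Team = 0; Team < NumTeams; ++Team) {
    double Partial = 0.0;
    // Inner level of the hierarchy: the kind of nested "omp parallel" that
    // was commented out in the experiment described above.
    #pragma omp parallel for reduction(+: Partial)
    for (int I = 0; I < ChunkSize; ++I)
      Partial += Data[Team * ChunkSize + I];
    Sum += Partial;
  }
  return Sum;
}

// Offloads the same kernel from several host threads at once and returns how
// many of them got the expected result.
int countCorrectResults(int NumHostThreads) {
  const int NumTeams = 4;
  const int ChunkSize = 64;
  const std::vector<double> In(NumTeams * ChunkSize, 1.0);
  int Correct = 0;
  #pragma omp parallel for num_threads(NumHostThreads) reduction(+: Correct)
  for (int T = 0; T < NumHostThreads; ++T) {
    // Every element is 1.0, so the exact sum is NumTeams * ChunkSize.
    if (offloadSum(In, NumTeams, ChunkSize) == 256.0)
      ++Correct;
  }
  return Correct;
}
```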

I'd like to ask the wider community about potential causes of this bug.
Is there a thread-unsafe state machine in a kernel generated from a
target region? Or is there any other potential reason for the error?

Best,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory