[Openmp-dev] Potential offload kernel data race

Ye Luo via Openmp-dev <openmp-dev at lists.llvm.org>
Thu Jun 25 12:54:22 PDT 2020


Thank you. I will try that.
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory


On Thu, Jun 25, 2020 at 2:16 PM Alexey.Bataev <a.bataev at outlook.com> wrote:

> int a;
>
> #pragma omp allocate(a) allocator(omp_pteam_mem_alloc)
>
> or, for privatized variables:
>
> #pragma omp ... private(a) allocate(omp_pteam_mem_alloc: a)
>
> But you should be very careful here too: it might lead to a data race. The
> variables are still statically allocated, so to avoid the race you will
> probably need to create an array of the variables and allocate memory for
> it.
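>
> A minimal sketch of both forms (an illustration only, not code from this
> thread; the names are made up and the caveat above still applies):
>
> ```
> #include <omp.h>
>
> void sketch() {
>   #pragma omp target teams
>   {
>     // Directive form: bind a team-local variable to the pteam allocator.
>     int a;
>     #pragma omp allocate(a) allocator(omp_pteam_mem_alloc)
>     a = 0;
>
>     // Clause form: the private copies of `b` are requested from
>     // omp_pteam_mem_alloc; `a` is read (shared) by all threads of the team.
>     int b = 0;
>     #pragma omp parallel private(b) allocate(omp_pteam_mem_alloc: b)
>     {
>       b = a + omp_get_thread_num();
>     }
>   }
> }
> ```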
>
> -------------
> Best regards,
> Alexey Bataev
>
> 25.06.2020 3:10 PM, Ye Luo wrote:
>
> Do you have an example using shared memory via #pragma omp allocate?
> Ye
> ===================
> Ye Luo, Ph.D.
> Computational Science Division & Leadership Computing Facility
> Argonne National Laboratory
>
>
> On Thu, Jun 25, 2020 at 2:09 PM Alexey.Bataev <a.bataev at outlook.com>
> wrote:
>
>> Yes, reduction variables can also be candidates for globalization if the
>> reduction variable must be shared in the parallel region. You can try to
>> control it manually using the allocate clause.
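>>
>> For example, a rough sketch (mine, not tested against your code; the names
>> are illustrative and allocate-clause support may vary by compiler version):
>>
>> ```
>> #include <omp.h>
>>
>> float device_sum(const float *x, int n) {
>>   float sum = 0.0f;
>>   // Ask for the reduction copies of `sum` to come from the pteam allocator.
>>   #pragma omp target teams distribute parallel for map(to: x[0:n]) \
>>       reduction(+: sum) allocate(omp_pteam_mem_alloc: sum)
>>   for (int i = 0; i < n; ++i)
>>     sum += x[i];
>>   return sum;
>> }
>> ```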
>>
>> -------------
>> Best regards,
>> Alexey Bataev
>>
>> 25.06.2020 3:05 PM, Ye Luo wrote:
>>
>> I cannot use -fopenmp-cuda-mode. I don't remember exactly where
>> (reduction?), but it breaks other things in my code.
>> Ye
>> ===================
>> Ye Luo, Ph.D.
>> Computational Science Division & Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>> On Thu, Jun 25, 2020 at 2:02 PM Alexey.Bataev <a.bataev at outlook.com>
>> wrote:
>>
>>> You can use shared memory directly by using the #pragma omp allocate
>>> directive and allocate clauses. Also, shared memory is actively used by the
>>> runtime itself. But I agree that it would be good to allow controlling the
>>> size.
>>>
>>> BTW, you can disable this globalization mechanism with the
>>> -fopenmp-cuda-mode option. In that case, you need to control sharing of the
>>> variables yourself.
>>>
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>> 25.06.2020 2:51 PM, Ye Luo wrote:
>>>
>>> I think the hard-coded size of 128 bytes is suboptimal. Can we allow users
>>> to select it?
>>> Since we cannot use shared memory directly, I believe there is plenty of
>>> shared memory available.
>>> It seems that my test code exceeds 128 bytes, and I'd like to set a higher
>>> limit.
>>> Ye
>>> ===================
>>> Ye Luo, Ph.D.
>>> Computational Science Division & Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>> On Thu, Jun 25, 2020 at 1:02 PM Alexey Bataev <a.bataev at hotmail.com>
>>> wrote:
>>>
>>>> Globalized variables are variables that must be shared between the
>>>> threads but are initially allocated by the master thread. On a GPU, local
>>>> memory is private to each thread, so local variables cannot be shared
>>>> between the threads. But OpenMP requires that such variables be shared
>>>> between the threads. So, if we see that a local variable is allocated in
>>>> the context of the master thread and is later shared in a parallel region
>>>> (or potentially can be shared, if it may escape its declaration context by
>>>> reference or by address), such a variable is allocated either in shared or
>>>> in global memory. If the total size of the globalized variables is less
>>>> than 128 bytes, they are allocated in shared memory; otherwise, we fall
>>>> back to malloc. If -fno-openmp-cuda-parallel-target-region is used, then
>>>> again, if the total size is <= 128 bytes, shared memory is used;
>>>> otherwise, a statically allocated global memory buffer is used.
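>>>>
>>>> As a schematic example (mine, not from the patch) of the kind of local
>>>> variable that gets globalized:
>>>>
>>>> ```
>>>> #include <omp.h>
>>>>
>>>> void example() {
>>>>   #pragma omp target teams
>>>>   {
>>>>     // `scratch` is declared in the context of the master thread of each
>>>>     // team but is used by all threads of the inner parallel region, so
>>>>     // the compiler must globalize it. At 64 bytes it fits the 128-byte
>>>>     // shared-memory budget described above.
>>>>     double scratch[8];
>>>>     #pragma omp parallel num_threads(8)
>>>>     scratch[omp_get_thread_num()] = 1.0;
>>>>   }
>>>> }
>>>> ```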
>>>>
>>>> Best regards,
>>>> Alexey Bataev
>>>>
>>>> On 25 June 2020, at 13:45, Ye Luo <xw111luoye at gmail.com> wrote:
>>>>
>>>> Could you add documentation for the -fopenmp-cuda-parallel-target-region
>>>> option? The name doesn't reflect the actual effect.
>>>> The differential review mentions
>>>> ```
>>>> Added support for dynamic memory allocation for globalized variables in
>>>> case if execution of target regions in parallel is required.
>>>> ```
>>>> So what are globalized variables? When do they occur? Since this impacts
>>>> performance, application developers would like to know more details.
>>>> ```
>>>>
>>>> It might use "slow" allocation functions, in general, since it may use
>>>> malloc on the device side. Being disabled, it uses statically preallocated
>>>> memory, which might be faster, if parallel target regions are not required.
>>>>
>>>> ```
>>>> In the slow code path, does it always call malloc on the device, or is
>>>> malloc only called when a size limit is exceeded? Is the call explicitly
>>>> generated by clang, or does the CUDA driver handle it?
>>>> I saw some UseSharedMemory flags in the review. Does the slow code path
>>>> use CUDA shared memory and only need global memory when the size limit is
>>>> exceeded?
>>>> In the fast code path, what is the scope of this statically preallocated
>>>> memory? Is it owned by the CUDA plugin?
>>>>
>>>> Thanks,
>>>> Ye
>>>> ===================
>>>> Ye Luo, Ph.D.
>>>> Computational Science Division & Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>>>>
>>>> On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <a.bataev at outlook.com>
>>>> wrote:
>>>>
>>>>> Hi Ye, the patch is committed, you can check if the compiler works for
>>>>> you.
>>>>>
>>>>> -------------
>>>>> Best regards,
>>>>> Alexey Bataev
>>>>>
>>>>> 22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:
>>>>>
>>>>> It is working well now. I tested both the miniapp and the full
>>>>> application.
>>>>> ===================
>>>>> Ye Luo, Ph.D.
>>>>> Computational Science Division & Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>>
>>>>> On Mon, Jun 22, 2020 at 11:56 AM Alexey.Bataev <a.bataev at outlook.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ye, could you try your code with the attached patch for the clang?
>>>>>>
>>>>>> -------------
>>>>>> Best regards,
>>>>>> Alexey Bataev
>>>>>>
>>>>>> 19.06.2020 6:45 PM, Ye Luo via Openmp-dev wrote:
>>>>>>
>>>>>> Yes, it makes the code work.
>>>>>> Do you have any idea why this code pattern causes the problem?
>>>>>> Before the compiler fix gets in, I can put the workaround into the real
>>>>>> code, which has multiple places similar to the miniapp.
>>>>>> Best,
>>>>>> Ye
>>>>>> ===================
>>>>>> Ye Luo, Ph.D.
>>>>>> Computational Science Division & Leadership Computing Facility
>>>>>> Argonne National Laboratory
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 19, 2020 at 4:35 PM Alexey.Bataev <a.bataev at outlook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Try to use `omp parallel firstprivate(first, last,
>>>>>>> offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.
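>>>>>>>
>>>>>>> A self-contained sketch of the shape of that workaround (the types,
>>>>>>> sizes, and enclosing loop are assumptions for illustration, not the
>>>>>>> actual miniapp code):
>>>>>>>
>>>>>>> ```
>>>>>>> void sketch(float *offload_scratch_iVP_ptr, const float *psiinv_ptr,
>>>>>>>             int first, int last, int nw) {
>>>>>>>   #pragma omp target teams distribute \
>>>>>>>       map(tofrom: offload_scratch_iVP_ptr[0:nw*last]) \
>>>>>>>       map(to: psiinv_ptr[0:last])
>>>>>>>   for (int iw = 0; iw < nw; ++iw) {
>>>>>>>     int i = iw;
>>>>>>>     // firstprivate hands every thread its own copy of the captured
>>>>>>>     // scalars and pointers, so no team-local variable has to be
>>>>>>>     // shared (and therefore globalized) by the inner region.
>>>>>>>     #pragma omp parallel firstprivate(first, last, \
>>>>>>>         offload_scratch_iVP_ptr, psiinv_ptr, i)
>>>>>>>     {
>>>>>>>       #pragma omp for
>>>>>>>       for (int j = first; j < last; ++j)
>>>>>>>         offload_scratch_iVP_ptr[i * last + j] = psiinv_ptr[j];
>>>>>>>     }
>>>>>>>   }
>>>>>>> }
>>>>>>> ```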
>>>>>>>
>>>>>>> -------------
>>>>>>> Best regards,
>>>>>>> Alexey Bataev
>>>>>>>
>>>>>>> 16.06.2020 9:26 PM, Ye Luo via Openmp-dev wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>> I got a libomptarget failure when offloading the same kernel function
>>>>>>> from multiple host threads. I reported the bug as
>>>>>>> https://bugs.llvm.org/show_bug.cgi?id=46257.
>>>>>>>
>>>>>>> After a bit of investigation, I was able to narrow down the scope of
>>>>>>> the issue somewhat.
>>>>>>> The error from the CUDA runtime is "an illegal memory access was
>>>>>>> encountered".
>>>>>>>
>>>>>>> 1) On the libomptarget side
>>>>>>> In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization
>>>>>>> calls before and after cuLaunchKernel and a mutex to protect the kernel
>>>>>>> execution.
>>>>>>>
>>>>>>>     Err = cuCtxSynchronize();
>>>>>>>     if (!checkResult(Err, "Error before cuLaunchKernel\n"))
>>>>>>>       return OFFLOAD_FAIL;
>>>>>>>
>>>>>>>     //kernelmtx.lock();
>>>>>>>     Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid,
>>>>>>>                          /* gridDimY */ 1, /* gridDimZ */ 1,
>>>>>>>                          CudaThreadsPerBlock,
>>>>>>>                          /* blockDimY */ 1, /* blockDimZ */ 1,
>>>>>>>                          /* sharedMemBytes */ 0, Stream,
>>>>>>>                          &Args[0], nullptr);
>>>>>>>     if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
>>>>>>>       return OFFLOAD_FAIL;
>>>>>>>
>>>>>>>     Err = cuCtxSynchronize();
>>>>>>>     //kernelmtx.unlock();
>>>>>>>     if (!checkResult(Err, "Error after cuLaunchKernel\n"))
>>>>>>>       return OFFLOAD_FAIL;
>>>>>>>
>>>>>>> Without the mutex, the first error from all the threads is "Error
>>>>>>> after cuLaunchKernel". After enabling the mutex, the code runs well. If I
>>>>>>> move the mutex unlock() to before the cuCtxSynchronize, the code still
>>>>>>> runs into the error. So I think the error comes from the kernel
>>>>>>> execution, not from something else.
>>>>>>>
>>>>>>> 2) On my application side, I tried to see what triggers the error.
>>>>>>> Once I commented out the "omp parallel" in the hierarchical parallelism at
>>>>>>> https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258
>>>>>>> the offloading from multiple host threads started to run without error.
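>>>>>>>
>>>>>>> Schematically, the pattern looks roughly like this (a simplified
>>>>>>> sketch, not the actual einspline code; the inner "omp parallel" is the
>>>>>>> one I commented out):
>>>>>>>
>>>>>>> ```
>>>>>>> void pattern(float *out, int nw, int m) {
>>>>>>>   #pragma omp target teams distribute map(tofrom: out[0:nw])
>>>>>>>   for (int iw = 0; iw < nw; ++iw) {
>>>>>>>     // Team-local accumulator shared with the inner parallel region,
>>>>>>>     // i.e. the kind of variable that gets globalized.
>>>>>>>     float acc = 0.0f;
>>>>>>>     #pragma omp parallel reduction(+: acc)
>>>>>>>     {
>>>>>>>       #pragma omp for
>>>>>>>       for (int j = 0; j < m; ++j)
>>>>>>>         acc += 1.0f; // placeholder work
>>>>>>>     }
>>>>>>>     out[iw] = acc;
>>>>>>>   }
>>>>>>> }
>>>>>>> ```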
>>>>>>>
>>>>>>> I'd like to ask the wider community about potential causes of this
>>>>>>> bug.
>>>>>>> Is there a thread-unsafe state machine associated with a kernel
>>>>>>> generated from a target region? Or is there any other potential reason
>>>>>>> for the error?
>>>>>>>
>>>>>>> Best,
>>>>>>> Ye
>>>>>>> ===================
>>>>>>> Ye Luo, Ph.D.
>>>>>>> Computational Science Division & Leadership Computing Facility
>>>>>>> Argonne National Laboratory
>>>>>>>