<div dir="ltr"><div></div><div>I think the hard-coded size of 128 bytes is suboptimal. Can we allow users to select it?</div><div>Since users cannot use shared memory directly, I believe there is plenty of shared memory available.</div><div>It seems that my test code exceeds 128 bytes, and I'd like to set a higher limit.<br></div><div>Ye<br></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>

Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>

Argonne National Laboratory</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 1:02 PM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com">a.bataev@hotmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div dir="auto">

Globalized variables are variables that must be shared between threads but are initially allocated by the master thread. On a GPU, local memory is private to each thread, so local variables cannot be shared between threads. But OpenMP requires that such variables be shared between the threads. So, if we see that a local variable is allocated in the context of the master thread and is later shared in a parallel region (or can potentially be shared, if it may escape its declaration context by reference or by address), such a variable is allocated either in shared or in global memory. If the total size of the globalized variables is less than 128 bytes, they are allocated in shared memory. Otherwise, we fall back to malloc. If -fno-openmp-cuda-parallel-target-region is used, then again, if the total size is <= 128 bytes, shared memory is used; otherwise a statically allocated global memory buffer is used.<br>
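<div>A minimal sketch of the kind of variable described above, assuming a simple summation kernel. The function name <tt>sum_with_shared_local</tt> and the variable <tt>counter</tt> are hypothetical and are not taken from the patch or from the miniapp. The local is declared by the initial thread of the target region and then shared by the inner parallel region, so on a GPU it cannot stay in thread-private local memory.</div>

```cpp
#include <cassert>

// Hypothetical example: returns 0 + 1 + ... + (n - 1).
// Without OpenMP support the pragmas are ignored and the code runs serially.
int sum_with_shared_local(int n) {
  assert(n >= 0);
  int result = 0;
  #pragma omp target map(tofrom: result)
  {
    // Declared in the context of the initial (master) thread of the target region.
    int counter = 0;
    // Shared by every thread of the inner parallel region, so the compiler
    // has to place it somewhere all of those threads can reach.
    #pragma omp parallel for shared(counter)
    for (int i = 0; i < n; ++i) {
      #pragma omp atomic
      counter += i;
    }
    result = counter;
  }
  return result;
}
```

<div>Here <tt>counter</tt> is only 4 bytes, so under the rule described above it would be expected to fit within the shared-memory limit; a larger local array shared in the same way is the case that could exceed it.</div>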

<br>

<div dir="ltr">Best regards,

<div>Alexey Bataev</div>

</div>

<div dir="ltr"><br>

<blockquote type="cite">On June 25, 2020, at 13:45, Ye Luo <<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>> wrote:<br>

<br>

</blockquote>

</div>

<blockquote type="cite">

<div dir="ltr">

<div dir="ltr">

<div>Could you add documentation for this option?</div>

<div><span><tt>fopenmp-cuda-parallel-target-region</tt></span></div>

<div><span><tt>The name doesn't reflect the actual effect.</tt></span></div>

<div><span><tt>The differential review mentions</tt></span></div>

<div><span><tt>```<br>

</tt></span></div>

<div><span><tt>Added support for dynamic memory allocation for globalized variables in<br>

case if execution of target regions in parallel is required.</tt></span></div>

<div><span><tt>```</tt></span></div>

<div><span><tt>So what are the globalized variables? When do they occur? Since this impacts performance, application developers would like to know more details.<br>

</tt></span></div>

<div><span><tt>```<br>

</tt></span></div>

<div><span><tt><span>

<p>It might use "slow" allocation functions, in general, since it may use malloc on the device side. Being disabled, it uses statically preallocated memory, which might be faster, if parallel target regions are not required.</p>

<p>```</p>

In the slow code path, does it always call malloc on the device, or is malloc called only when a size limit is exceeded? Is the call explicitly generated by clang, or does the CUDA driver handle it?<br>

</span></tt></span></div>

<div><span><tt><span>I saw some

<span>UseSharedMemory flags in the review. Does the slow code path use CUDA shared memory, and may it need global memory when the size limit is exceeded?<br>

</span></span></tt></span></div>

<div><span><tt><span>In the fast code path, what is the scope of this statically preallocated memory? Is it owned by the CUDA plugin?</span></tt></span></div>

<div><span><tt><span><br>

</span></tt></span></div>

<div><span><tt><span>Thanks,<br>

</span></tt></span></div>

<div><span><tt><span>Ye<br>

</span></tt></span></div>

<div>

<div>

<div dir="ltr">

<div dir="ltr">

<div>


</div>

</div>

</div>

</div>

<br>

</div>

</div>

<br>

<div class="gmail_quote">

<div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>

<p>Hi Ye, the patch is committed; you can check whether the compiler works for you.<br>

</p>

<pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

<div>22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:<br>

</div>

<blockquote type="cite">

<div dir="ltr">It is working well now. I tested both the miniapp and the full application.<br clear="all">

<div>

<div>

<div dir="ltr">

<div dir="ltr">

<div>


</div>

</div>

</div>

</div>

<br>

</div>

</div>

<br>

<div class="gmail_quote">

<div dir="ltr" class="gmail_attr">On Mon, Jun 22, 2020 at 11:56 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>

<p>Hi Ye, could you try your code with the attached patch for the clang? <br>

</p>

<pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

<div>19.06.2020 6:45 PM, Ye Luo via Openmp-dev wrote:<br>

</div>

<blockquote type="cite">

<div dir="ltr">

<div>Yes. It makes the code work.</div>

<div>Do you have any idea about the code pattern?<br>

</div>

<div>Before the compiler fix gets in, I can put the workaround in the real code which has multiple places similar to the miniapp.</div>

<div>Best,<br>

</div>

<div>Ye<br>

</div>

<div>

<div>

<div dir="ltr">

<div dir="ltr">

<div>


</div>

</div>

</div>

</div>

<br>

</div>

</div>

<br>

<div class="gmail_quote">

<div dir="ltr" class="gmail_attr">On Fri, Jun 19, 2020 at 4:35 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>

<p>Try to use `omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.<br>

</p>
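<p>A minimal sketch of this suggestion, assuming a tiled loop nest. The function <tt>scale_tiles</tt> and its parameters are hypothetical stand-ins, not the miniapp code. Listing the per-team locals in <tt>firstprivate</tt> gives each thread of the inner region its own copy, so they no longer need to be shared between threads.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: multiplies every element of 'data' by 'factor',
// processing the array in tiles of 'tile_size' elements.
// Without OpenMP support the pragmas are ignored and the code runs serially.
void scale_tiles(std::vector<double>& data, int tile_size, double factor) {
  assert(tile_size > 0);
  double* ptr = data.data();
  const int n = static_cast<int>(data.size());
  const int num_tiles = (n + tile_size - 1) / tile_size;
  #pragma omp target teams distribute map(tofrom: ptr[0:n])
  for (int t = 0; t < num_tiles; ++t) {
    // Locals computed by the initial thread of each team.
    int first = t * tile_size;
    int last = (first + tile_size < n) ? first + tile_size : n;
    // firstprivate copies the values into each thread instead of sharing
    // the team-level locals with the inner parallel region.
    #pragma omp parallel for firstprivate(first, last, ptr, factor)
    for (int i = first; i < last; ++i)
      ptr[i] *= factor;
  }
}
```

<p>For example, calling <tt>scale_tiles(v, 2, 2.0)</tt> on a five-element vector doubles every element, with the last tile holding a single element.</p>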

<pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

<div>16.06.2020 9:26 PM, Ye Luo via Openmp-dev wrote:<br>

</div>

<blockquote type="cite">

<div dir="ltr">

<div>Hi all,</div>

<div>I got a libomptarget failure when offloading the same kernel function from multiple host threads. I reported the bug as

<a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>

<div><br>

</div>

<div>After a bit of investigation, I was able to narrow down the scope of the issue.</div>

<div>The error from CUDA runtime is "an illegal memory access was encountered".</div>

<div><br>

</div>

<div>

<div>A) On the libomptarget side<br>

</div>

</div>

<div>In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization calls before and after cuLaunchKernel, and a mutex to protect the kernel execution.<br>

</div>

<div><br>

</div>

<div>    Err = cuCtxSynchronize();<br>

    if (!checkResult(Err, "Error before cuLaunchKernel\n"))<br>

      return OFFLOAD_FAIL;<br>

<br>

    //kernelmtx.lock();<br>

    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,<br>

                         /* gridDimZ */ 1, CudaThreadsPerBlock,<br>

                         /* blockDimY */ 1, /* blockDimZ */ 1,<br>

                         /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);<br>

    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))<br>

      return OFFLOAD_FAIL;<br>

<br>

    Err = cuCtxSynchronize();<br>

    //kernelmtx.unlock();<br>

    if (!checkResult(Err, "Error after cuLaunchKernel\n"))<br>

      return OFFLOAD_FAIL;</div>

<div><br>

</div>

<div>Without the mutex, the first error from all the threads is "Error after cuLaunchKernel". After enabling the mutex, the code runs well. If I move the mutex unlock() before the cuCtxSynchronize, the code still runs into the error. So I think the error comes from kernel execution, not from something else.</div>
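<div>The locking pattern described above can be sketched without any CUDA dependency. The names <tt>kernel_mutex</tt> and <tt>guarded_launch</tt> and the two callables are hypothetical; in the real plugin they would correspond to the cuLaunchKernel and cuCtxSynchronize calls. The point of the sketch is that the lock covers both the launch and the following synchronization, which is the configuration that ran without error.</div>

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical global mutex serializing kernel execution across host threads.
std::mutex kernel_mutex;

// Runs 'launch' and then 'synchronize' while holding the mutex.
// Both callables return 0 on success and a nonzero error code otherwise.
int guarded_launch(const std::function<int()>& launch,
                   const std::function<int()>& synchronize) {
  // The guard is released only when the function returns, i.e. after the
  // synchronization has completed, so kernels from different host threads
  // never execute concurrently.
  std::lock_guard<std::mutex> lock(kernel_mutex);
  int err = launch();
  if (err != 0)
    return err;
  return synchronize();
}
```

<div>Releasing the lock between the two calls would correspond to moving unlock() before cuCtxSynchronize, the variant that still failed.</div>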

<div><br>

</div>

<div>B) On my application side, I tried to see what triggers the error by commenting out "omp parallel" in the hierarchical parallelism:

<a href="https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258" target="_blank">

https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258</a><br>

</div>

<div>With that change, offloading from multiple host threads runs without error.</div>
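<div>For readers without access to the miniapp, the overall pattern can be sketched as below. This is a simplified, hypothetical stand-in (the function <tt>offload_from_host_threads</tt>, the fixed count of 4 teams, and the summation are all assumptions), not a confirmed reproducer: several host threads each launch the same target region, which in turn contains an inner parallel region.</div>

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: each host thread offloads the same kernel, which sums
// 0..n-1 in 4 teams and returns the combined total for that host thread.
// Without OpenMP support the pragmas are ignored and the code runs serially.
std::vector<long> offload_from_host_threads(int num_host_threads, int n) {
  assert(num_host_threads >= 0 && n >= 0);
  std::vector<long> totals(num_host_threads, 0);
  long* out = totals.data();
  // Outer level: host threads, each issuing its own target region.
  #pragma omp parallel for
  for (int h = 0; h < num_host_threads; ++h) {
    long total = 0;
    #pragma omp target teams distribute reduction(+ : total) map(tofrom: total)
    for (int team = 0; team < 4; ++team) {
      long team_sum = 0;
      // Inner level: the "omp parallel" of the hierarchical parallelism.
      #pragma omp parallel for reduction(+ : team_sum)
      for (int i = 0; i < n; ++i)
        team_sum += i;
      total += team_sum;
    }
    out[h] = total;
  }
  return totals;
}
```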

<div><br>

</div>

<div>I'd like to ask the wider community about potential causes of this bug.<br>

</div>

<div>Is there a thread-unsafe state machine in a kernel generated from a target region? Or is there any other potential reason for the error?</div>

<div><br>

</div>

<div>Best,</div>

<div>Ye<br>

</div>

<div>

<div dir="ltr">

<div dir="ltr">

<div>


</div>

</div>

</div>

</div>

</div>

<br>

<fieldset></fieldset>

<pre>_______________________________________________

Openmp-dev mailing list

<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>

</pre>

</blockquote>

</div>

</blockquote>

</div>

<br>


</blockquote>

</div>

</blockquote>

</div>

<br>


</blockquote>

</div>

</blockquote>

</div>


</div>

</blockquote>

</div>

</blockquote></div>