<div dir="ltr"><div>Could you add documentation for this option?</div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">fopenmp-cuda-parallel-target-region</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">The name doesn't reflect the actual effect.</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">The differential review mentions</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">Added support for dynamic memory allocation for globalized variables in<br>

case if execution of target regions in parallel is required.</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">So what are the globalized variables? When do they occur? Since this option impacts performance, application developers would like to know more details.<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment"><p>It

 might use "slow" allocation functions, in general, since it may use 

malloc on the device side. Being disabled, it uses statically 

preallocated memory, which might be faster, if parallel target regions 

are not required.</p><p>```</p>In the slow code path, does it always call malloc on the device, or is malloc called only when a size limit is exceeded? Is the call explicitly generated by Clang, or does the CUDA driver handle it?<br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">I saw some <span class="gmail-c1">UseSharedMemory flags in the review. Does the slow code path use CUDA shared memory and fall back to global memory when the size limit is exceeded?<br></span></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">In the fast code path, what is the scope of this statically 

preallocated memory? Is it owned by the CUDA plugin?</span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment"><br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">Thanks,<br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">Ye<br></span></tt></span></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>

Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>

Argonne National Laboratory</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div>

    <p>Hi Ye, the patch is committed; you can check whether the compiler
      works for you.<br>

    </p>

    <pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

    <div>22.06.2020 1:29 PM, Ye Luo via

      Openmp-dev wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">It is working well now. I tested both the miniapp

        and the full application.<br clear="all">

        <div>

          <div>

            <div dir="ltr">

              <div dir="ltr">

                <div>

                  <div dir="ltr">===================<br>

                    Ye Luo, Ph.D.<br>

                    Computational Science Division & Leadership

                    Computing Facility<br>

                    Argonne National Laboratory</div>

                </div>

              </div>

            </div>

          </div>

          <br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Mon, Jun 22, 2020 at 11:56

          AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div>

            <p>Hi Ye, could you try your code with the attached patch
              for clang? <br>

            </p>

            <pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

            <div>19.06.2020 6:45 PM, Ye Luo via Openmp-dev wrote:<br>

            </div>

            <blockquote type="cite">

              <div dir="ltr">

                <div>Yes. It makes the code work.</div>

                <div>Do you have any idea about the code pattern?<br>

                </div>

                <div>Until the compiler fix lands, I can apply the
                  workaround in the real code, which has several places
                  similar to the miniapp.</div>

                <div>Best,<br>

                </div>

                <div>Ye<br>

                </div>

                <div>

                  <div>

                    <div dir="ltr">

                      <div dir="ltr">

                        <div>

                          <div dir="ltr">===================<br>

                            Ye Luo, Ph.D.<br>

                            Computational Science Division &

                            Leadership Computing Facility<br>

                            Argonne National Laboratory</div>

                        </div>

                      </div>

                    </div>

                  </div>

                  <br>

                </div>

              </div>

              <br>

              <div class="gmail_quote">

                <div dir="ltr" class="gmail_attr">On Fri, Jun 19, 2020

                  at 4:35 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>

                  wrote:<br>

                </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div>

                    <p>Try using `omp parallel firstprivate(first,
                      last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as
                      the inner parallel directive.<br>

                    </p>
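                    <div>A minimal sketch of this suggestion, assuming a much simpler kernel than the one in the miniapp. The function `scale_blocks` and its parameters are hypothetical and are not taken from miniqmc. Listing the variables in `firstprivate` gives each thread of the inner parallel region its own copy, so the region does not have to share them by reference. That sharing is presumably what causes the variables to be globalized on the device.</div>
<pre>
```cpp
#include <cassert>
#include <vector>

// Hypothetical example: every team handles one block of the array and the
// inner parallel region receives its inputs by value through firstprivate.
std::vector<int> scale_blocks(int num_blocks, int block_size, int factor) {
  assert(num_blocks >= 0 && block_size >= 0);
  std::vector<int> out(static_cast<std::size_t>(num_blocks) * block_size, 1);
  int* out_ptr = out.data();
  int total = num_blocks * block_size;

  #pragma omp target teams distribute map(tofrom: out_ptr[0:total])
  for (int b = 0; b < num_blocks; ++b) {
    // Team-local values computed before the inner parallel region.
    int first = b * block_size;
    int last = first + block_size;

    // firstprivate copies the values into each thread instead of
    // sharing the team-local variables by reference.
    #pragma omp parallel firstprivate(first, last, out_ptr, factor)
    {
      #pragma omp for
      for (int j = first; j < last; ++j)
        out_ptr[j] *= factor;
    }
  }
  return out;
}
```
</pre>
                    <div>Without OpenMP enabled, the pragmas are ignored and the loops run serially on the host, which is enough to check the logic before trying offload.</div>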

                    <pre cols="72">-------------

Best regards,

Alexey Bataev</pre>

                    <div>16.06.2020 9:26 PM, Ye Luo via Openmp-dev

                      wrote:<br>

                    </div>

                    <blockquote type="cite">

                      <div dir="ltr">

                        <div>Hi all,</div>

                        <div>I got a libomptarget failure when offloading

                          the same kernel function from multiple host

                          threads. I reported the bug as <a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>

                        <div><br>

                        </div>

                        <div>After some investigation, I was able to
                          narrow down the scope of the issue.</div>

                        <div>The error from the CUDA runtime is "an illegal

                          memory access was encountered".</div>

                        <div><br>

                        </div>

                        <div>

                          <div>A) On the libomptarget side<br>

                          </div>

                        </div>

                        <div>In libomptarget/plugins/cuda/src/rtl.cpp, I
                          added synchronization calls before and after
                          cuLaunchKernel, plus a mutex to protect the
                          kernel execution.<br>

                        </div>

                        <div><br>

                        </div>

                        <div>    Err = cuCtxSynchronize();<br>

                              if (!checkResult(Err, "Error before

                          cuLaunchKernel\n"))<br>

                                return OFFLOAD_FAIL;<br>

                          <br>

                              //kernelmtx.lock();<br>

                              Err = cuLaunchKernel(KernelInfo->Func,

                          CudaBlocksPerGrid, /* gridDimY */ 1,<br>

                                                   /* gridDimZ */ 1,

                          CudaThreadsPerBlock,<br>

                                                   /* blockDimY */ 1, /*

                          blockDimZ */ 1,<br>

                                                   /* sharedMemBytes */

                          0, Stream, &Args[0], nullptr);<br>

                              if (!checkResult(Err, "Error returned from

                          cuLaunchKernel\n"))<br>

                                return OFFLOAD_FAIL;<br>

                          <br>

                              Err = cuCtxSynchronize();<br>

                              //kernelmtx.unlock();<br>

                              if (!checkResult(Err, "Error after

                          cuLaunchKernel\n"))<br>

                                return OFFLOAD_FAIL;</div>

                        <div><br>

                        </div>

                        <div>Without the mutex, the first error reported by
                          every thread is "Error after cuLaunchKernel".
                          With the mutex enabled, the code runs correctly.
                          If I move the mutex unlock() before the second
                          cuCtxSynchronize, the code still fails.
                          So I think the error comes from the kernel
                          execution itself, not from something else.</div>
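                        <div>The mutex experiment described here can be sketched without CUDA as follows. This is an illustration only: `fake_launch_and_sync` is a hypothetical stand-in for the cuLaunchKernel and cuCtxSynchronize pair, and `kernel_mtx`, `launch_serialized`, and `run_host_threads` are names invented for the sketch, not libomptarget code. It shows that holding the lock across both the launch and the wait keeps at most one kernel in flight.</div>
<pre>
```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex kernel_mtx;              // plays the role of kernelmtx
std::atomic<int> in_flight{0};      // "kernels" currently executing
std::atomic<int> max_in_flight{0};  // highest concurrency observed

// Hypothetical stand-in for cuLaunchKernel followed by cuCtxSynchronize.
void fake_launch_and_sync() {
  int now = ++in_flight;
  int seen = max_in_flight.load();
  // Record the maximum number of overlapping "kernels".
  while (now > seen && !max_in_flight.compare_exchange_weak(seen, now)) {
  }
  std::this_thread::yield();  // give other host threads a chance to overlap
  --in_flight;
}

// The lock is held until the wait has finished, as in the experiment.
void launch_serialized() {
  std::lock_guard<std::mutex> guard(kernel_mtx);
  fake_launch_and_sync();
}

// Starts num_threads host threads and returns the observed maximum overlap.
int run_host_threads(int num_threads) {
  assert(num_threads > 0);
  max_in_flight = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t)
    pool.emplace_back(launch_serialized);
  for (auto& th : pool)
    th.join();
  return max_in_flight.load();
}
```
</pre>
                        <div>Releasing the lock before the wait would let launches from different host threads overlap again, which matches the observation that unlocking before cuCtxSynchronize brings the error back.</div>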

                        <div><br>

                        </div>

                        <div>B) On my application side, I tried to see
                          what triggers the error. I commented out
                          "omp parallel" in the hierarchical
                          parallelism: <a href="https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258" target="_blank">https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258</a><br>
                        </div>
                        <div>After that, offloading from multiple host
                          threads ran without error.</div>

                        <div><br>

                        </div>

                        <div>I'd like to ask the wider community about
                          potential causes of this bug.<br>
                        </div>
                        <div>Is there a thread-unsafe state machine in
                          a kernel generated from a target region? Or is
                          there any other potential reason for the error?</div>

                        <div><br>

                        </div>

                        <div>Best,</div>

                        <div>Ye<br>

                        </div>

                        <div>

                          <div dir="ltr">

                            <div dir="ltr">

                              <div>

                                <div dir="ltr">===================<br>

                                  Ye Luo, Ph.D.<br>

                                  Computational Science Division &

                                  Leadership Computing Facility<br>

                                  Argonne National Laboratory</div>

                              </div>

                            </div>

                          </div>

                        </div>

                      </div>

                      <br>

                      <fieldset></fieldset>

                      <pre>_______________________________________________

Openmp-dev mailing list

<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>

</pre>

                    </blockquote>

                  </div>

                </blockquote>

              </div>

              <br>


            </blockquote>

          </div>

        </blockquote>

      </div>

      <br>


    </blockquote>

  </div>

</blockquote></div>