<div dir="ltr"><div>Could you add documentation for this option?</div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">fopenmp-cuda-parallel-target-region</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">The name doesn't reflect the actual effect.</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">The differential review mentions</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">Added support for dynamic memory allocation for globalized variables in<br>
case if execution of target regions in parallel is required.</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```</tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">So what are the globalized variables? When do they occur? since it impacts performance, application developers desire to know more details.<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced">```<br></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment"><p>It
 might use "slow" allocation functions, in general, since it may use 
malloc on the device side. Being disabled, it uses statically 
preallocated memory, which might be faster, if parallel target regions 
are not required.</p><p>```</p>In the slow code path, does it always call malloc in the device or malloc gets called when exceeding a size limit? Is the call explicitly generated by clang or the cuda driver handles it?<br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">I saw some <span class="gmail-c1">UseSharedMemory flags in the review. Is the slow code path uses CUDA shared memory and it may need global memory when the size limit is exceeded?<br></span></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">In the fast code path, what is the scope of this statically 
preallocated memory? Is it owned by the CUDA plugin?</span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment"><br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">Thanks,<br></span></tt></span></div><div><span class="gmail-transaction-comment"><tt class="gmail-remarkup-monospaced"><span class="gmail-transaction-comment">Ye<br></span></tt></span></div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  
  <div>
    <p>Hi Ye, the patch is committed, you can check if the compiler
      works for you.<br>
    </p>
    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
    <div>22.06.2020 1:29 PM, Ye Luo via
      Openmp-dev пишет:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">It is working well now. I tested both the miniapp
        and the full application.<br clear="all">
        <div>
          <div>
            <div dir="ltr">
              <div dir="ltr">
                <div>
                  <div dir="ltr">===================<br>
                    Ye Luo, Ph.D.<br>
                    Computational Science Division & Leadership
                    Computing Facility<br>
                    Argonne National Laboratory</div>
                </div>
              </div>
            </div>
          </div>
          <br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Jun 22, 2020 at 11:56
          AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Hi Ye, could you try your code with the attached patch
              for the clang? <br>
            </p>
            <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
            <div>19.06.2020 6:45 PM, Ye Luo via Openmp-dev пишет:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>Yes. It makes the code working.</div>
                <div>Do you have any idea about the code pattern?<br>
                </div>
                <div>Before the compiler fix gets in, I can put the
                  workaround in the real code which has multiple places
                  similar to the miniapp.</div>
                <div>Best,<br>
                </div>
                <div>Ye<br>
                </div>
                <div>
                  <div>
                    <div dir="ltr">
                      <div dir="ltr">
                        <div>
                          <div dir="ltr">===================<br>
                            Ye Luo, Ph.D.<br>
                            Computational Science Division &
                            Leadership Computing Facility<br>
                            Argonne National Laboratory</div>
                        </div>
                      </div>
                    </div>
                  </div>
                  <br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Fri, Jun 19, 2020
                  at 4:35 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Try to use `omp parallel firstprivate(first,
                      last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as
                      the inner parallel directive.<br>
                    </p>
                    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                    <div>16.06.2020 9:26 PM, Ye Luo via Openmp-dev
                      пишет:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>Hi all,</div>
                        <div>I got libomptarget failure when offloading
                          the same kernel function from multiple host
                          threads. I reported the bug as <a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>
                        <div><br>
                        </div>
                        <div>After a bit investigation, I'm able to
                          narrow down a bit the scope of the issue.</div>
                        <div>The error from CUDA runtime is "an illegal
                          memory access was encountered".</div>
                        <div><br>
                        </div>
                        <div>
                          <div>A) On the libomptarget side<br>
                          </div>
                        </div>
                        <div>In libomptarget/plugins/cuda/src/rtl.cpp, I
                          added a few synchronization before and after
                          cuLaunchKernel and a mutex to protect the
                          kernel execution.<br>
                        </div>
                        <div><br>
                        </div>
                        <div>    Err = cuCtxSynchronize();<br>
                              if (!checkResult(Err, "Error before
                          cuLaunchKernel\n"))<br>
                                return OFFLOAD_FAIL;<br>
                          <br>
                              //kernelmtx.lock();<br>
                              Err = cuLaunchKernel(KernelInfo->Func,
                          CudaBlocksPerGrid, /* gridDimY */ 1,<br>
                                                   /* gridDimZ */ 1,
                          CudaThreadsPerBlock,<br>
                                                   /* blockDimY */ 1, /*
                          blockDimZ */ 1,<br>
                                                   /* sharedMemBytes */
                          0, Stream, &Args[0], nullptr);<br>
                              if (!checkResult(Err, "Error returned from
                          cuLaunchKernel\n"))<br>
                                return OFFLOAD_FAIL;<br>
                          <br>
                              Err = cuCtxSynchronize();<br>
                              //kernelmtx.unlock();<br>
                              if (!checkResult(Err, "Error after
                          cuLaunchKernel\n"))<br>
                                return OFFLOAD_FAIL;</div>
                        <div><br>
                        </div>
                        <div>Without the mutex. the first error from all
                          the threads is "Error after cuLaunchKernel".
                          After enabling the mutex, the code runs well.
                          If I move the mutex unlock() before the
                          cuCtxSynchronize, the code still runs into
                          error. So I think the error comes from kernel
                          execution not something else.</div>
                        <div><br>
                        </div>
                        <div>2) on my application side, I tried to see
                          what triggers the error. Once I commented out
                          "omp parallel" in the hierarchical
                          parallelism. <a href="https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258" target="_blank">https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258</a><br>
                        </div>
                        <div>Multi host threads offloading start to run
                          without error.</div>
                        <div><br>
                        </div>
                        <div>I'd like to ask the wider community to see
                          potential causes of this bug.<br>
                        </div>
                        <div>Is there a thread unsafe state machine with
                          a kernel generated from a target region? Or
                          any other potential reason for the error?</div>
                        <div><br>
                        </div>
                        <div>Best,</div>
                        <div>Ye<br>
                        </div>
                        <div>
                          <div dir="ltr">
                            <div dir="ltr">
                              <div>
                                <div dir="ltr">===================<br>
                                  Ye Luo, Ph.D.<br>
                                  Computational Science Division &
                                  Leadership Computing Facility<br>
                                  Argonne National Laboratory</div>
                              </div>
                            </div>
                          </div>
                        </div>
                      </div>
                      <br>
                      <fieldset></fieldset>
                      <pre>_______________________________________________
Openmp-dev mailing list
<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
              <br>
              <fieldset></fieldset>
              <pre>_______________________________________________
Openmp-dev mailing list
<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
            </blockquote>
          </div>
        </blockquote>
      </div>
      <br>
      <fieldset></fieldset>
      <pre>_______________________________________________
Openmp-dev mailing list
<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
    </blockquote>
  </div>

</blockquote></div>