<div dir="ltr"><div>Do you have an example using shared memory via #pragma omp allocate?</div><div>Ye<br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:09 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Yes, reduction variables can also be candidates for globalization if the reduction variable must be shared in the parallel region. You can try to control this manually using the allocate clause.
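(As a rough illustration only, not code from this thread: one way such manual control might look, assuming the implementation honors the predefined omp_pteam_mem_alloc allocator in device code; the function and variable names below are invented.)

```
#include <omp.h>

// Invented example: request team-shared storage for the privatized copies of
// the reduction variable via the allocate clause (OpenMP 5.0 syntax).
double device_sum(const double *x, int n) {
  double sum = 0.0;
  #pragma omp target map(to : x[:n]) map(tofrom : sum)
  {
    #pragma omp parallel for reduction(+ : sum) \
        allocate(omp_pteam_mem_alloc : sum)
    for (int i = 0; i < n; ++i)
      sum += x[i];
  }
  return sum;
}
```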
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>25.06.2020 3:05 PM, Ye Luo пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
I cannot use -fopenmp-cuda-mode. I don't remember exactly where now (the reduction?), but it breaks other things in my code.

Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 2:02 PM Alexey.Bataev <a.bataev@outlook.com> wrote:
You can use shared memory directly by using #pragma omp allocate and allocate clauses. Plus, shared memory is actively used by the runtime. But I agree that it would be good to allow controlling the size.
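(A minimal sketch of that kind of usage, with invented names; omp_pteam_mem_alloc is the predefined team-shared allocator, and whether a given implementation actually backs it with CUDA shared memory is implementation-defined.)

```
#include <omp.h>

// Invented example: a scratch buffer declared by the master thread of the
// target region, placed with the allocate directive and then shared with the
// parallel region.
void fill(float *out, int n) {
  #pragma omp target map(from : out[:n])
  {
    float scratch[32];                      // master-thread local
    #pragma omp allocate(scratch) allocator(omp_pteam_mem_alloc)

    for (int i = 0; i < 32; ++i)            // initialized before the parallel region
      scratch[i] = (float)i;

    #pragma omp parallel for                // scratch is shared with the team here
    for (int i = 0; i < n; ++i)
      out[i] = scratch[i % 32];
  }
}
```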
BTW, you can disable this globalization mechanism by using the -fopenmp-cuda-mode option. In this case, you need to control sharing of the variables yourself.

-------------
Best regards,
Alexey Bataev
On 25.06.2020 2:51 PM, Ye Luo wrote:
I think the hard-coded size of 128 is suboptimal. Can we allow users to select it?
Since we cannot use shared memory directly, I believe there is plenty of shared memory available.
It seems that my test code exceeds 128 bytes, and I'd like to set a higher limit.

Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 1:02 PM Alexey Bataev <a.bataev@hotmail.com> wrote:
<div dir="auto"> Globalized variables are the
variables that must be shared between the threads
but initially allocated by the master thread. On
GPU, local memory is private for threads, and the
local variables cannkt be shared between the
threads. But OpenMP requires that such variables
must be shared between the threads. So, if we see,
that the local variable is allocated in the context
of the master thread and later is shared in parallel
region (or potentially can be shared, if it may
escape its declaration context by the reference or
by the address), such variables are allocated either
in shared or in global memory. If the total size of
globalized variables is less than 128 bytes, they
are allocated in shared memory. Otherwise, fallback
to malloc. If -fno-penmp-cuda-parallel-target-region
is used, then again, if total size <= 128 bytes -
shared memory is used, otherwise statically
allocated global memory buffer is used.<br>
<br>
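(For readers following along, a minimal sketch of the pattern being described, with invented names: a local variable created by the master thread of the target region and then referenced in the inner parallel region.)

```
// Invented example: 'buf' is created by the master thread of the target
// region and then read inside the parallel region, so the compiler must
// "globalize" it, i.e. move it out of thread-private local memory into shared
// or global memory. Its 64 bytes fall under the 128-byte threshold described
// above, so it would be expected to land in shared memory.
void globalization_pattern(double *out, int n) {
  #pragma omp target map(from : out[:n])
  {
    double buf[8];
    for (int i = 0; i < 8; ++i)
      buf[i] = 0.5 * i;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      out[i] = buf[i % 8];
  }
}
```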
<div dir="ltr">Best regards,
<div>Alexey Bataev</div>
</div>
<div dir="ltr"><br>
<blockquote type="cite">25 июня 2020 г., в 13:45,
Ye Luo <<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>>
написал(а):<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
Could you add documentation for this option?
fopenmp-cuda-parallel-target-region
The name doesn't reflect the actual effect.
The differential review mentions
```
Added support for dynamic memory allocation for globalized variables in
case if execution of target regions in parallel is required.
```
So what are the globalized variables? When do they occur? Since this impacts performance, application developers want to know more details.
```
It might use "slow" allocation functions, in general, since it may use malloc on the device side. Being disabled, it uses statically preallocated memory, which might be faster, if parallel target regions are not required.
```
In the slow code path, does it always call malloc on the device, or is malloc called only when a size limit is exceeded? Is the call explicitly generated by clang, or does the CUDA driver handle it?
I saw some UseSharedMemory flags in the review. Does the slow code path use CUDA shared memory, and does it need global memory when the size limit is exceeded?
In the fast code path, what is the scope of this statically preallocated memory? Is it owned by the CUDA plugin?

Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <a.bataev@outlook.com> wrote:
Hi Ye, the patch is committed, you can check if the compiler works for you.

-------------
Best regards,
Alexey Bataev

On 22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:
<blockquote type="cite">
<div dir="ltr">It is working well now. I
tested both the miniapp and the full
application.<br clear="all">
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science
Division & Leadership
Computing Facility<br>
Argonne National
Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Mon, Jun 22, 2020 at 11:56 AM
Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Ye, could you try your code
with the attached patch for the
clang? <br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>19.06.2020 6:45 PM, Ye Luo
via Openmp-dev пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Yes. It makes the code
working.</div>
<div>Do you have any idea
about the code pattern?<br>
</div>
<div>Before the compiler fix
gets in, I can put the
workaround in the real code
which has multiple places
similar to the miniapp.</div>
<div>Best,<br>
</div>
<div>Ye<br>
</div>
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Fri, Jun 19, 2020 at 4:35 PM Alexey.Bataev <a.bataev@outlook.com> wrote:
Try to use `omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.
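(A simplified sketch of the suggested workaround, with invented stand-in names rather than the actual miniqmc variables: the values needed by the inner parallel region are passed by value via firstprivate, so nothing has to be shared.)

```
// Invented, simplified shape of the workaround (not the miniqmc code).
void kernel(float *scratch, const float *psiinv, int nblocks, int blocksize) {
  #pragma omp target teams distribute \
      map(tofrom : scratch[:nblocks * blocksize]) \
      map(to : psiinv[:nblocks * blocksize])
  for (int ib = 0; ib < nblocks; ++ib) {
    int first = ib * blocksize;
    int last = first + blocksize;
    float *scratch_ptr = scratch;           // stand-in for offload_scratch_iVP_ptr
    const float *psiinv_ptr = psiinv;       // stand-in for psiinv_ptr

    // Private copies are made for each device thread, so the compiler does
    // not have to share/globalize these variables.
    #pragma omp parallel firstprivate(first, last, scratch_ptr, psiinv_ptr)
    {
      #pragma omp for
      for (int j = first; j < last; ++j)
        scratch_ptr[j] += psiinv_ptr[j];
    }
  }
}
```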
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>16.06.2020 9:26 PM,
Ye Luo via Openmp-dev
пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
Hi all,
I got a libomptarget failure when offloading the same kernel function from multiple host threads. I reported the bug as https://bugs.llvm.org/show_bug.cgi?id=46257.

After a bit of investigation, I was able to narrow down the scope of the issue.
The error from the CUDA runtime is "an illegal memory access was encountered".

1) On the libomptarget side
In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization before and after cuLaunchKernel and a mutex to protect the kernel execution.
    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                         /* gridDimZ */ 1, CudaThreadsPerBlock,
                         /* blockDimY */ 1, /* blockDimZ */ 1,
                         /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;
Without the mutex, the first error from all the threads is "Error after cuLaunchKernel". After enabling the mutex, the code runs well. If I move the mutex unlock() before the cuCtxSynchronize, the code still runs into the error. So I think the error comes from the kernel execution, not from something else.

2) On my application side, I tried to see what triggers the error. Once I commented out the "omp parallel" in the hierarchical parallelism at https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258, offloading from multiple host threads started to run without error.
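(A hypothetical, stripped-down reduction of this pattern, not the actual miniqmc code: several host threads concurrently offload the same target region, which itself contains a nested omp parallel.)

```
#include <omp.h>

// Invented reduction of the failing pattern described above.
void evaluate(float *w, int n) {
  #pragma omp target teams distribute map(tofrom : w[:n])
  for (int block = 0; block < n; block += 32) {
    #pragma omp parallel for   // commenting this out made the runs succeed
    for (int i = block; i < block + 32 && i < n; ++i)
      w[i] += 1.0f;
  }
}

void driver(float **walkers, int nwalkers, int n) {
  #pragma omp parallel for     // host threads, each offloading the same kernel
  for (int iw = 0; iw < nwalkers; ++iw)
    evaluate(walkers[iw], n);
}
```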
I'd like to ask the wider community about potential causes of this bug.
Is there a thread-unsafe state machine associated with a kernel generated from a target region? Or any other potential reason for the error?

Best,
Ye

===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Openmp-dev mailing list
Openmp-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev