<div dir="ltr"><div>Do you have an example using shared memory via #pragma omp allocate?</div><div>Ye<br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:09 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
Yes, reduction variables can also be candidates for globalization if the reduction variable must be shared in the parallel region. You can try to control it manually using the allocate clause.
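Roughly like this (an untested sketch, names made up; whether the current clang honors the allocate clause in this position is something to verify):

    #include <omp.h>

    double device_sum(const double *in, int n) {
      double sum = 0.0;
      #pragma omp target map(to : in[0 : n]) map(tofrom : sum)
      {
        // Ask for the shared reduction variable to be placed in team
        // (CUDA shared) memory instead of being globalized through malloc.
        #pragma omp parallel for reduction(+ : sum) allocate(omp_pteam_mem_alloc : sum)
        for (int i = 0; i < n; ++i)
          sum += in[i];
      }
      return sum;
    }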
    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
    <div>25.06.2020 3:05 PM, Ye Luo пишет:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div>I cannot use openmp-cuda-mode. I don't remember now exactly
          where (reduction?) but it breaks other things in my code.<br>
        </div>
        <div>Ye<br>
        </div>
        <div>
          <div>
            <div>
              <div dir="ltr">
                <div dir="ltr">
                  <div>
                    <div dir="ltr">===================<br>
                      Ye Luo, Ph.D.<br>
                      Computational Science Division & Leadership
                      Computing Facility<br>
                      Argonne National Laboratory</div>
                  </div>
                </div>
              </div>
            </div>
            <br>
          </div>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:02
          PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
You can use shared memory directly by using #pragma omp allocate and allocate clauses. Plus, shared memory is actively used by the runtime. But I agree that it would be good to allow controlling the size.
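Something like this, for instance (an untested sketch, names made up; omp_pteam_mem_alloc is the OpenMP 5.0 allocator that should map to CUDA shared memory, but I have not checked this exact form against the current compiler):

    #include <omp.h>

    void scale(const double *in, double *out, int n) {
      #pragma omp target teams map(to : in[0 : n]) map(from : out[0 : n])
      {
        double scratch[16];              // team-local buffer...
        #pragma omp allocate(scratch) allocator(omp_pteam_mem_alloc)
        for (int j = 0; j < 16; ++j)     // ...filled by the team master
          scratch[j] = 1.0 + j;
        #pragma omp parallel for         // ...and read by all threads of the team
        for (int i = 0; i < n; ++i)      // from shared memory instead of being
          out[i] = in[i] * scratch[i % 16];  // globalized through malloc
      }
    }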
BTW, you can disable this globalization mechanism by using the -fopenmp-cuda-mode option. In this case, you need to control sharing the variables yourself.

-------------
Best regards,
Alexey Bataev

On 25.06.2020 2:51 PM, Ye Luo wrote:
            <blockquote type="cite">
              <div dir="ltr">
                <div>I think hard-coded size 128 is suboptimal. Can we
                  allow users to select that?</div>
                <div>Since we cannot use shared memory directly, I
                  believe there are plenty of shared memory available.</div>
                <div>It seems that my test code is exceeding 128 byte
                  and I'd like to set a higher limit.<br>
                </div>
                <div>Ye<br>
                </div>
                <div>
                  <div>
                    <div dir="ltr">
                      <div dir="ltr">
                        <div>
                          <div dir="ltr">===================<br>
                            Ye Luo, Ph.D.<br>
                            Computational Science Division &
                            Leadership Computing Facility<br>
                            Argonne National Laboratory</div>
                        </div>
                      </div>
                    </div>
                  </div>
                  <br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020
                  at 1:02 PM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com" target="_blank">a.bataev@hotmail.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div dir="auto"> Globalized variables are the
                    variables that must be shared between the threads
                    but initially allocated by the master thread. On
                    GPU, local memory is private for threads, and the
                    local variables cannkt be shared between the
                    threads. But OpenMP requires that such variables
                    must be shared between the threads. So, if we see,
                    that the local variable is allocated in the context
                    of the master thread and later is shared in parallel
                    region (or potentially can be shared, if it may
                    escape its declaration context by the reference or
                    by the address), such variables are allocated either
                    in shared or in global memory. If the total size of
                    globalized variables is less than 128 bytes, they
                    are allocated in shared memory. Otherwise, fallback
                    to malloc. If -fno-penmp-cuda-parallel-target-region
                    is used, then again, if total size <= 128 bytes -
                    shared memory is used, otherwise statically
                    allocated global memory buffer is used.<br>
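For example, a pattern like this triggers it (just an illustrative sketch, not taken from your code; the function and variable names are made up):

    void foo(double *out, int n) {
      #pragma omp target teams map(from : out[0 : n])
      {
        double buf[8];                 // declared in the team-master context (64 bytes),
        for (int j = 0; j < 8; ++j)    // filled sequentially by the master...
          buf[j] = j;
        #pragma omp parallel for       // ...then read inside the parallel region, so it
        for (int i = 0; i < n; ++i)    // must be globalized: shared memory here since
          out[i] = buf[i % 8];         // the total size is under 128 bytes, malloc (or
      }                                // the static buffer) once the limit is exceeded
    }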
Best regards,
Alexey Bataev

On June 25, 2020, at 13:45, Ye Luo <xw111luoye@gmail.com> wrote:
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div dir="ltr">
                          <div>Could you add documentation for this
                            option?</div>
                          <div><span><tt>fopenmp-cuda-parallel-target-region</tt></span></div>
                          <div><span><tt>The name doesn't reflect the
                                actual effect.</tt></span></div>
                          <div><span><tt>The differential review
                                mentions</tt></span></div>
                          <div><span><tt>```<br>
                              </tt></span></div>
                          <div><span><tt>Added support for dynamic
                                memory allocation for globalized
                                variables in<br>
                                case if execution of target regions in
                                parallel is required.</tt></span></div>
                          <div><span><tt>```</tt></span></div>
                          <div><span><tt>So what are the globalized
                                variables? When do they occur? since it
                                impacts performance, application
                                developers desire to know more details.<br>
                              </tt></span></div>
                          <div><span><tt>```<br>
                              </tt></span></div>
                          <div><span><tt><span>
                                  <p>It might use "slow" allocation
                                    functions, in general, since it may
                                    use malloc on the device side. Being
                                    disabled, it uses statically
                                    preallocated memory, which might be
                                    faster, if parallel target regions
                                    are not required.</p>
                                  <p>```</p>
                                  In the slow code path, does it always
                                  call malloc in the device or malloc
                                  gets called when exceeding a size
                                  limit? Is the call explicitly
                                  generated by clang or the cuda driver
                                  handles it?<br>
                                </span></tt></span></div>
                          <div><span><tt><span>I saw some <span>UseSharedMemory
                                    flags in the review. Is the slow
                                    code path uses CUDA shared memory
                                    and it may need global memory when
                                    the size limit is exceeded?<br>
                                  </span></span></tt></span></div>
                          <div><span><tt><span>In the fast code path,
                                  what is the scope of this statically
                                  preallocated memory? Is it owned by
                                  the CUDA plugin?</span></tt></span></div>
                          <div><span><tt><span><br>
                                </span></tt></span></div>
                          <div><span><tt><span>Thanks,<br>
                                </span></tt></span></div>
                          <div><span><tt><span>Ye<br>
                                </span></tt></span></div>
                          <div>
                            <div>
                              <div dir="ltr">
                                <div dir="ltr">
                                  <div>
                                    <div dir="ltr">===================<br>
                                      Ye Luo, Ph.D.<br>
                                      Computational Science Division
                                      & Leadership Computing
                                      Facility<br>
                                      Argonne National Laboratory</div>
                                  </div>
                                </div>
                              </div>
                            </div>
                            <br>
                          </div>
                        </div>
                        <br>
                        <div class="gmail_quote">
                          <div dir="ltr" class="gmail_attr">On Thu, Jun
                            25, 2020 at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                            wrote:<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div>
Hi Ye, the patch is committed, you can check whether the compiler works for you.

-------------
Best regards,
Alexey Bataev

On 22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:
                              <blockquote type="cite">
                                <div dir="ltr">It is working well now. I
                                  tested both the miniapp and the full
                                  application.<br clear="all">
                                  <div>
                                    <div>
                                      <div dir="ltr">
                                        <div dir="ltr">
                                          <div>
                                            <div dir="ltr">===================<br>
                                              Ye Luo, Ph.D.<br>
                                              Computational Science
                                              Division & Leadership
                                              Computing Facility<br>
                                              Argonne National
                                              Laboratory</div>
                                          </div>
                                        </div>
                                      </div>
                                    </div>
                                    <br>
                                  </div>
                                </div>
                                <br>
                                <div class="gmail_quote">
                                  <div dir="ltr" class="gmail_attr">On
                                    Mon, Jun 22, 2020 at 11:56 AM
                                    Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                                    wrote:<br>
                                  </div>
                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Hi Ye, could you try your code with the attached patch for clang?

-------------
Best regards,
Alexey Bataev

On 19.06.2020 6:45 PM, Ye Luo via Openmp-dev wrote:
                                      <blockquote type="cite">
                                        <div dir="ltr">
                                          <div>Yes. It makes the code
                                            working.</div>
                                          <div>Do you have any idea
                                            about the code pattern?<br>
                                          </div>
                                          <div>Before the compiler fix
                                            gets in, I can put the
                                            workaround in the real code
                                            which has multiple places
                                            similar to the miniapp.</div>
                                          <div>Best,<br>
                                          </div>
                                          <div>Ye<br>
                                          </div>
                                          <div>
                                            <div>
                                              <div dir="ltr">
                                                <div dir="ltr">
                                                  <div>
                                                    <div dir="ltr">===================<br>
                                                      Ye Luo, Ph.D.<br>
                                                      Computational
                                                      Science Division
                                                      & Leadership
                                                      Computing Facility<br>
                                                      Argonne National
                                                      Laboratory</div>
                                                  </div>
                                                </div>
                                              </div>
                                            </div>
                                            <br>
                                          </div>
                                        </div>
                                        <br>
                                        <div class="gmail_quote">
                                          <div dir="ltr" class="gmail_attr">On Fri,
                                            Jun 19, 2020 at 4:35 PM
                                            Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                                            wrote:<br>
                                          </div>
                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Try to use `omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.
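That is, something of this shape (a sketch only: the types and the surrounding loop are my guesses from the variable names; the actual suggestion is just the explicit firstprivate list on the inner directive):

    #include <omp.h>

    void evaluate_v(int first, int last, double *offload_scratch_iVP_ptr,
                    const double *psiinv_ptr) {
      const int n = last - first;
      #pragma omp target teams distribute map(to : psiinv_ptr[0 : n]) \
          map(from : offload_scratch_iVP_ptr[0 : n])
      for (int i = 0; i < n; ++i) {
        // Workaround: pass every captured scalar and pointer explicitly as
        // firstprivate so nothing has to be shared from the team master.
        #pragma omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, \
                                          psiinv_ptr, i)
        {
          // placeholder body: only thread 0 writes to avoid a race in this toy
          if (omp_get_thread_num() == 0)
            offload_scratch_iVP_ptr[i] = psiinv_ptr[i];
        }
      }
    }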
                                              <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                                              <div>16.06.2020 9:26 PM,
                                                Ye Luo via Openmp-dev
                                                пишет:<br>
                                              </div>
                                              <blockquote type="cite">
                                                <div dir="ltr">
                                                  <div>Hi all,</div>
                                                  <div>I got
                                                    libomptarget failure
                                                    when offloading the
                                                    same kernel function
                                                    from multiple host
                                                    threads. I reported
                                                    the bug as <a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>
                                                  <div><br>
                                                  </div>
                                                  <div>After a bit
                                                    investigation, I'm
                                                    able to narrow down
                                                    a bit the scope of
                                                    the issue.</div>
                                                  <div>The error from
                                                    CUDA runtime is "an
                                                    illegal memory
                                                    access was
                                                    encountered".</div>
                                                  <div><br>
                                                  </div>
                                                  <div>
                                                    <div>A) On the
                                                      libomptarget side<br>
                                                    </div>
                                                  </div>
                                                  <div>In
                                                    libomptarget/plugins/cuda/src/rtl.cpp,
                                                    I added a few
                                                    synchronization
                                                    before and after
                                                    cuLaunchKernel and a
                                                    mutex to protect the
                                                    kernel execution.<br>
                                                  </div>
                                                  <div><br>
                                                  </div>
    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                         /* gridDimZ */ 1, CudaThreadsPerBlock,
                         /* blockDimY */ 1, /* blockDimZ */ 1,
                         /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;
Without the mutex, the first error from all the threads is "Error after cuLaunchKernel". After enabling the mutex, the code runs well. If I move the mutex unlock() before the cuCtxSynchronize, the code still runs into the error. So I think the error comes from the kernel execution, not something else.

2) On my application side, I tried to see what triggers the error. Once I commented out the "omp parallel" in the hierarchical parallelism
(https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258),
multi-host-thread offloading starts to run without error.

I'd like to ask the wider community about potential causes of this bug.
Is there a thread-unsafe state machine associated with a kernel generated from a target region? Or any other potential reason for the error?

Best,
Ye