<div dir="ltr"><div>I cannot use -fopenmp-cuda-mode. I don't remember now exactly where (reduction?), but it breaks other things in my code.<br></div><div>Ye<br></div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:02 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>You can use shared memory directly by using #pragma omp allocate
      and allocate clauses. Plus, shared memory is actively used by the
      runtime. But I agree that it would be good to allow controlling
      the size. <br>
    </p>
    <p>BTW, you can disable this globalization mechanism with the
      -fopenmp-cuda-mode option. In this case, you need to control
      sharing of the variables yourself.<br>
    </p>
    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
    <div>25.06.2020 2:51 PM, Ye Luo wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div>I think the hard-coded size of 128 bytes is suboptimal. Can
          we allow users to select it?</div>
        <div>Since we cannot use shared memory directly, I believe there
          is plenty of shared memory available.</div>
        <div>It seems that my test code exceeds 128 bytes, and I'd
          like to set a higher limit.<br>
        </div>
        <div>Ye<br>
        </div>
        <div>
          <div>
            <div dir="ltr">
              <div dir="ltr">
                <div>
                  <div dir="ltr">===================<br>
                    Ye Luo, Ph.D.<br>
                    Computational Science Division & Leadership
                    Computing Facility<br>
                    Argonne National Laboratory</div>
                </div>
              </div>
            </div>
          </div>
          <br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 1:02
          PM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com" target="_blank">a.bataev@hotmail.com</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="auto">
            Globalized variables are variables that must be shared
            between the threads but are initially allocated by the
            master thread. On a GPU, local memory is private to each
            thread, so local variables cannot be shared between the
            threads. But OpenMP requires that such variables be shared
            between the threads. So, if we see that a local variable is
            allocated in the context of the master thread and later is
            shared in a parallel region (or potentially can be shared,
            if it may escape its declaration context by reference or by
            address), such variables are allocated either in shared
            or in global memory. If the total size of the globalized
            variables is less than 128 bytes, they are allocated in
            shared memory. Otherwise, we fall back to malloc. If
            -fno-openmp-cuda-parallel-target-region is used, then again,
            if the total size is &lt;= 128 bytes, shared memory is used;
            otherwise a statically allocated global memory buffer is used.<br>
            <br>
            <div dir="ltr">Best regards,
              <div>Alexey Bataev</div>
            </div>
            <div dir="ltr"><br>
              <blockquote type="cite">On Jun 25, 2020, at 13:45, Ye Luo
                <<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>>
                wrote:<br>
                <br>
              </blockquote>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div dir="ltr">
                  <div>Could you add documentation for this option?</div>
                  <div><span><tt>fopenmp-cuda-parallel-target-region</tt></span></div>
                  <div><span><tt>The name doesn't reflect the actual
                        effect.</tt></span></div>
                  <div><span><tt>The differential review mentions</tt></span></div>
                  <div><span><tt>```<br>
                      </tt></span></div>
                  <div><span><tt>Added support for dynamic memory
                        allocation for globalized variables in<br>
                        case if execution of target regions in parallel
                        is required.</tt></span></div>
                  <div><span><tt>```</tt></span></div>
                  <div><span><tt>So what are globalized variables?
                        When do they occur? Since this impacts
                        performance, application developers would like
                        to know more details.<br>
                      </tt></span></div>
                  <div><span><tt>```<br>
                      </tt></span></div>
                  <div><span><tt><span>
                          <p>It might use "slow" allocation functions,
                            in general, since it may use malloc on the
                            device side. Being disabled, it uses
                            statically preallocated memory, which might
                            be faster, if parallel target regions are
                            not required.</p>
                          <p>```</p>
                          In the slow code path, is malloc always called
                          on the device, or is malloc only called when a
                          size limit is exceeded? Is the call explicitly
                          generated by clang, or does the CUDA driver
                          handle it?<br>
                        </span></tt></span></div>
                  <div><span><tt><span>I saw some
                          <span>UseSharedMemory flags in the review. Does
                            the slow code path use CUDA shared memory,
                            falling back to global memory when the size
                            limit is exceeded?<br>
                          </span></span></tt></span></div>
                  <div><span><tt><span>In the fast code path, what is
                          the scope of this statically preallocated
                          memory? Is it owned by the CUDA plugin?</span></tt></span></div>
                  <div><span><tt><span><br>
                        </span></tt></span></div>
                  <div><span><tt><span>Thanks,<br>
                        </span></tt></span></div>
                  <div><span><tt><span>Ye<br>
                        </span></tt></span></div>
                  <div>
                    <div>
                      <div dir="ltr">
                        <div dir="ltr">
                          <div>
                            <div dir="ltr">===================<br>
                              Ye Luo, Ph.D.<br>
                              Computational Science Division &
                              Leadership Computing Facility<br>
                              Argonne National Laboratory</div>
                          </div>
                        </div>
                      </div>
                    </div>
                    <br>
                  </div>
                </div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020
                    at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                    <div>
                      <p>Hi Ye, the patch is committed, you can check if
                        the compiler works for you.<br>
                      </p>
                      <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                      <div>22.06.2020 1:29 PM, Ye Luo via Openmp-dev
                        wrote:<br>
                      </div>
                      <blockquote type="cite">
                        <div dir="ltr">It is working well now. I tested
                          both the miniapp and the full application.<br clear="all">
                          <div>
                            <div>
                              <div dir="ltr">
                                <div dir="ltr">
                                  <div>
                                    <div dir="ltr">===================<br>
                                      Ye Luo, Ph.D.<br>
                                      Computational Science Division
                                      & Leadership Computing
                                      Facility<br>
                                      Argonne National Laboratory</div>
                                  </div>
                                </div>
                              </div>
                            </div>
                            <br>
                          </div>
                        </div>
                        <br>
                        <div class="gmail_quote">
                          <div dir="ltr" class="gmail_attr">On Mon, Jun
                            22, 2020 at 11:56 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                            wrote:<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div>
                              <p>Hi Ye, could you try your code with the
                                attached patch for clang? <br>
                              </p>
                              <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                              <div>19.06.2020 6:45 PM, Ye Luo via
                                Openmp-dev wrote:<br>
                              </div>
                              <blockquote type="cite">
                                <div dir="ltr">
                                  <div>Yes. It makes the code work.</div>
                                  <div>Do you have any idea about the
                                    code pattern?<br>
                                  </div>
                                  <div>Before the compiler fix gets in,
                                    I can put the workaround in the real
                                    code, which has multiple places
                                    similar to the miniapp.</div>
                                  <div>Best,<br>
                                  </div>
                                  <div>Ye<br>
                                  </div>
                                  <div>
                                    <div>
                                      <div dir="ltr">
                                        <div dir="ltr">
                                          <div>
                                            <div dir="ltr">===================<br>
                                              Ye Luo, Ph.D.<br>
                                              Computational Science
                                              Division & Leadership
                                              Computing Facility<br>
                                              Argonne National
                                              Laboratory</div>
                                          </div>
                                        </div>
                                      </div>
                                    </div>
                                    <br>
                                  </div>
                                </div>
                                <br>
                                <div class="gmail_quote">
                                  <div dir="ltr" class="gmail_attr">On
                                    Fri, Jun 19, 2020 at 4:35 PM
                                    Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                                    wrote:<br>
                                  </div>
                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                    <div>
                                      <p>Try to use `omp parallel
                                        firstprivate(first, last,
                                        offload_scratch_iVP_ptr,
                                        psiinv_ptr, i)` as the inner
                                        parallel directive.<br>
                                      </p>
                                      <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                                      <div>16.06.2020 9:26 PM, Ye Luo
                                        via Openmp-dev wrote:<br>
                                      </div>
                                      <blockquote type="cite">
                                        <div dir="ltr">
                                          <div>Hi all,</div>
                                          <div>I got a libomptarget
                                            failure when offloading the
                                            same kernel function from
                                            multiple host threads. I
                                            reported the bug as
                                            <a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>
                                          <div><br>
                                          </div>
                                          <div>After a bit of
                                            investigation, I was able to
                                            narrow down the scope of the
                                            issue.</div>
                                          <div>The error from CUDA
                                            runtime is "an illegal
                                            memory access was
                                            encountered".</div>
                                          <div><br>
                                          </div>
                                          <div>
                                            <div>A) On the libomptarget
                                              side<br>
                                            </div>
                                          </div>
                                          <div>In
                                            libomptarget/plugins/cuda/src/rtl.cpp,
                                            I added a few
                                            synchronization calls before
                                            and after cuLaunchKernel and
                                            a mutex to protect the
                                            kernel execution.<br>
                                          </div>
                                          <div><br>
                                          </div>
                                          <div>
                                          <pre>    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                         /* gridDimZ */ 1, CudaThreadsPerBlock,
                         /* blockDimY */ 1, /* blockDimZ */ 1,
                         /* sharedMemBytes */ 0, Stream, &amp;Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;</pre>
                                          </div>
                                          <div><br>
                                          </div>
                                          <div>Without the mutex, the
                                            first error from all the
                                            threads is "Error after
                                            cuLaunchKernel". After
                                            enabling the mutex, the code
                                            runs well. If I move the
                                            mutex unlock() before the
                                            cuCtxSynchronize, the code
                                            still runs into the error.
                                            So I think the error comes
                                            from the kernel execution,
                                            not something else.</div>
                                          <div><br>
                                          </div>
                                          <div>B) On my application
                                            side, I tried to see what
                                            triggers the error. Once I
                                            commented out "omp parallel"
                                            in the hierarchical
                                            parallelism at
                                            <a href="https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258" target="_blank">
https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258</a>,<br>
                                          </div>
                                          <div>offloading from multiple
                                            host threads started to run
                                            without error.</div>
                                          <div><br>
                                          </div>
                                          <div>I'd like to ask the wider
                                            community to help identify
                                            potential causes of this
                                            bug.<br>
                                          </div>
                                          <div>Is there a thread-unsafe
                                            state machine associated
                                            with a kernel generated from
                                            a target region? Or any
                                            other potential reason for
                                            the error?</div>
                                          <div><br>
                                          </div>
                                          <div>Best,</div>
                                          <div>Ye<br>
                                          </div>
                                          <div>
                                            <div dir="ltr">
                                              <div dir="ltr">
                                                <div>
                                                  <div dir="ltr">===================<br>
                                                    Ye Luo, Ph.D.<br>
                                                    Computational
                                                    Science Division
                                                    & Leadership
                                                    Computing Facility<br>
                                                    Argonne National
                                                    Laboratory</div>
                                                </div>
                                              </div>
                                            </div>
                                          </div>
                                        </div>
                                        <br>
                                        <fieldset></fieldset>
                                        <pre>_______________________________________________
Openmp-dev mailing list
<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
                                      </blockquote>
                                    </div>
                                  </blockquote>
                                </div>
                                <br>
                              </blockquote>
                            </div>
                          </blockquote>
                        </div>
                        <br>
                      </blockquote>
                    </div>
                  </blockquote>
                </div>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </div>

</blockquote></div>