<div dir="ltr"><div>I cannot use -fopenmp-cuda-mode. I don't remember now exactly where (reduction?), but it breaks other things in my code.<br></div><div>Ye<br></div><div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:02 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>You can use shared memory directly by using the #pragma omp allocate
directive and allocate clauses. Plus, shared memory is actively used by
the runtime. But I agree that it would be good to allow controlling
the size. <br>
</p>
<p>BTW, you can disable this globalization mechanism by using the
-fopenmp-cuda-mode option. In this case, you need to control
sharing of the variables yourself.<br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>25.06.2020 2:51 PM, Ye Luo wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I think the hard-coded size of 128 bytes is suboptimal. Can we
allow users to select it?</div>
<div>Since we cannot use shared memory directly, I believe there
is plenty of shared memory available.</div>
<div>It seems that my test code exceeds 128 bytes and I'd
like to set a higher limit.<br>
</div>
<div>Ye<br>
</div>
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science Division & Leadership
Computing Facility<br>
Argonne National Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 1:02
PM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com" target="_blank">a.bataev@hotmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto">
Globalized variables are variables that must be shared
between the threads but are initially allocated by the master
thread. On a GPU, local memory is private to a thread, so
local variables cannot be shared between the threads. But
OpenMP requires that such variables be shared between the
threads. So, if we see that a local variable is allocated in
the context of the master thread and later is shared in a
parallel region (or potentially can be shared, if it may
escape its declaration context by reference or by address),
such variables are allocated either in shared or in global
memory. If the total size of the globalized variables is
less than 128 bytes, they are allocated in shared memory;
otherwise, we fall back to malloc. If
-fno-openmp-cuda-parallel-target-region is used, then again,
if the total size is &lt;= 128 bytes, shared memory is used;
otherwise, a statically allocated global memory buffer is used.<br>
<br>
<div dir="ltr">Best regards,
<div>Alexey Bataev</div>
</div>
<div dir="ltr"><br>
<blockquote type="cite">On June 25, 2020, at 13:45, Ye Luo
<<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>>
wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div>Could you add documentation for this option?</div>
<div><span><tt>fopenmp-cuda-parallel-target-region</tt></span></div>
<div><span><tt>The name doesn't reflect the actual
effect.</tt></span></div>
<div><span><tt>The differential review mentions</tt></span></div>
<div><span><tt>```<br>
</tt></span></div>
<div><span><tt>Added support for dynamic memory
allocation for globalized variables in<br>
case if execution of target regions in parallel
is required.</tt></span></div>
<div><span><tt>```</tt></span></div>
<div><span><tt>So what are globalized variables?
When do they occur? Since this impacts
performance, application developers
would like to know more details.<br>
</tt></span></div>
<div><span><tt>```<br>
</tt></span></div>
<div><span><tt><span>
<p>It might use "slow" allocation functions,
in general, since it may use malloc on the
device side. Being disabled, it uses
statically preallocated memory, which might
be faster, if parallel target regions are
not required.</p>
<p>```</p>
In the slow code path, does it always
call malloc on the device, or does
malloc get called only when a size
limit is exceeded? Is the call
explicitly generated by clang, or
does the CUDA driver handle it?<br>
</span></tt></span></div>
<div><span><tt><span>I saw some
<span>UseSharedMemory flags in the review.
Does the slow code path use CUDA
shared memory, falling back to global
memory when the size limit is exceeded?<br>
</span></span></tt></span></div>
<div><span><tt><span>In the fast code path, what is
the scope of this statically preallocated
memory? Is it owned by the CUDA plugin?</span></tt></span></div>
<div><span><tt><span><br>
</span></tt></span></div>
<div><span><tt><span>Thanks,<br>
</span></tt></span></div>
<div><span><tt><span>Ye<br>
</span></tt></span></div>
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science Division &
Leadership Computing Facility<br>
Argonne National Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020
at 9:44 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Ye, the patch is committed, you can check if
the compiler works for you.<br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>22.06.2020 1:29 PM, Ye Luo via Openmp-dev
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">It is working well now. I tested
both the miniapp and the full application.<br clear="all">
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science Division
& Leadership Computing
Facility<br>
Argonne National Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jun
22, 2020 at 11:56 AM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Ye, could you try your code with the
attached patch for the clang? <br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>19.06.2020 6:45 PM, Ye Luo via
Openmp-dev wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Yes. It makes the code working.</div>
<div>Do you have any idea about the
code pattern?<br>
</div>
<div>Before the compiler fix gets in,
I can put the workaround in the real
code which has multiple places
similar to the miniapp.</div>
<div>Best,<br>
</div>
<div>Ye<br>
</div>
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science
Division & Leadership
Computing Facility<br>
Argonne National
Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Fri, Jun 19, 2020 at 4:35 PM
Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Try to use `omp parallel
firstprivate(first, last,
offload_scratch_iVP_ptr,
psiinv_ptr, i)` as the inner
parallel directive.<br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>16.06.2020 9:26 PM, Ye Luo
via Openmp-dev wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi all,</div>
<div>I got libomptarget
failure when offloading the
same kernel function from
multiple host threads. I
reported the bug as
<a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>
<div><br>
</div>
<div>After a bit of
investigation, I was able to
narrow down the scope
of the issue.</div>
<div>The error from CUDA
runtime is "an illegal
memory access was
encountered".</div>
<div><br>
</div>
<div>
<div>A) On the libomptarget
side<br>
</div>
</div>
<div>In
libomptarget/plugins/cuda/src/rtl.cpp,
I added synchronization calls
before and after
cuLaunchKernel and a mutex to
protect the kernel
execution.<br>
</div>
<div><br>
</div>
<div>  Err = cuCtxSynchronize();<br>
  if (!checkResult(Err, "Error before cuLaunchKernel\n"))<br>
    return OFFLOAD_FAIL;<br>
<br>
  //kernelmtx.lock();<br>
  Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,<br>
                       /* gridDimZ */ 1, CudaThreadsPerBlock,<br>
                       /* blockDimY */ 1, /* blockDimZ */ 1,<br>
                       /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);<br>
  if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))<br>
    return OFFLOAD_FAIL;<br>
<br>
  Err = cuCtxSynchronize();<br>
  //kernelmtx.unlock();<br>
  if (!checkResult(Err, "Error after cuLaunchKernel\n"))<br>
    return OFFLOAD_FAIL;</div>
<div><br>
</div>
<div>Without the mutex, the
first error from all the
threads is "Error after
cuLaunchKernel". After
enabling the mutex, the code
runs well. If I move the
mutex unlock() before the
cuCtxSynchronize, the code
still runs into the error. So
I think the error comes from
the kernel execution, not
something else.</div>
<div><br>
</div>
<div>B) On my application
side, I tried to see what
triggers the error. Once I
commented out "omp parallel"
in the hierarchical
parallelism:
<a href="https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258" target="_blank">
https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258</a><br>
</div>
<div>Offloading from multiple
host threads starts to run
without error.</div>
<div><br>
</div>
<div>I'd like to ask the wider
community to help identify
potential causes of this bug.<br>
</div>
<div>Is there a thread-unsafe
state machine associated with
a kernel generated from a
target region? Or any other
potential reason for the
error?</div>
<div><br>
</div>
<div>Best,</div>
<div>Ye<br>
</div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational
Science Division
& Leadership
Computing Facility<br>
Argonne National
Laboratory</div>
</div>
</div>
</div>
</div>
</div>
<br>
<fieldset></fieldset>
<pre>_______________________________________________
Openmp-dev mailing list
<a href="mailto:Openmp-dev@lists.llvm.org" target="_blank">Openmp-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev</a>
</pre>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>