<div dir="ltr"><div>Thank you. I will try that.</div><div>Ye<br></div><div><div><div dir="ltr" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:16 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  int a;
  #pragma omp allocate(a) allocator(omp_pteam_mem_alloc)

or, for privatized variables:

  #pragma omp ... private(a) allocate(omp_pteam_mem_alloc: a)

But you should be very careful here too: it might lead to a data race. The variables are still statically allocated, so to avoid a data race you will probably need to create an array and allocate memory for the array of the variables.
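To make the shape of this concrete, here is a minimal sketch (my own illustration, not from the thread: the function, sizes, and names are invented; only the use of the allocate directive with omp_pteam_mem_alloc mirrors the lines above). The scratch array has one slot per thread, so the statically allocated team-shared storage is not raced on:

    #include <omp.h>

    void team_scratch_demo(double *out, int nteams) {
      #pragma omp target teams num_teams(nteams) map(from: out[0:nteams])
      {
        // Request placement of the team-local scratch array via the
        // predefined omp_pteam_mem_alloc allocator (team shared memory).
        double scratch[64];
        #pragma omp allocate(scratch) allocator(omp_pteam_mem_alloc)

        #pragma omp parallel num_threads(64)
        {
          int tid = omp_get_thread_num();
          scratch[tid] = 2.0 * tid;   // each thread writes its own slot -> no race
        }
        out[omp_get_team_num()] = scratch[0];
      }
    }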
    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
    <div>25.06.2020 3:10 PM, Ye Luo пишет:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div>Do you have an example using shared memory via #pragma omp
          allocate?</div>
        <div>Ye<br>
        </div>
On Thu, Jun 25, 2020 at 2:09 PM Alexey.Bataev <a.bataev@outlook.com> wrote:
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Yes, reduction variables also can be candidates for
              globalization if the reduction variable must be shared in
              parallel region. You can try to control it manually using
              allocate clause.<br>
            </p>
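For instance, something along these lines (my own sketch, assuming OpenMP 5.0 allocate-clause support in the compiler; the function and variable names are invented):

    #include <omp.h>

    double device_sum(const double *x, int n) {
      double sum = 0.0;
      // The allocate clause asks for the privatized reduction copies of 'sum'
      // to be placed via the predefined omp_pteam_mem_alloc allocator.
      #pragma omp target teams distribute parallel for \
          map(to: x[0:n]) map(tofrom: sum) \
          reduction(+ : sum) allocate(omp_pteam_mem_alloc : sum)
      for (int i = 0; i < n; ++i)
        sum += x[i];
      return sum;
    }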
            <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
            <div>25.06.2020 3:05 PM, Ye Luo пишет:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>I cannot use openmp-cuda-mode. I don't remember now
                  exactly where (reduction?) but it breaks other things
                  in my code.<br>
                </div>
                <div>Ye<br>
                </div>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020
                  at 2:02 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>You can use shared memory directly by using
                      #pragma omp allocate and allocate clauses. Plus,
                      shared memory is actively used by the runtime. But
                      I agree, that it would be good to allow to control
                      the size. <br>
                    </p>
                    <p>BTW, you can disable this globalization mechanism
                      by using -fopenmp-cuda-mode option. In this case,
                      you need to control sharing the variables
                      yourself.<br>
                    </p>
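For reference, a typical offload compile line with that flag added might look like this (an assumption of a common NVPTX setup; app.cpp is a placeholder):

    clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -fopenmp-cuda-mode app.cpp -o app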
                    <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                    <div>25.06.2020 2:51 PM, Ye Luo пишет:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>I think hard-coded size 128 is suboptimal.
                          Can we allow users to select that?</div>
                        <div>Since we cannot use shared memory directly,
                          I believe there are plenty of shared memory
                          available.</div>
                        <div>It seems that my test code is exceeding 128
                          byte and I'd like to set a higher limit.<br>
                        </div>
                        <div>Ye<br>
                        </div>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Thu, Jun
                          25, 2020 at 1:02 PM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com" target="_blank">a.bataev@hotmail.com</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                          <div dir="auto"> Globalized variables are the
                            variables that must be shared between the
                            threads but initially allocated by the
                            master thread. On GPU, local memory is
                            private for threads, and the local variables
                            cannkt be shared between the threads. But
                            OpenMP requires that such variables must be
                            shared between the threads. So, if we see,
                            that the local variable is allocated in the
                            context of the master thread and later is
                            shared in parallel region (or potentially
                            can be shared, if it may escape its
                            declaration context by the reference or by
                            the address), such variables are allocated
                            either in shared or in global memory. If the
                            total size of globalized variables is less
                            than 128 bytes, they are allocated in shared
                            memory. Otherwise, fallback to malloc. If
                            -fno-penmp-cuda-parallel-target-region is
                            used, then again, if total size <= 128
                            bytes - shared memory is used, otherwise
                            statically allocated global memory buffer is
                            used.<br>
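(An illustrative sketch of the pattern being described; my own example, not from the mail. 'tmp' is declared in the sequential part of the team but then shared by the enclosed parallel region, so the compiler has to globalize it:)

    void globalization_example() {
      #pragma omp target teams
      {
        double tmp[8];               // 64 bytes, allocated by the team master
        #pragma omp parallel for     // tmp is shared by the team's threads, so it is
        for (int i = 0; i < 8; ++i)  // globalized: total < 128 bytes -> shared memory,
          tmp[i] = i;                // larger -> malloc / static global buffer
      }
    }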
Best regards,
Alexey Bataev

On 25 June 2020, at 13:45, Ye Luo <xw111luoye@gmail.com> wrote:
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div dir="ltr">
                                  <div>Could you add documentation for
                                    this option?</div>
                                  <div><span><tt>fopenmp-cuda-parallel-target-region</tt></span></div>
                                  <div><span><tt>The name doesn't
                                        reflect the actual effect.</tt></span></div>
                                  <div><span><tt>The differential review
                                        mentions</tt></span></div>
                                  <div><span><tt>```<br>
                                      </tt></span></div>
                                  <div><span><tt>Added support for
                                        dynamic memory allocation for
                                        globalized variables in<br>
                                        case if execution of target
                                        regions in parallel is required.</tt></span></div>
                                  <div><span><tt>```</tt></span></div>
                                  <div><span><tt>So what are the
                                        globalized variables? When do
                                        they occur? since it impacts
                                        performance, application
                                        developers desire to know more
                                        details.<br>
                                      </tt></span></div>
                                  <div><span><tt>```<br>
                                      </tt></span></div>
                                  <div><span><tt><span>
                                          <p>It might use "slow"
                                            allocation functions, in
                                            general, since it may use
                                            malloc on the device side.
                                            Being disabled, it uses
                                            statically preallocated
                                            memory, which might be
                                            faster, if parallel target
                                            regions are not required.</p>
                                          <p>```</p>
                                          In the slow code path, does it
                                          always call malloc in the
                                          device or malloc gets called
                                          when exceeding a size limit?
                                          Is the call explicitly
                                          generated by clang or the cuda
                                          driver handles it?<br>
                                        </span></tt></span></div>
                                  <div><span><tt><span>I saw some <span>UseSharedMemory
                                            flags in the review. Is the
                                            slow code path uses CUDA
                                            shared memory and it may
                                            need global memory when the
                                            size limit is exceeded?<br>
                                          </span></span></tt></span></div>
                                  <div><span><tt><span>In the fast code
                                          path, what is the scope of
                                          this statically preallocated
                                          memory? Is it owned by the
                                          CUDA plugin?</span></tt></span></div>
                                  <div><span><tt><span><br>
                                        </span></tt></span></div>
                                  <div><span><tt><span>Thanks,<br>
                                        </span></tt></span></div>
                                  <div><span><tt><span>Ye<br>
                                        </span></tt></span></div>
On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <a.bataev@outlook.com> wrote:
                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                    <div>
                                      <p>Hi Ye, the patch is committed,
                                        you can check if the compiler
                                        works for you.<br>
                                      </p>
                                      <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                                      <div>22.06.2020 1:29 PM, Ye Luo
                                        via Openmp-dev пишет:<br>
                                      </div>
                                      <blockquote type="cite">
                                        <div dir="ltr">It is working
                                          well now. I tested both the
                                          miniapp and the full
                                          application.<br clear="all">
                                        <div class="gmail_quote">
                                          <div dir="ltr" class="gmail_attr">On Mon,
                                            Jun 22, 2020 at 11:56 AM
                                            Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                                            wrote:<br>
                                          </div>
                                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                            <div>
                                              <p>Hi Ye, could you try
                                                your code with the
                                                attached patch for the
                                                clang? <br>
                                              </p>
                                              <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                                              <div>19.06.2020 6:45 PM,
                                                Ye Luo via Openmp-dev
                                                пишет:<br>
                                              </div>
                                              <blockquote type="cite">
                                                <div dir="ltr">
                                                  <div>Yes. It makes the
                                                    code working.</div>
                                                  <div>Do you have any
                                                    idea about the code
                                                    pattern?<br>
                                                  </div>
                                                  <div>Before the
                                                    compiler fix gets
                                                    in, I can put the
                                                    workaround in the
                                                    real code which has
                                                    multiple places
                                                    similar to the
                                                    miniapp.</div>
                                                  <div>Best,<br>
                                                  </div>
                                                  <div>Ye<br>
                                                  </div>
                                                <div class="gmail_quote">
                                                  <div dir="ltr" class="gmail_attr">On
                                                    Fri, Jun 19, 2020 at
                                                    4:35 PM
                                                    Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
                                                    wrote:<br>
                                                  </div>
                                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                                    <div>
                                                      <p>Try to use `omp
                                                        parallel
                                                        firstprivate(first,
                                                        last,
                                                        offload_scratch_iVP_ptr,
                                                        psiinv_ptr, i)`
                                                        as the inner
                                                        parallel
                                                        directive.<br>
                                                      </p>
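(To show the shape of the suggested workaround, a made-up sketch; this is not the miniqmc code, and only the firstprivate list comes from the suggestion above. Listing the captured scalars and pointers as firstprivate gives every thread its own copy, so they no longer have to be shared/globalized from the master thread:)

    #include <omp.h>

    void workaround_sketch(double *out) {
      #pragma omp target teams num_teams(1) map(from: out[0:16])
      {
        int first = 0, last = 16, i = 1;
        double psiinv[16], scratch[16];
        for (int j = first; j < last; ++j) psiinv[j] = j;
        double *offload_scratch_iVP_ptr = scratch;
        double *psiinv_ptr = psiinv;
        // inner parallel region: captured locals are firstprivate, not shared
        #pragma omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, \
                                          psiinv_ptr, i)
        {
          int tid = omp_get_thread_num();
          if (tid >= first && tid < last)
            offload_scratch_iVP_ptr[tid] = psiinv_ptr[tid] + i;  // own slot: no race
        }
        for (int j = first; j < last; ++j) out[j] = scratch[j];
      }
    }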
                                                      <pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
                                                      <div>16.06.2020
                                                        9:26 PM, Ye Luo
                                                        via Openmp-dev
                                                        пишет:<br>
                                                      </div>
                                                      <blockquote type="cite">
                                                        <div dir="ltr">
                                                          <div>Hi all,</div>
                                                          <div>I got
                                                          libomptarget
                                                          failure when
                                                          offloading the
                                                          same kernel
                                                          function from
                                                          multiple host
                                                          threads. I
                                                          reported the
                                                          bug as <a href="https://bugs.llvm.org/show_bug.cgi?id=46257" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=46257</a>.</div>
                                                          <div><br>
                                                          </div>
                                                          <div>After a
                                                          bit
                                                          investigation,
                                                          I'm able to
                                                          narrow down a
                                                          bit the scope
                                                          of the issue.</div>
                                                          <div>The error
                                                          from CUDA
                                                          runtime is "an
                                                          illegal memory
                                                          access was
                                                          encountered".</div>
                                                          <div><br>
                                                          </div>
1) On the libomptarget side

In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization before and after cuLaunchKernel and a mutex to protect the kernel execution:
    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                         /* gridDimZ */ 1, CudaThreadsPerBlock,
                         /* blockDimY */ 1, /* blockDimZ */ 1,
                         /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;
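(The declaration of kernelmtx is not shown in the snippet; presumably it is something like the following at file scope in rtl.cpp — an assumption on my part:)

    #include <mutex>
    static std::mutex kernelmtx;  // serializes kernel launch + synchronize across host threads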
Without the mutex, the first error from all the threads is "Error after cuLaunchKernel". With the mutex enabled, the code runs well. If I move the mutex unlock() before the cuCtxSynchronize, the code still runs into the error. So I think the error comes from the kernel execution, not from something else.
2) On my application side

I tried to see what triggers the error. Once I commented out the inner "omp parallel" in the hierarchical parallelism (https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258), offloading from multiple host threads started to run without error.
I'd like to ask the wider community about potential causes of this bug. Is there a thread-unsafe state machine associated with a kernel generated from a target region? Or is there any other potential reason for the error?

Best,
Ye
_______________________________________________
Openmp-dev mailing list
Openmp-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev