<div dir="ltr"><div>Do you have an example using shared memory via #pragma omp allocate?</div><div>Ye<br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>Computational Science Division & Leadership Computing Facility<br>
Argonne National Laboratory</div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 25, 2020 at 2:09 PM Alexey.Bataev <<a href="mailto:a.bataev@outlook.com">a.bataev@outlook.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Yes, reduction variables can also be candidates for globalization if the reduction variable must be shared in the parallel region. You can try to control this manually using the allocate clause.
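(As a rough illustration only, not code from this thread: one way such manual control might look, assuming the implementation honors the predefined omp_pteam_mem_alloc allocator in device code; the function and variable names below are invented.)

```
#include <omp.h>

// Invented example: request team-shared storage for the privatized copies of
// the reduction variable via the allocate clause (OpenMP 5.0 syntax).
double device_sum(const double *x, int n) {
  double sum = 0.0;
  #pragma omp target map(to : x[:n]) map(tofrom : sum)
  {
    #pragma omp parallel for reduction(+ : sum) \
        allocate(omp_pteam_mem_alloc : sum)
    for (int i = 0; i < n; ++i)
      sum += x[i];
  }
  return sum;
}
```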
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>25.06.2020 3:05 PM, Ye Luo пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
I cannot use -fopenmp-cuda-mode. I don't remember exactly where now (the reduction?), but it breaks other things in my code.

Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 2:02 PM Alexey.Bataev <a.bataev@outlook.com> wrote:
You can use shared memory directly by using #pragma omp allocate and allocate clauses. Plus, shared memory is actively used by the runtime. But I agree that it would be good to allow controlling the size.
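(A minimal sketch of that kind of usage, with invented names; omp_pteam_mem_alloc is the predefined team-shared allocator, and whether a given implementation actually backs it with CUDA shared memory is implementation-defined.)

```
#include <omp.h>

// Invented example: a scratch buffer declared by the master thread of the
// target region, placed with the allocate directive and then shared with the
// parallel region.
void fill(float *out, int n) {
  #pragma omp target map(from : out[:n])
  {
    float scratch[32];                      // master-thread local
    #pragma omp allocate(scratch) allocator(omp_pteam_mem_alloc)

    for (int i = 0; i < 32; ++i)            // initialized before the parallel region
      scratch[i] = (float)i;

    #pragma omp parallel for                // scratch is shared with the team here
    for (int i = 0; i < n; ++i)
      out[i] = scratch[i % 32];
  }
}
```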
BTW, you can disable this globalization mechanism by using the -fopenmp-cuda-mode option. In this case, you need to control sharing of the variables yourself.

-------------
Best regards,
Alexey Bataev
On 25.06.2020 2:51 PM, Ye Luo wrote:
I think the hard-coded size of 128 is suboptimal. Can we allow users to select it?
Since we cannot use shared memory directly, I believe there is plenty of shared memory available.
It seems that my test code exceeds 128 bytes, and I'd like to set a higher limit.

Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 1:02 PM Alexey Bataev <a.bataev@hotmail.com> wrote:
<div dir="auto"> Globalized variables are the
variables that must be shared between the threads
but initially allocated by the master thread. On
GPU, local memory is private for threads, and the
local variables cannkt be shared between the
threads. But OpenMP requires that such variables
must be shared between the threads. So, if we see,
that the local variable is allocated in the context
of the master thread and later is shared in parallel
region (or potentially can be shared, if it may
escape its declaration context by the reference or
by the address), such variables are allocated either
in shared or in global memory. If the total size of
globalized variables is less than 128 bytes, they
are allocated in shared memory. Otherwise, fallback
to malloc. If -fno-penmp-cuda-parallel-target-region
is used, then again, if total size <= 128 bytes -
shared memory is used, otherwise statically
allocated global memory buffer is used.<br>
<br>
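(For readers following along, a minimal sketch of the pattern being described, with invented names: a local variable created by the master thread of the target region and then referenced in the inner parallel region.)

```
// Invented example: 'buf' is created by the master thread of the target
// region and then read inside the parallel region, so the compiler must
// "globalize" it, i.e. move it out of thread-private local memory into shared
// or global memory. Its 64 bytes fall under the 128-byte threshold described
// above, so it would be expected to land in shared memory.
void globalization_pattern(double *out, int n) {
  #pragma omp target map(from : out[:n])
  {
    double buf[8];
    for (int i = 0; i < 8; ++i)
      buf[i] = 0.5 * i;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      out[i] = buf[i % 8];
  }
}
```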
<div dir="ltr">Best regards,
<div>Alexey Bataev</div>
</div>
<div dir="ltr"><br>
<blockquote type="cite">25 июня 2020 г., в 13:45,
Ye Luo <<a href="mailto:xw111luoye@gmail.com" target="_blank">xw111luoye@gmail.com</a>>
написал(а):<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
Could you add documentation for this option?
fopenmp-cuda-parallel-target-region
The name doesn't reflect the actual effect.
The differential review mentions
```
Added support for dynamic memory allocation for globalized variables in
case if execution of target regions in parallel is required.
```
So what are the globalized variables? When do they occur? Since this impacts performance, application developers want to know more details.
```
It might use "slow" allocation functions, in general, since it may use malloc on the device side. Being disabled, it uses statically preallocated memory, which might be faster, if parallel target regions are not required.
```
In the slow code path, does it always call malloc on the device, or is malloc called only when a size limit is exceeded? Is the call explicitly generated by clang, or does the CUDA driver handle it?
I saw some UseSharedMemory flags in the review. Does the slow code path use CUDA shared memory, and does it need global memory when the size limit is exceeded?
In the fast code path, what is the scope of this statically preallocated memory? Is it owned by the CUDA plugin?

Thanks,
Ye
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Thu, Jun 25, 2020 at 9:44 AM Alexey.Bataev <a.bataev@outlook.com> wrote:
Hi Ye, the patch is committed, you can check if the compiler works for you.

-------------
Best regards,
Alexey Bataev

On 22.06.2020 1:29 PM, Ye Luo via Openmp-dev wrote:
<blockquote type="cite">
<div dir="ltr">It is working well now. I
tested both the miniapp and the full
application.<br clear="all">
<div>
<div>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">===================<br>
Ye Luo, Ph.D.<br>
Computational Science
Division & Leadership
Computing Facility<br>
Argonne National
Laboratory</div>
</div>
</div>
</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Mon, Jun 22, 2020 at 11:56 AM
Alexey.Bataev <<a href="mailto:a.bataev@outlook.com" target="_blank">a.bataev@outlook.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Ye, could you try your code
with the attached patch for the
clang? <br>
</p>
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>19.06.2020 6:45 PM, Ye Luo
via Openmp-dev пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Yes. It makes the code
working.</div>
<div>Do you have any idea
about the code pattern?<br>
</div>
<div>Before the compiler fix
gets in, I can put the
workaround in the real code
which has multiple places
similar to the miniapp.</div>
<div>Best,<br>
</div>
<div>Ye<br>
</div>
===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory

On Fri, Jun 19, 2020 at 4:35 PM Alexey.Bataev <a.bataev@outlook.com> wrote:
Try to use `omp parallel firstprivate(first, last, offload_scratch_iVP_ptr, psiinv_ptr, i)` as the inner parallel directive.
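(A simplified sketch of the suggested workaround, with invented stand-in names rather than the actual miniqmc variables: the values needed by the inner parallel region are passed by value via firstprivate, so nothing has to be shared.)

```
// Invented, simplified shape of the workaround (not the miniqmc code).
void kernel(float *scratch, const float *psiinv, int nblocks, int blocksize) {
  #pragma omp target teams distribute \
      map(tofrom : scratch[:nblocks * blocksize]) \
      map(to : psiinv[:nblocks * blocksize])
  for (int ib = 0; ib < nblocks; ++ib) {
    int first = ib * blocksize;
    int last = first + blocksize;
    float *scratch_ptr = scratch;           // stand-in for offload_scratch_iVP_ptr
    const float *psiinv_ptr = psiinv;       // stand-in for psiinv_ptr

    // Private copies are made for each device thread, so the compiler does
    // not have to share/globalize these variables.
    #pragma omp parallel firstprivate(first, last, scratch_ptr, psiinv_ptr)
    {
      #pragma omp for
      for (int j = first; j < last; ++j)
        scratch_ptr[j] += psiinv_ptr[j];
    }
  }
}
```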
<pre cols="72">-------------
Best regards,
Alexey Bataev</pre>
<div>16.06.2020 9:26 PM,
Ye Luo via Openmp-dev
пишет:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
Hi all,
I got a libomptarget failure when offloading the same kernel function from multiple host threads. I reported the bug as https://bugs.llvm.org/show_bug.cgi?id=46257.

After a bit of investigation, I was able to narrow down the scope of the issue.
The error from the CUDA runtime is "an illegal memory access was encountered".

1) On the libomptarget side
In libomptarget/plugins/cuda/src/rtl.cpp, I added synchronization before and after cuLaunchKernel and a mutex to protect the kernel execution.
    Err = cuCtxSynchronize();
    if (!checkResult(Err, "Error before cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    //kernelmtx.lock();
    Err = cuLaunchKernel(KernelInfo->Func, CudaBlocksPerGrid, /* gridDimY */ 1,
                         /* gridDimZ */ 1, CudaThreadsPerBlock,
                         /* blockDimY */ 1, /* blockDimZ */ 1,
                         /* sharedMemBytes */ 0, Stream, &Args[0], nullptr);
    if (!checkResult(Err, "Error returned from cuLaunchKernel\n"))
      return OFFLOAD_FAIL;

    Err = cuCtxSynchronize();
    //kernelmtx.unlock();
    if (!checkResult(Err, "Error after cuLaunchKernel\n"))
      return OFFLOAD_FAIL;
Without the mutex, the first error from all the threads is "Error after cuLaunchKernel". After enabling the mutex, the code runs well. If I move the mutex unlock() before the cuCtxSynchronize, the code still runs into the error. So I think the error comes from the kernel execution, not from something else.

2) On my application side, I tried to see what triggers the error. Once I commented out the "omp parallel" in the hierarchical parallelism at https://github.com/QMCPACK/miniqmc/blob/5a11c4131dbf91bf9f9977057a03aee485368f0d/src/QMCWaveFunctions/einspline_spo_omp.cpp#L258, offloading from multiple host threads started to run without error.
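(A hypothetical, stripped-down reduction of this pattern, not the actual miniqmc code: several host threads concurrently offload the same target region, which itself contains a nested omp parallel.)

```
#include <omp.h>

// Invented reduction of the failing pattern described above.
void evaluate(float *w, int n) {
  #pragma omp target teams distribute map(tofrom : w[:n])
  for (int block = 0; block < n; block += 32) {
    #pragma omp parallel for   // commenting this out made the runs succeed
    for (int i = block; i < block + 32 && i < n; ++i)
      w[i] += 1.0f;
  }
}

void driver(float **walkers, int nwalkers, int n) {
  #pragma omp parallel for     // host threads, each offloading the same kernel
  for (int iw = 0; iw < nwalkers; ++iw)
    evaluate(walkers[iw], n);
}
```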
I'd like to ask the wider community about potential causes of this bug.
Is there a thread-unsafe state machine associated with a kernel generated from a target region? Or any other potential reason for the error?

Best,
Ye

===================
Ye Luo, Ph.D.
Computational Science Division & Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Openmp-dev mailing list
Openmp-dev@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev