<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I think 3 is the best option in general, no?<br>
</p>
<div class="moz-cite-prefix">12/8/2020 7:23 PM, Johannes Doerfert
via cfe-dev пишет:<br>
</div>
<blockquote type="cite"
cite="mid:%3C4b65d30e-66a0-98c0-82e2-7b82d6bb7493@gmail.com%3E">Hi
Itay,
<br>
<br>
the problem is the "escape" of `local` later in the function (into
the
<br>
OpenMP reduction runtime call).
<br>
<br>
There are various solutions to this, on the compiler side, which
<br>
I'll try to describe briefly below. If you need a shortstop
solution
<br>
for this issue, you can introduce a secondary privatization
variable
<br>
yourself <a class="moz-txt-link-freetext" href="https://godbolt.org/z/3fxxTT">https://godbolt.org/z/3fxxTT</a> (untested). Each thread will
use
<br>
the user provided variable for accumulation and the OpenMP
reduction
<br>
facility is used for the final reduction across threads.
Obviously, this
<br>
is what the compiler should generate, so here we go:
<br>
<br>
1) Use `PointerMayBeCapturedBefore` instead of
`PointerMayBeCaptured` in
<br>
BasicAliasAnalysis.cpp (the call is done via
`isNonEscapingLocalObject`).
<br>
The reason we don't do this, I assume, is compile time. Maybe
someone
<br>
will try it out and measure the compile and runtime
implications but for
<br>
not I assume this to be a solution that might not be enacted
(though simple
<br>
and generic).
<br>
<br>
2) Modify the OpenMP reduction output to introduce the second
privatization
<br>
location. This should be relatively easy but it will only
address the problem
<br>
at hand. Similar problems pop up more often and I would prefer
1) or 3). That
<br>
said, we can for now enact 2) if someone writes the code ;)
<br>
<br>
3) Introduce an attribute/annotation that indicates this is
actually not an escaping
<br>
use. Such uses happen all the time (at least in the OpenMP
runtime) and it would
<br>
be ideal if we could indicate the memory store is not causing
the pointer to escape.
<br>
I'm thinking about this a bit more now, any input is welcome :)
<br>
<br>
I hope this helps, feel free to reach out if you have more
questions or comments.
<br>
<br>
~ Johannes
<br>
<br>
<br>
On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
<br>
<blockquote type="cite">Hey,
<br>
<br>
I've encountered a peculiar code generation issue in a
parallel-for-reduction.
<br>
Inside the per-thread reduction loop, I see a store to a stack
slot that happens
<br>
*every iteration*, clobbering the value written by the previous
one. The address
<br>
of that stack slot is later taken to pass to the inter-thread
<br>
reduction, but the store
<br>
is *not* hoisted outside the loop, while I'd expect it to happen
just
<br>
once outside it.
<br>
<br>
Attached below are the code example that triggers this, and an
annotated x86-64
<br>
assembly snippet from the outlined OMP function. The code was
compiled using
<br>
clang++-12 Ubuntu focal unstable branch, using the command-line:
<br>
clang++-12 -fopenmp -O3 main.cpp -o main
<br>
<br>
I'm wondering whether this is some sort of bug, and in which
component.
<br>
<br>
Regards,
<br>
~Itay
<br>
<br>
// main.cpp
<br>
#include <cstdio>
<br>
#include <memory>
<br>
<br>
double compute_dot_product(size_t n, double *xv, double *yv)
<br>
{
<br>
double local = 0.0;
<br>
#pragma omp parallel for reduction (+:local)
<br>
for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
<br>
return local;
<br>
}
<br>
<br>
int main(int argc, char **argv)
<br>
{
<br>
constexpr size_t n = 0x1000;
<br>
auto xv = std::make_unique<double[]>(n);
<br>
auto yv = std::make_unique<double[]>(n);
<br>
<br>
double result = compute_dot_product(n, xv.get(), yv.get());
<br>
printf("result = %e\n", result);
<br>
return 0;
<br>
}
<br>
<br>
// Disassembly excerpt from objdump -d --no-show-raw-insn main,
<br>
function <...>omp_outlined<...>
<br>
4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop
head
<br>
4012f5: mulsd (%rsi,%rcx,8),%xmm1
<br>
4012fa: addsd %xmm1,%xmm0
<br>
4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop
iteration
<br>
401303: add $0x1,%rcx
<br>
401307: add $0xffffffffffffffff,%rax
<br>
40130b: jne 4012f0 <.omp_outlined.+0xc0> ; <---
Non-unrolled loop latch
<br>
40130d: cmp $0x3,%rbp
<br>
401311: jb 40138b <.omp_outlined.+0x15b> ; <---
x4-unrolled loop guard
<br>
401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
<br>
401316: lea (%rsi,%rcx,8),%rsi
<br>
40131a: add $0x18,%rsi
<br>
40131e: lea (%rdx,%rcx,8),%rcx
<br>
401322: add $0x18,%rcx
<br>
401326: mov $0xffffffffffffffff,%rdx
<br>
40132d: nopl (%rax)
<br>
401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled
loop head
<br>
401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
<br>
40133c: addsd %xmm0,%xmm1
<br>
401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
<br>
401345: movsd -0x8(%rcx,%rdx,8),%xmm0
<br>
40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
<br>
401351: addsd %xmm1,%xmm0
<br>
401355: movsd %xmm0,(%rsp) ; <--- Weird store #2
<br>
40135a: movsd (%rcx,%rdx,8),%xmm1
<br>
40135f: mulsd (%rsi,%rdx,8),%xmm1
<br>
401364: addsd %xmm0,%xmm1
<br>
401368: movsd %xmm1,(%rsp) ; <--- Weird store #3
<br>
40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
<br>
401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
<br>
401379: addsd %xmm1,%xmm0
<br>
40137d: movsd %xmm0,(%rsp) ; <--- Weird store #4
<br>
401382: add $0x4,%rdx
<br>
401386: cmp %rdx,%rdi
<br>
401389: jne 401330 <.omp_outlined.+0x100> ; <---
x4-unrolled loop latch
<br>
40138b: mov $0x402028,%edi
<br>
401390: mov %r14d,%esi
<br>
401393: callq 401040 <__kmpc_for_static_fini@plt>
<br>
401398: mov %rsp,%rax ; <--- Load address of the stack slot
to pass to
<br>
reduction logic
<br>
40139b: mov %rax,0x20(%rsp)
<br>
_______________________________________________
<br>
cfe-dev mailing list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>
<br>
<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a>
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>