[cfe-dev] [OpenMP] Redundant store inside reduction loop body

Tue Dec 8 16:23:09 PST 2020

Hi Itay,

the problem is the "escape" of `local` later in the function (into the
OpenMP reduction runtime call).

There are various solutions to this, on the compiler side, which
I'll try to describe briefly below. If you need a shortstop solution
for this issue, you can introduce a secondary privatization variable
yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
the user provided variable for accumulation and the OpenMP reduction
facility is used for the final reduction across threads. Obviously, this
is what the compiler should generate, so here we go:

1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
    BasicAliasAnalysis.cpp (the call is done via 
`isNonEscapingLocalObject`).
    The reason we don't do this, I assume, is compile time. Maybe someone
    will try it out and measure the compile and runtime implications but for
    not I assume this to be a solution that might not be enacted (though 
simple
    and generic).

2) Modify the OpenMP reduction output to introduce the second privatization
    location. This should be relatively easy but it will only address 
the problem
    at hand. Similar problems pop up more often and I would prefer 1) or 
3). That
    said, we can for now enact 2) if someone writes the code ;)

3) Introduce an attribute/annotation that indicates this is actually not 
an escaping
    use. Such uses happen all the time (at least in the OpenMP runtime) 
and it would
    be ideal if we could indicate the memory store is not causing the 
pointer to escape.
    I'm thinking about this a bit more now, any input is welcome :)

I hope this helps, feel free to reach out if you have more questions or 
comments.

~ Johannes

On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
> Hey,
>
> I've encountered a peculiar code generation issue in a parallel-for-reduction.
> Inside the per-thread reduction loop, I see a store to a stack slot that happens
> *every iteration*, clobbering the value written by the previous one. The address
> of that stack slot is later taken to pass to the inter-thread
> reduction, but the store
> is *not* hoisted outside the loop, while I'd expect it to happen just
> once outside it.
>
> Attached below are the code example that triggers this, and an annotated x86-64
> assembly snippet from the outlined OMP function. The code was compiled using
> clang++-12 Ubuntu focal unstable branch, using the command-line:
> clang++-12 -fopenmp -O3 main.cpp -o main
>
> I'm wondering whether this is some sort of bug, and in which component.
>
> Regards,
> ~Itay
>
> // main.cpp
> #include <cstdio>
> #include <memory>
>
> double compute_dot_product(size_t n, double *xv, double *yv)
> {
>    double local = 0.0;
>    #pragma omp parallel for reduction (+:local)
>    for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
>    return local;
> }
>
> int main(int argc, char **argv)
> {
>    constexpr size_t n = 0x1000;
>    auto xv = std::make_unique<double[]>(n);
>    auto yv = std::make_unique<double[]>(n);
>
>    double result = compute_dot_product(n, xv.get(), yv.get());
>    printf("result = %e\n", result);
>    return 0;
> }
>
> // Disassembly excerpt from objdump -d --no-show-raw-insn main,
> function <...>omp_outlined<...>
> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
> 4012f5: mulsd (%rsi,%rcx,8),%xmm1
> 4012fa: addsd %xmm1,%xmm0
> 4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
> 401303: add $0x1,%rcx
> 401307: add $0xffffffffffffffff,%rax
> 40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
> 40130d: cmp $0x3,%rbp
> 401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
> 401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
> 401316: lea (%rsi,%rcx,8),%rsi
> 40131a: add $0x18,%rsi
> 40131e: lea (%rdx,%rcx,8),%rcx
> 401322: add $0x18,%rcx
> 401326: mov $0xffffffffffffffff,%rdx
> 40132d: nopl (%rax)
> 401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
> 401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
> 40133c: addsd %xmm0,%xmm1
> 401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
> 401345: movsd -0x8(%rcx,%rdx,8),%xmm0
> 40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
> 401351: addsd %xmm1,%xmm0
> 401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
> 40135a: movsd (%rcx,%rdx,8),%xmm1
> 40135f: mulsd (%rsi,%rdx,8),%xmm1
> 401364: addsd %xmm0,%xmm1
> 401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
> 40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
> 401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
> 401379: addsd %xmm1,%xmm0
> 40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
> 401382: add $0x4,%rdx
> 401386: cmp %rdx,%rdi
> 401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
> 40138b: mov $0x402028,%edi
> 401390: mov %r14d,%esi
> 401393: callq 401040 <__kmpc_for_static_fini at plt>
> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
> reduction logic
> 40139b: mov %rax,0x20(%rsp)
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev