[cfe-dev] [OpenMP] Redundant store inside reduction loop body

Wed Dec 9 06:10:46 PST 2020

I think 3 is the best option in general.

-------------
Best regards,
Alexey Bataev

12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:
> Hi Itay,
>
> the problem is the "escape" of `local` later in the function (into the
> OpenMP reduction runtime call).
>
> There are various solutions to this, on the compiler side, which
> I'll try to describe briefly below. If you need a shortstop solution
> for this issue, you can introduce a secondary privatization variable
> yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
> the user provided variable for accumulation and the OpenMP reduction
> facility is used for the final reduction across threads. Obviously, this
> is what the compiler should generate, so here we go:
>
> 1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
>    BasicAliasAnalysis.cpp (the call is done via
> `isNonEscapingLocalObject`).
>    The reason we don't do this, I assume, is compile time. Maybe someone
>    will try it out and measure the compile and runtime implications
> but for
>    not I assume this to be a solution that might not be enacted
> (though simple
>    and generic).
>
> 2) Modify the OpenMP reduction output to introduce the second
> privatization
>    location. This should be relatively easy but it will only address
> the problem
>    at hand. Similar problems pop up more often and I would prefer 1)
> or 3). That
>    said, we can for now enact 2) if someone writes the code ;)
>
> 3) Introduce an attribute/annotation that indicates this is actually
> not an escaping
>    use. Such uses happen all the time (at least in the OpenMP runtime)
> and it would
>    be ideal if we could indicate the memory store is not causing the
> pointer to escape.
>    I'm thinking about this a bit more now, any input is welcome :)
>
> I hope this helps, feel free to reach out if you have more questions
> or comments.
>
> ~ Johannes
>
>
> On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
>> Hey,
>>
>> I've encountered a peculiar code generation issue in a
>> parallel-for-reduction.
>> Inside the per-thread reduction loop, I see a store to a stack slot
>> that happens
>> *every iteration*, clobbering the value written by the previous one.
>> The address
>> of that stack slot is later taken to pass to the inter-thread
>> reduction, but the store
>> is *not* hoisted outside the loop, while I'd expect it to happen just
>> once outside it.
>>
>> Attached below are the code example that triggers this, and an
>> annotated x86-64
>> assembly snippet from the outlined OMP function. The code was
>> compiled using
>> clang++-12 Ubuntu focal unstable branch, using the command-line:
>> clang++-12 -fopenmp -O3 main.cpp -o main
>>
>> I'm wondering whether this is some sort of bug, and in which component.
>>
>> Regards,
>> ~Itay
>>
>> // main.cpp
>> #include <cstdio>
>> #include <memory>
>>
>> double compute_dot_product(size_t n, double *xv, double *yv)
>> {
>>    double local = 0.0;
>>    #pragma omp parallel for reduction (+:local)
>>    for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
>>    return local;
>> }
>>
>> int main(int argc, char **argv)
>> {
>>    constexpr size_t n = 0x1000;
>>    auto xv = std::make_unique<double[]>(n);
>>    auto yv = std::make_unique<double[]>(n);
>>
>>    double result = compute_dot_product(n, xv.get(), yv.get());
>>    printf("result = %e\n", result);
>>    return 0;
>> }
>>
>> // Disassembly excerpt from objdump -d --no-show-raw-insn main,
>> function <...>omp_outlined<...>
>> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
>> 4012f5: mulsd (%rsi,%rcx,8),%xmm1
>> 4012fa: addsd %xmm1,%xmm0
>> 4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
>> 401303: add $0x1,%rcx
>> 401307: add $0xffffffffffffffff,%rax
>> 40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
>> 40130d: cmp $0x3,%rbp
>> 401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
>> 401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
>> 401316: lea (%rsi,%rcx,8),%rsi
>> 40131a: add $0x18,%rsi
>> 40131e: lea (%rdx,%rcx,8),%rcx
>> 401322: add $0x18,%rcx
>> 401326: mov $0xffffffffffffffff,%rdx
>> 40132d: nopl (%rax)
>> 401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
>> 401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
>> 40133c: addsd %xmm0,%xmm1
>> 401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
>> 401345: movsd -0x8(%rcx,%rdx,8),%xmm0
>> 40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
>> 401351: addsd %xmm1,%xmm0
>> 401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
>> 40135a: movsd (%rcx,%rdx,8),%xmm1
>> 40135f: mulsd (%rsi,%rdx,8),%xmm1
>> 401364: addsd %xmm0,%xmm1
>> 401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
>> 40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
>> 401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
>> 401379: addsd %xmm1,%xmm0
>> 40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
>> 401382: add $0x4,%rdx
>> 401386: cmp %rdx,%rdi
>> 401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
>> 40138b: mov $0x402028,%edi
>> 401390: mov %r14d,%esi
>> 401393: callq 401040 <__kmpc_for_static_fini at plt>
>> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
>> reduction logic
>> 40139b: mov %rax,0x20(%rsp)
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20201209/0a93d2fa/attachment-0001.html>