[cfe-dev] [OpenMP] Redundant store inside reduction loop body

Wed Dec 9 08:32:06 PST 2020

I agree, 3 is preferable, but also the hardest :D

I'll continue to think about this. As I said, there is a user
code short stop and a clang short stop solution if this is time
critical.

I'll send an RFC for 3) as soon as I have a coherent model;
suggestions welcome.

~ Johannes

On 12/9/20 8:10 AM, Alexey.Bataev wrote:
> I think 3 is the best option in general.
>
> -------------
> Best regards,
> Alexey Bataev
>
> 12/8/2020 7:23 PM, Johannes Doerfert via cfe-dev пишет:
>> Hi Itay,
>>
>> the problem is the "escape" of `local` later in the function (into the
>> OpenMP reduction runtime call).
>>
>> There are various solutions to this, on the compiler side, which
>> I'll try to describe briefly below. If you need a shortstop solution
>> for this issue, you can introduce a secondary privatization variable
>> yourself https://godbolt.org/z/3fxxTT (untested). Each thread will use
>> the user provided variable for accumulation and the OpenMP reduction
>> facility is used for the final reduction across threads. Obviously, this
>> is what the compiler should generate, so here we go:
>>
>> 1) Use `PointerMayBeCapturedBefore` instead of `PointerMayBeCaptured` in
>>     BasicAliasAnalysis.cpp (the call is done via
>> `isNonEscapingLocalObject`).
>>     The reason we don't do this, I assume, is compile time. Maybe someone
>>     will try it out and measure the compile and runtime implications
>> but for
>>     not I assume this to be a solution that might not be enacted
>> (though simple
>>     and generic).
>>
>> 2) Modify the OpenMP reduction output to introduce the second
>> privatization
>>     location. This should be relatively easy but it will only address
>> the problem
>>     at hand. Similar problems pop up more often and I would prefer 1)
>> or 3). That
>>     said, we can for now enact 2) if someone writes the code ;)
>>
>> 3) Introduce an attribute/annotation that indicates this is actually
>> not an escaping
>>     use. Such uses happen all the time (at least in the OpenMP runtime)
>> and it would
>>     be ideal if we could indicate the memory store is not causing the
>> pointer to escape.
>>     I'm thinking about this a bit more now, any input is welcome :)
>>
>> I hope this helps, feel free to reach out if you have more questions
>> or comments.
>>
>> ~ Johannes
>>
>>
>> On 12/8/20 2:28 PM, Itay Bookstein via cfe-dev wrote:
>>> Hey,
>>>
>>> I've encountered a peculiar code generation issue in a
>>> parallel-for-reduction.
>>> Inside the per-thread reduction loop, I see a store to a stack slot
>>> that happens
>>> *every iteration*, clobbering the value written by the previous one.
>>> The address
>>> of that stack slot is later taken to pass to the inter-thread
>>> reduction, but the store
>>> is *not* hoisted outside the loop, while I'd expect it to happen just
>>> once outside it.
>>>
>>> Attached below are the code example that triggers this, and an
>>> annotated x86-64
>>> assembly snippet from the outlined OMP function. The code was
>>> compiled using
>>> clang++-12 Ubuntu focal unstable branch, using the command-line:
>>> clang++-12 -fopenmp -O3 main.cpp -o main
>>>
>>> I'm wondering whether this is some sort of bug, and in which component.
>>>
>>> Regards,
>>> ~Itay
>>>
>>> // main.cpp
>>> #include <cstdio>
>>> #include <memory>
>>>
>>> double compute_dot_product(size_t n, double *xv, double *yv)
>>> {
>>>     double local = 0.0;
>>>     #pragma omp parallel for reduction (+:local)
>>>     for (size_t i = 0; i < n; i++) local += xv[i] * yv[i];
>>>     return local;
>>> }
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     constexpr size_t n = 0x1000;
>>>     auto xv = std::make_unique<double[]>(n);
>>>     auto yv = std::make_unique<double[]>(n);
>>>
>>>     double result = compute_dot_product(n, xv.get(), yv.get());
>>>     printf("result = %e\n", result);
>>>     return 0;
>>> }
>>>
>>> // Disassembly excerpt from objdump -d --no-show-raw-insn main,
>>> function <...>omp_outlined<...>
>>> 4012f0: movsd (%rdx,%rcx,8),%xmm1 ; <--- Non-unrolled loop head
>>> 4012f5: mulsd (%rsi,%rcx,8),%xmm1
>>> 4012fa: addsd %xmm1,%xmm0
>>> 4012fe: movsd %xmm0,(%rsp) ; <--- Un-hoisted store every loop iteration
>>> 401303: add $0x1,%rcx
>>> 401307: add $0xffffffffffffffff,%rax
>>> 40130b: jne 4012f0 <.omp_outlined.+0xc0>  ; <--- Non-unrolled loop latch
>>> 40130d: cmp $0x3,%rbp
>>> 401311: jb 40138b <.omp_outlined.+0x15b> ; <--- x4-unrolled loop guard
>>> 401313: sub %rcx,%rdi ; <--- x4-unrolled loop preheader
>>> 401316: lea (%rsi,%rcx,8),%rsi
>>> 40131a: add $0x18,%rsi
>>> 40131e: lea (%rdx,%rcx,8),%rcx
>>> 401322: add $0x18,%rcx
>>> 401326: mov $0xffffffffffffffff,%rdx
>>> 40132d: nopl (%rax)
>>> 401330: movsd -0x10(%rcx,%rdx,8),%xmm1 ; <--- x4-unrolled loop head
>>> 401336: mulsd -0x10(%rsi,%rdx,8),%xmm1
>>> 40133c: addsd %xmm0,%xmm1
>>> 401340: movsd %xmm1,(%rsp) ; <--- Weird store #1
>>> 401345: movsd -0x8(%rcx,%rdx,8),%xmm0
>>> 40134b: mulsd -0x8(%rsi,%rdx,8),%xmm0
>>> 401351: addsd %xmm1,%xmm0
>>> 401355: movsd %xmm0,(%rsp)  ; <--- Weird store #2
>>> 40135a: movsd (%rcx,%rdx,8),%xmm1
>>> 40135f: mulsd (%rsi,%rdx,8),%xmm1
>>> 401364: addsd %xmm0,%xmm1
>>> 401368: movsd %xmm1,(%rsp)  ; <--- Weird store #3
>>> 40136d: movsd 0x8(%rcx,%rdx,8),%xmm0
>>> 401373: mulsd 0x8(%rsi,%rdx,8),%xmm0
>>> 401379: addsd %xmm1,%xmm0
>>> 40137d: movsd %xmm0,(%rsp)  ; <--- Weird store #4
>>> 401382: add $0x4,%rdx
>>> 401386: cmp %rdx,%rdi
>>> 401389: jne 401330 <.omp_outlined.+0x100> ; <--- x4-unrolled loop latch
>>> 40138b: mov $0x402028,%edi
>>> 401390: mov %r14d,%esi
>>> 401393: callq 401040 <__kmpc_for_static_fini at plt>
>>> 401398: mov %rsp,%rax ; <--- Load address of the stack slot to pass to
>>> reduction logic
>>> 40139b: mov %rax,0x20(%rsp)
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev