[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Tue Jul 24 11:41:06 PDT 2018



On 07/24/2018 12:07 PM, Craig Topper wrote:
> With maximize-bandwidth I'd still end up with the extra vpxors above
> the loop and the extra addition reduction steps at the end that we get
> from forcing the VF to 16, right?

Yeah, I think we'd need something else to help with that. My underlying
thought here is that, based on your description, we really do want a VF
of 16 (because we want to use 256-bit loads, etc.). And so, when
thinking about how to fix things, we should treat the VF = 16 output,
not the VF = 8 output, as the right starting point for the backend.

 -Hal

>
> ~Craig
>
>
> On Tue, Jul 24, 2018 at 10:04 AM Craig Topper <craig.topper at gmail.com
> <mailto:craig.topper at gmail.com>> wrote:
>
>
>
>     On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov
>     <mailto:hfinkel at anl.gov>> wrote:
>
>
>         On 07/23/2018 06:37 PM, Craig Topper wrote:
>>
>>         ~Craig
>>
>>
>>         On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov
>>         <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>>             On 07/23/2018 05:22 PM, Craig Topper wrote:
>>>             Hello all,
>>>
>>>             This code https://godbolt.org/g/tTyxpf is a dot product
>>>             reduction loop multiplying sign-extended 16-bit values to
>>>             produce a 32-bit accumulated result. The x86 backend is
>>>             currently not able to optimize it as well as gcc and
>>>             icc do. The IR we are getting from the loop vectorizer has
>>>             several v8i32 adds and muls inside the loop. These are
>>>             fed by v8i16 loads and sexts from v8i16 to v8i32. The
>>>             x86 backend recognizes that these are addition
>>>             reductions of multiplications, so we use the vpmaddwd
>>>             instruction, which calculates 32-bit products from 16-bit
>>>             inputs and does a horizontal add of adjacent pairs. A
>>>             vpmaddwd given two v8i16 inputs will produce a v4i32 result.
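
For reference, a minimal C sketch of the kind of loop being described
(the names are illustrative, not the exact source from the godbolt link):

    #include <stdint.h>

    int32_t dot(const int16_t *a, const int16_t *b, int n) {
      int32_t sum = 0;
      for (int i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];  /* 16x16 -> 32-bit multiply, 32-bit accumulate */
      return sum;
    }
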
>>
>>
>>         That godbolt link seems wrong. It wasn't supposed to be clang
>>         IR. This should be right.
>>          
>>
>>>
>>>             In the example code, because we are reducing the number
>>>             of elements from 8->4 in the vpmaddwd step, we are left
>>>             with a width mismatch between vpmaddwd and the vpaddd
>>>             instruction that we use to sum with the results from the
>>>             previous loop iterations. We rely on the fact that a
>>>             128-bit vpmaddwd zeros the upper bits of the register,
>>>             which lets us use a 256-bit vpaddd instruction so that
>>>             the upper elements can keep going around the loop
>>>             without being disturbed in case they weren't initialized
>>>             to 0. But this still means the vpmaddwd instruction is
>>>             doing half the amount of work the CPU is capable of,
>>>             compared to using a 256-bit vpmaddwd instruction.
>>>             Additionally, future x86 CPUs will gain an instruction
>>>             that does VPMADDWD and VPADDD in one instruction, but
>>>             the width mismatch makes that new instruction difficult
>>>             to utilize.
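
As a rough intrinsics sketch of that contrast (illustrative only, not the
backend's actual output; the helper names are made up, and AVX2 is assumed):

    #include <immintrin.h>
    #include <stdint.h>

    /* Approximately the current VF=8 shape: a 128-bit vpmaddwd whose v4i32
       result is widened with zeros so a 256-bit vpaddd can accumulate it. */
    static __m256i step_narrow(__m256i acc, const int16_t *a, const int16_t *b) {
      __m128i va = _mm_loadu_si128((const __m128i *)a);            /* 8 x i16 */
      __m128i vb = _mm_loadu_si128((const __m128i *)b);
      __m128i prod = _mm_madd_epi16(va, vb);                       /* 128-bit vpmaddwd -> 4 x i32 */
      return _mm256_add_epi32(acc, _mm256_zextsi128_si256(prod));  /* 256-bit vpaddd */
    }

    /* What a VF of 16 would allow: a full-width vpmaddwd feeding a matching vpaddd. */
    static __m256i step_wide(__m256i acc, const int16_t *a, const int16_t *b) {
      __m256i va = _mm256_loadu_si256((const __m256i *)a);         /* 16 x i16 */
      __m256i vb = _mm256_loadu_si256((const __m256i *)b);
      __m256i prod = _mm256_madd_epi16(va, vb);                    /* 256-bit vpmaddwd -> 8 x i32 */
      return _mm256_add_epi32(acc, prod);
    }
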
>>>
>>>             In order for the backend to handle this better, it would
>>>             be great if we could have something like two v32i8
>>>             loads and two shufflevectors to extract the even elements
>>>             and the odd elements, creating four v16i8 pieces.
>>
>>             Why v*i8 loads? I thought that we have 16-bit and 32-bit
>>             types here?
>>
>>
>>         Oops that should have been v16i16. Mixed up my 256-bit types.
>>          
>>
>>
>>>             Sign extend each of those pieces. Multiply the two even
>>>             pieces and the two odd pieces separately, then sum those
>>>             products with a v8i32 add. Then use another v8i32 add to
>>>             accumulate with the result from the previous loop iterations.
>>
>
>         I'm still missing something. Why do you want to separate out
>         the even and odd parts instead of just adding up the first
>         half of the numbers and the second half?
>
>
>     Doing even/odd matches up with a pattern I already have to support
>     for the code in https://reviews.llvm.org/D49636. I wouldn't even
>     need to detect it as a reduction to do the reassociation, since
>     even/odd exactly matches the behavior of the instruction. But
>     you're right, we could also just detect the reduction and add the
>     two halves.
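
In scalar terms, the even/odd pairing that vpmaddwd (and the D49636 pattern)
performs looks roughly like the following sketch, with each vpmaddwd lane
computing one even + odd pair (names are illustrative; n is assumed even):

    #include <stdint.h>

    int32_t dot_even_odd(const int16_t *a, const int16_t *b, int n) {
      int32_t sum = 0;
      for (int i = 0; i < n; i += 2) {
        int32_t even = (int32_t)a[i]     * (int32_t)b[i];      /* even-element product */
        int32_t odd  = (int32_t)a[i + 1] * (int32_t)b[i + 1];  /* odd-element product  */
        sum += even + odd;  /* the pairing one vpmaddwd lane performs */
      }
      return sum;
    }
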
>
>      
>
>
>         Thanks again,
>         Hal
>
>>>             This ensures that no piece exceeds the target vector
>>>             width and that the final operation is correctly sized to
>>>             go around the loop in one register. All but the last add
>>>             can then be pattern matched to vpmaddwd as proposed
>>>             in https://reviews.llvm.org/D49636. And for the future
>>>             CPU the whole thing can be matched to the new instruction.
>>>
>>>             Do other targets have a similar instruction or a similar
>>>             issue to this? Is this something we can solve in the
>>>             loop vectorizer? Or should we have a separate IR
>>>             transformation that can recognize this pattern and
>>>             generate the new sequence? As a separate pass, we would
>>>             need to pair two vector loads together, remove a
>>>             reduction step outside the loop, and remove half the
>>>             phis, assuming the loop was partially unrolled. Or, if
>>>             there were only one add/mul inside the loop, we'd have to
>>>             reduce its width and the width of the phi.
>>
>>             Can you explain how the desired code from the vectorizer
>>             differs from the code that the vectorizer produces if you
>>             add '#pragma clang loop vectorize(enable)
>>             vectorize_width(16)'  above the loop? I tried it in your
>>             godbolt example and the generated code looks very similar
>>             to the icc-generated code.
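
For reference, the pragma applied to the loop would look something like this
(a sketch reusing the illustrative names from the earlier C version):

    #include <stdint.h>

    int32_t dot_vf16(const int16_t *a, const int16_t *b, int n) {
      int32_t sum = 0;
    #pragma clang loop vectorize(enable) vectorize_width(16)
      for (int i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
      return sum;
    }
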
>>
>>
>>         It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being
>>         unnecessarily carried across the loop. It's then redundantly
>>         added twice in the reduction after the loop despite being
>>         0. This happens because we basically tricked the backend into
>>         generating a 256-bit vpmaddwd concatenated with a 256-bit zero
>>         vector going into a 512-bit vpaddd before type legalization.
>>         The 512-bit concat and vpaddd get split during type
>>         legalization, and the high half of the add gets constant
>>         folded away. I'm guessing we probably finished with 4 vpxors
>>         before the loop, but MachineCSE (or some other pass?) combined
>>         two of them when it figured out the loop didn't modify them.
>>          
>>
>>
>>             Thanks again,
>>             Hal
>>
>>>
>>>             Thanks,
>>>             ~Craig
>>
>>             -- 
>>             Hal Finkel
>>             Lead, Compiler Technology and Programming Languages
>>             Leadership Computing Facility
>>             Argonne National Laboratory
>>
>
>         -- 
>         Hal Finkel
>         Lead, Compiler Technology and Programming Languages
>         Leadership Computing Facility
>         Argonne National Laboratory
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
