[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Tue Jul 24 10:04:58 PDT 2018


On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov> wrote:

>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>
>
> ~Craig
>
>
> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
>> multiplying sign-extended 16-bit values to produce a 32-bit accumulated
>> result. The x86 backend is currently not able to optimize it as well as gcc
>> and icc do. The IR we are getting from the loop vectorizer has several v8i32
>> adds and muls inside the loop. These are fed by v8i16 loads and sexts from
>> v8i16 to v8i32. The x86 backend recognizes that these form an add reduction
>> of multiplies, so we use the vpmaddwd instruction, which calculates 32-bit
>> products from 16-bit inputs and does a horizontal add of adjacent pairs. A
>> vpmaddwd given two v8i16 inputs will produce a v4i32 result.
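>>
>> For concreteness, the per-iteration IR we get today looks roughly like
>> this (a minimal sketch; the value names and pointer operands are
>> illustrative, not copied from the godbolt output):
>>
>>   %a = load <8 x i16>, <8 x i16>* %pa        ; 8 lanes of 16-bit input
>>   %b = load <8 x i16>, <8 x i16>* %pb
>>   %sa = sext <8 x i16> %a to <8 x i32>       ; widen before multiplying
>>   %sb = sext <8 x i16> %b to <8 x i32>
>>   %m = mul nsw <8 x i32> %sa, %sb            ; the v8i32 mul
>>   %acc.next = add nsw <8 x i32> %m, %acc     ; %acc is the reduction phi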
>>
>>
> That godbolt link seems wrong. It wasn't supposed to be clang IR. This
> should be right.
>
>
>>
>> In the example code, because we are reducing the number of elements from
>> 8->4 in the vpmaddwd step, we are left with a width mismatch between
>> vpmaddwd and the vpaddd instruction that we use to accumulate results from
>> the previous loop iterations. We rely on the fact that a 128-bit vpmaddwd
>> zeros the upper bits of the register, which lets us use a 256-bit vpaddd
>> instruction; the upper elements can then keep going around the loop
>> undisturbed in case they weren't initialized to 0. But this still means the
>> vpmaddwd instruction is doing half the work the CPU would be capable of if
>> we had been able to use a 256-bit vpmaddwd instruction. Additionally,
>> future x86 CPUs will be gaining an instruction that can do VPMADDWD and
>> VPADDD in one instruction, but that width mismatch makes the new
>> instruction difficult to utilize.
>>
>> In order for the backend to handle this better it would be great if we
>> could have something like two v32i8 loads, two shufflevectors to extract
>> the even elements and the odd elements to create four v16i8 pieces.
>>
>>
>> Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?
>>
>
> Oops that should have been v16i16. Mixed up my 256-bit types.
>
>
>>
>> Sign extend each of those pieces. Multiply the two even pieces and the
>> two odd pieces separately, sum those results with a v8i32 add. Then another
>> v8i32 add to accumulate the previous loop iterations.
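>>
>> Concretely, that might look something like the following (a sketch using
>> the corrected v16i16 types; value names and pointer operands are
>> illustrative):
>>
>>   %a = load <16 x i16>, <16 x i16>* %pa
>>   %b = load <16 x i16>, <16 x i16>* %pb
>>   %a.even = shufflevector <16 x i16> %a, <16 x i16> undef,
>>     <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
>>   %a.odd = shufflevector <16 x i16> %a, <16 x i16> undef,
>>     <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
>>   %b.even = shufflevector <16 x i16> %b, <16 x i16> undef,
>>     <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
>>   %b.odd = shufflevector <16 x i16> %b, <16 x i16> undef,
>>     <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
>>   %sa.even = sext <8 x i16> %a.even to <8 x i32>
>>   %sb.even = sext <8 x i16> %b.even to <8 x i32>
>>   %sa.odd = sext <8 x i16> %a.odd to <8 x i32>
>>   %sb.odd = sext <8 x i16> %b.odd to <8 x i32>
>>   %m.even = mul nsw <8 x i32> %sa.even, %sb.even
>>   %m.odd = mul nsw <8 x i32> %sa.odd, %sb.odd
>>   %sum = add nsw <8 x i32> %m.even, %m.odd
>>   %acc.next = add nsw <8 x i32> %sum, %acc   ; accumulate prior iterations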
>>
>>
> I'm still missing something. Why do you want to separate out the even and
> odd parts instead of just adding up the first half of the numbers and the
> second half?
>

Doing even/odd matches up with a pattern I already have to support for the
code in https://reviews.llvm.org/D49636. I wouldn't even need to detect it
as a reduction to do the reassociation, since even/odd exactly matches the
behavior of the instruction. But you're right, we could also just detect the
reduction and add the two halves.
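
To make that concrete: vpmaddwd computes result[i] = a[2*i]*b[2*i] +
a[2*i+1]*b[2*i+1], so the (mul even, mul odd, add) triple in the sketch
above is exactly one 256-bit vpmaddwd's worth of work (reusing the
illustrative names from that sketch):

  %m.even = mul nsw <8 x i32> %sa.even, %sb.even
  %m.odd  = mul nsw <8 x i32> %sa.odd, %sb.odd
  %sum    = add nsw <8 x i32> %m.even, %m.odd
  ; ^ matchable to one 256-bit vpmaddwd of the original v16i16 inputs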



>
> Thanks again,
> Hal
>
>> This ensures that no pieces exceed the target vector width and the final
>> operation is correctly sized to go around the loop in one register. All but
>> the last add can then be pattern matched to vpmaddwd as proposed in
>> https://reviews.llvm.org/D49636. And for the future CPU the whole thing
>> can be matched to the new instruction.
>>
>> Do other targets have a similar instruction or a similar issue to this?
>> Is this something we can solve in the loop vectorizer? Or should we have a
>> separate IR transformation that can recognize this pattern and generate the
>> new sequence? As a separate pass, we would need to pair two vector loads
>> together, remove a reduction step outside the loop, and remove half the
>> phis, assuming the loop was partially unrolled. Or, if there was only one
>> add/mul inside the loop, we'd have to reduce its width and the width of the
>> phi.
>>
>>
>> Can you explain how the desired code from the vectorizer differs from the
>> code that the vectorizer produces if you add '#pragma clang loop
>> vectorize(enable) vectorize_width(16)' above the loop? I tried it in your
>> godbolt example and the generated code looks very similar to the
>> icc-generated code.
>>
>
> It's similar, but the vpxor %xmm0, %xmm0, %xmm0 result is being
> unnecessarily carried across the loop. It's then redundantly added twice in
> the reduction after the loop despite being 0. This happens because we
> basically tricked the backend into generating a 256-bit vpmaddwd
> concatenated with a 256-bit zero vector going into a 512-bit vpaddd before
> type legalization. The 512-bit concat and vpaddd get split during type
> legalization, and the high half of the add gets constant folded away. I'm
> guessing we probably finished with 4 vpxors before the loop, but MachineCSE
> (or some other pass?) combined two of them when it figured out the loop
> didn't modify them.
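>
> In IR terms, the pragma hands the backend roughly this (a sketch; names
> illustrative):
>
>   %a = load <16 x i16>, <16 x i16>* %pa
>   %b = load <16 x i16>, <16 x i16>* %pb
>   %sa = sext <16 x i16> %a to <16 x i32>
>   %sb = sext <16 x i16> %b to <16 x i32>
>   %m = mul nsw <16 x i32> %sa, %sb
>   %acc.next = add nsw <16 x i32> %m, %acc
>
> The backend forms a 256-bit vpmaddwd from the mul, concatenates it with
> zeros to feed the 512-bit add, and type legalization then splits that add
> as described above.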
>
>
>>
>> Thanks again,
>> Hal
>>
>>
>> Thanks,
>> ~Craig
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>