[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Tue Jul 24 11:41:06 PDT 2018
On 07/24/2018 12:07 PM, Craig Topper wrote:
> With maximize-bandwidth I'd still end up with the extra vpxors above
> the loop and the extra addition reduction steps at the end that we get
> from forcing the VF to 16, right?
Yea, I think we'd need something else to help with that. My underlying
thought here is that, based on your description, we really do want a VF
of 16 (because we want to use 256-bit loads, etc.). And so, when
thinking about how to fix things, we should treat the VF = 16 output,
and not the VF = 8 output, as the right starting point for the backend.
-Hal
>
> ~Craig
>
>
> On Tue, Jul 24, 2018 at 10:04 AM Craig Topper <craig.topper at gmail.com
> <mailto:craig.topper at gmail.com>> wrote:
>
>
>
> On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov
> <mailto:hfinkel at anl.gov>> wrote:
>
>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>>
>> ~Craig
>>
>>
>> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov
>> <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>> Hello all,
>>>
>>> This code https://godbolt.org/g/tTyxpf is a dot product
>>> reduction loop multiplying sign-extended 16-bit values to
>>> produce a 32-bit accumulated result. The x86 backend is
>>> currently not able to optimize it as well as gcc and
>>> icc. The IR we are getting from the loop vectorizer has
>>> several v8i32 adds and muls inside the loop. These are
>>> fed by v8i16 loads and sexts from v8i16 to v8i32. The
>>> x86 backend recognizes that these are addition
>>> reductions of multiplications, so we use the vpmaddwd
>>> instruction which calculates 32-bit products from 16-bit
>>> inputs and does a horizontal add of adjacent pairs. A
>>> vpmaddwd given two v8i16 inputs will produce a v4i32 result.
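>>>
>>> Concretely, the loop body IR has roughly this shape (a sketch with
>>> invented value names, address GEPs omitted, not the exact
>>> vectorizer output):
>>>
>>> loop:
>>>   %acc = phi <8 x i32> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
>>>   %a = load <8 x i16>, <8 x i16>* %pa, align 2
>>>   %b = load <8 x i16>, <8 x i16>* %pb, align 2
>>>   %a.ext = sext <8 x i16> %a to <8 x i32>
>>>   %b.ext = sext <8 x i16> %b to <8 x i32>
>>>   %mul = mul nsw <8 x i32> %a.ext, %b.ext
>>>   %acc.next = add nsw <8 x i32> %mul, %acc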
>>
>>
>> That godbolt link seems wrong. It wasn't supposed to be clang
>> IR. This should be right.
>>
>>
>>>
>>> In the example code, because we are reducing the number
>>> of elements from 8->4 in the vpmaddwd step, we are left
>>> with a width mismatch between vpmaddwd and the vpaddd
>>> instruction that we use to sum with the results from the
>>> previous loop iterations. We rely on the fact that a
>>> 128-bit vpmaddwd zeros the upper bits of the register, so
>>> we can use a 256-bit vpaddd instruction and let the upper
>>> elements keep going around the loop undisturbed in case
>>> they weren't initialized to 0. But this still means the
>>> vpmaddwd instruction is doing half the work the CPU would
>>> be capable of with a 256-bit vpmaddwd instruction.
>>> Additionally, future x86 CPUs will be gaining an
>>> instruction that can do VPMADDWD and VPADDD in one
>>> instruction, but that width mismatch makes that
>>> instruction difficult to utilize.
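>>>
>>> In IR terms, what we effectively end up selecting is something
>>> like this (a sketch; @llvm.x86.sse2.pmadd.wd is the real
>>> intrinsic, the value names are invented):
>>>
>>>   %madd = call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %a, <8 x i16> %b)
>>>   ; concat with zero to get back to the accumulator width
>>>   %wide = shufflevector <4 x i32> %madd, <4 x i32> zeroinitializer,
>>>           <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>>>   ; the 256-bit vpaddd; the high lanes just carry %acc around
>>>   %acc.next = add <8 x i32> %wide, %acc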
>>>
>>> In order for the backend to handle this better it would
>>> be great if we could have something like two v32i8
>>> loads, two shufflevectors to extract the even elements
>>> and the odd elements to create four v16i8 pieces.
>>
>> Why v*i8 loads? I thought that we have 16-bit and 32-bit
>> types here?
>>
>>
>> Oops, that should have been v16i16. I mixed up my 256-bit types.
>>
>>
>>
>>> Sign extend each of those pieces. Multiply the two even
>>> pieces and the two odd pieces separately, sum those
>>> results with a v8i32 add. Then another v8i32 add to
>>> accumulate the previous loop iterations.
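>>>
>>> Sketched in IR (invented names; v16i16 loads and v8i16 pieces per
>>> the type correction above):
>>>
>>>   %a = load <16 x i16>, <16 x i16>* %pa, align 2
>>>   %b = load <16 x i16>, <16 x i16>* %pb, align 2
>>>   %a.even = shufflevector <16 x i16> %a, <16 x i16> undef,
>>>     <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
>>>   %a.odd  = shufflevector <16 x i16> %a, <16 x i16> undef,
>>>     <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
>>>   %b.even = shufflevector <16 x i16> %b, <16 x i16> undef,
>>>     <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
>>>   %b.odd  = shufflevector <16 x i16> %b, <16 x i16> undef,
>>>     <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
>>>   %ae = sext <8 x i16> %a.even to <8 x i32>
>>>   %ao = sext <8 x i16> %a.odd to <8 x i32>
>>>   %be = sext <8 x i16> %b.even to <8 x i32>
>>>   %bo = sext <8 x i16> %b.odd to <8 x i32>
>>>   %mul.even = mul nsw <8 x i32> %ae, %be
>>>   %mul.odd  = mul nsw <8 x i32> %ao, %bo
>>>   %pairs = add nsw <8 x i32> %mul.even, %mul.odd ; vpmaddwd result
>>>   %acc.next = add nsw <8 x i32> %pairs, %acc     ; loop-carried sum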
>>
>
> I'm still missing something. Why do you want to separate out
> the even and odd parts instead of just adding up the first
> half of the numbers and the second half?
>
>
> Doing even/odd matches up with a pattern I already have to support
> for the code in https://reviews.llvm.org/D49636. I wouldn't even
> need to detect it as a reduction to do the reassociation since
> even/odd exactly matches the behavior of the instruction. But
> you're right we could also just detect the reduction and add two
> halves.
>
>
>
>
> Thanks again,
> Hal
>
>>> This ensures that no pieces exceed the target vector
>>> width and that the final operation is correctly sized to
>>> go around the loop in one register. All but the last add
>>> can then be pattern matched to vpmaddwd as proposed
>>> in https://reviews.llvm.org/D49636. And for the future
>>> CPU the whole thing can be matched to the new instruction.
>>>
>>> Do other targets have a similar instruction or a similar
>>> issue to this? Is this something we can solve in the
>>> loop vectorizer? Or should we have a separate IR
>>> transformation that can recognize this pattern and
>>> generate the new sequence? As a separate pass we would
>>> need to pair two vector loads together, remove a
>>> reduction step outside the loop, and remove half the phis,
>>> assuming the loop was partially unrolled. Or if there
>>> was only one add/mul inside the loop we'd have to reduce
>>> its width and the width of the phi.
>>
>> Can you explain how the desired code from the vectorizer
>> differs from the code that the vectorizer produces if you
>> add '#pragma clang loop vectorize(enable)
>> vectorize_width(16)' above the loop? I tried it in your
>> godbolt example and the generated code looks very similar
>> to the icc-generated code.
>>
>>
>> It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being
>> unnecessarily carried across the loop. It's then redundantly
>> added twice in the reduction after the loop despite it being
>> 0. This happens because we basically tricked the backend into
>> generating a 256-bit vpmaddwd concatenated with a 256-bit
>> zero vector going into a 512-bit vpaddd before type
>> legalization. The 512-bit concat and vpaddd get split during
>> type legalization, and the high half of the add gets
>> constant folded away. I'm guessing we probably ended up with
>> 4 vpxors before the loop, but MachineCSE (or some other
>> pass?) combined two of them when it figured out the loop
>> didn't modify them.
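>>
>> Roughly, in IR terms, the pattern we tricked it into building was
>> (a sketch; @llvm.x86.avx2.pmadd.wd is the real intrinsic, the
>> value names are invented):
>>
>>   %madd = call <8 x i32> @llvm.x86.avx2.pmadd.wd(<16 x i16> %a, <16 x i16> %b)
>>   ; concat with zero up to 512 bits
>>   %wide = shufflevector <8 x i32> %madd, <8 x i32> zeroinitializer,
>>     <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7,
>>                 i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
>>   %acc.next = add <16 x i32> %wide, %acc   ; the 512-bit add
>>
>> After the split, the high half of the add is just the high half of
>> %acc plus zero, which folds away, but the vpxor feeding the high
>> half of the accumulator phi survives.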
>>
>>
>>
>> Thanks again,
>> Hal
>>
>>>
>>> Thanks,
>>> ~Craig
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory