[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Mon Jul 23 16:34:55 PDT 2018


On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction
>> loop multipying sign extended 16-bit values to produce a 32-bit
>> accumulated result. The x86 backend is currently not able to optimize
>> it as well as gcc and icc. The IR we are getting from the loop
>> vectorizer has several v8i32 adds and muls inside the loop. These are
>> fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend
>> recognizes that these are addition reductions of multiplication so we
>> use the vpmaddwd instruction which calculates 32-bit products from
>> 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd
>> given two v8i16 inputs will produce a v4i32 result.
>>
>> In the example code, because we are reducing the number of elements
>> from 8->4 in the vpmaddwd step we are left with a width mismatch
>> between vpmaddwd and the vpaddd instruction that we use to sum with
>> the results from the previous loop iterations. We rely on the fact
>> that a 128-bit vpmaddwd zeros the upper bits of the register so that
>> we can use a 256-bit vpaddd instruction so that the upper elements
>> can keep going around the loop without being disturbed in case they
>> weren't initialized to 0. But this still means the vpmaddwd
>> instruction is doing half the amount of work the CPU is capable of if
>> we had been able to use a 256-bit vpmaddwd instruction. Additionally,
>> future x86 CPUs will be gaining an instruction that can do VPMADDWD
>> and VPADDD in one instruction, but that width mismatch makes that
>> instruction difficult to utilize.
>>
>> In order for the backend to handle this better it would be great if
>> we could have something like two v32i8 loads, two shufflevectors to
>> extract the even elements and the odd elements to create four v16i8
>> pieces.
>
> Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?
>
>> Sign extend each of those pieces. Multiply the two even pieces and
>> the two odd pieces separately, sum those results with a v8i32 add.
>> Then another v8i32 add to accumulate the previous loop iterations.
>> Then ensures that no pieces exceed the target vector width and the
>> final operation is correctly sized to go around the loop in one
>> register. All but the last add can then be pattern matched to
>> vpmaddwd as proposed in https://reviews.llvm.org/D49636. And for the
>> future CPU the whole thing can be matched to the new instruction.
>>
>> Do other targets have a similar instruction or a similar issue to
>> this? Is this something we can solve in the loop vectorizer? Or
>> should we have a separate IR transformation that can recognize this
>> pattern and generate the new sequence? As a separate pass we would
>> need to pair two vector loads together, remove a reduction step
>> outside the loop and remove half the phis assuming the loop was
>> partially unrolled. Or if there was only one add/mul inside the loop
>> we'd have to reduce its width and the width of the phi.
>
> Can you explain how the desired code from the vectorizer differs from
> the code that the vectorizer produces if you add '#pragma clang loop
> vectorize(enable) vectorize_width(16)'  above the loop? I tried it in
> your godbolt example and the generated code looks very similar to the
> icc-generated code.

(specifically, I mean this: https://godbolt.org/g/LJA38e)

>
> Thanks again,
> Hal
>
>>
>> Thanks,
>> ~Craig
>
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180723/ac36030b/attachment.html>