[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org>
Mon Jul 23 16:23:53 PDT 2018


On 07/23/2018 05:22 PM, Craig Topper wrote:
> Hello all,
>
> This code https://godbolt.org/g/tTyxpf is a dot product reduction
> loop multiplying sign-extended 16-bit values to produce a 32-bit
> accumulated result. The x86 backend is currently not able to optimize
> it as well as gcc and icc do. The IR we are getting from the loop
> vectorizer has several v8i32 adds and muls inside the loop, fed by
> v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes
> these as an add reduction of multiplies, so we use the vpmaddwd
> instruction, which computes 32-bit products from 16-bit inputs and
> horizontally adds adjacent pairs. A vpmaddwd given two v8i16 inputs
> produces a v4i32 result.
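For reference, the loop body described above has roughly this shape in
IR -- a sketch with illustrative value names, not the exact godbolt
output:

  vector.body:
    %acc = phi <8 x i32> [ zeroinitializer, %entry ], [ %add, %vector.body ]
    %va  = load <8 x i16>, <8 x i16>* %pa   ; 8 x 16-bit elements of a
    %vb  = load <8 x i16>, <8 x i16>* %pb   ; 8 x 16-bit elements of b
    %sa  = sext <8 x i16> %va to <8 x i32>
    %sb  = sext <8 x i16> %vb to <8 x i32>
    %mul = mul nsw <8 x i32> %sa, %sb       ; 32-bit products
    %add = add <8 x i32> %mul, %acc         ; running v8i32 accumulator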
>
> In the example code, because the vpmaddwd step reduces the number of
> elements from 8 to 4, we are left with a width mismatch between
> vpmaddwd and the vpaddd instruction that we use to sum with the
> results from the previous loop iterations. We rely on the fact that a
> 128-bit vpmaddwd zeros the upper bits of the register, which lets us
> use a 256-bit vpaddd so that the upper elements can keep going around
> the loop undisturbed in case they weren't initialized to 0. But this
> still means the vpmaddwd instruction is doing half the work the CPU
> would be capable of with a 256-bit vpmaddwd. Additionally, future x86
> CPUs will gain an instruction that does VPMADDWD and VPADDD in one
> step, but the width mismatch makes that instruction difficult to
> utilize.
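In IR terms, the machine-level trick described above is equivalent to
widening the v4i32 vpmaddwd result with zeroes before the 256-bit add
-- a sketch using the SSE2 pmaddwd intrinsic, with illustrative names:

  ; 128-bit vpmaddwd: multiply 16-bit pairs, add adjacent 32-bit products
  %pmadd = call <4 x i32> @llvm.x86.sse2.pmadd.wd(<8 x i16> %va, <8 x i16> %vb)
  ; the implicitly zeroed upper half lets the result feed a 256-bit add
  %wide  = shufflevector <4 x i32> %pmadd, <4 x i32> zeroinitializer,
           <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %acc   = add <8 x i32> %wide, %acc.phi    ; the 256-bit vpaddd

Only the low four lanes carry new work each iteration, which is the
half-throughput problem.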
>
> In order for the backend to handle this better, it would be great if
> we could have something like two v32i8 loads and two shufflevectors
> extracting the even elements and the odd elements to create four
> v16i8 pieces.

Why v*i8 loads? I thought we had 16-bit and 32-bit types here?

> Sign extend each of those pieces. Multiply the two even pieces and
> the two odd pieces separately, and sum those results with a v8i32
> add. Then use another v8i32 add to accumulate with the previous loop
> iterations. This ensures that no piece exceeds the target vector
> width and that the final operation is correctly sized to go around
> the loop in one register. All but the last add can then be pattern
> matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636,
> and for the future CPU the whole thing can be matched to the new
> instruction.
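For concreteness, that sequence has roughly this shape in IR -- a
sketch with illustrative names, written with v16i16 loads rather than
the v32i8 mentioned above, per the question about element types:

  %va = load <16 x i16>, <16 x i16>* %pa
  %vb = load <16 x i16>, <16 x i16>* %pb
  ; split each load into its even and odd elements
  %va.e = shufflevector <16 x i16> %va, <16 x i16> undef,
          <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %va.o = shufflevector <16 x i16> %va, <16 x i16> undef,
          <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %vb.e = shufflevector <16 x i16> %vb, <16 x i16> undef,
          <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %vb.o = shufflevector <16 x i16> %vb, <16 x i16> undef,
          <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  ; sign extend and multiply the even and odd halves separately
  %sa.e  = sext <8 x i16> %va.e to <8 x i32>
  %sa.o  = sext <8 x i16> %va.o to <8 x i32>
  %sb.e  = sext <8 x i16> %vb.e to <8 x i32>
  %sb.o  = sext <8 x i16> %vb.o to <8 x i32>
  %mul.e = mul nsw <8 x i32> %sa.e, %sb.e
  %mul.o = mul nsw <8 x i32> %sa.o, %sb.o
  %sum   = add <8 x i32> %mul.e, %mul.o     ; matchable to vpmaddwd (D49636)
  %acc   = add <8 x i32> %sum, %acc.phi     ; accumulate across iterations

Everything but the final add stays at or below the target vector width.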
>
> Do other targets have a similar instruction, or a similar issue to
> this? Is this something we can solve in the loop vectorizer, or
> should we have a separate IR transformation that can recognize this
> pattern and generate the new sequence? As a separate pass, we would
> need to pair two vector loads together, remove a reduction step
> outside the loop, and remove half the phis, assuming the loop was
> partially unrolled. Or, if there were only one add/mul inside the
> loop, we'd have to reduce its width and the width of the phi.
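For reference, the partially unrolled loop such a pass would have to
rewrite carries two accumulator phis that are summed after the loop --
a sketch with illustrative names:

  vector.body:
    %acc0 = phi <8 x i32> [ zeroinitializer, %entry ], [ %add0, %vector.body ]
    %acc1 = phi <8 x i32> [ zeroinitializer, %entry ], [ %add1, %vector.body ]
    ; ...two independent load/sext/mul chains producing %mul0 and %mul1...
    %add0 = add <8 x i32> %mul0, %acc0
    %add1 = add <8 x i32> %mul1, %acc1

  middle.block:
    %rdx = add <8 x i32> %add0, %add1       ; reduction step outside the loop
    ; ...followed by the horizontal reduction of %rdx to a scalar

The rewrite would pair the two loads, turn the two mul/add chains into
the even/odd sum sketched above feeding a single accumulator, and
delete %acc1 along with the %rdx step.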

Can you explain how the desired code from the vectorizer differs from
the code that the vectorizer produces if you add '#pragma clang loop
vectorize(enable) vectorize_width(16)' above the loop? I tried it in
your godbolt example and the generated code looks very similar to the
icc-generated code.
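For what it's worth, with that pragma the loop body from the
vectorizer should have roughly this shape -- a sketch, assuming a
single v16i32 accumulator:

  %va  = load <16 x i16>, <16 x i16>* %pa
  %vb  = load <16 x i16>, <16 x i16>* %pb
  %sa  = sext <16 x i16> %va to <16 x i32>
  %sb  = sext <16 x i16> %vb to <16 x i32>
  %mul = mul nsw <16 x i32> %sa, %sb
  %add = add <16 x i32> %mul, %acc.phi

When type legalization splits those v16i32 operations in half, each
half appears wide enough for the backend to form a 256-bit vpmaddwd.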

Thanks again,
Hal

>
> Thanks,
> ~Craig

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
