[llvm-dev] [LoopVectorizer] Improving the performance of dot product reduction loop

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 23 15:22:27 PDT 2018

Hello all,

This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
multiplying sign-extended 16-bit values to produce a 32-bit accumulated
result. The x86 backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has several v8i32
adds and muls inside the loop. These are fed by v8i16 loads and sexts from
v8i16 to v8i32. The x86 backend recognizes that these are addition
reductions of multiplication so we use the vpmaddwd instruction which
calculates 32-bit products from 16-bit inputs and does a horizontal add of
adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32
result.
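
For reference, the source loop in question looks roughly like the
following C (a sketch; the exact godbolt example may differ in names and
details):

```c
#include <stdint.h>

/* Dot product reduction: sign-extend 16-bit inputs, multiply,
   and accumulate into a 32-bit sum. */
int32_t dot_product(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```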

In the example code, because we are reducing the number of elements from
8->4 in the vpmaddwd step we are left with a width mismatch between
vpmaddwd and the vpaddd instruction that we use to sum with the results
from the previous loop iterations. We rely on the fact that a 128-bit
vpmaddwd zeros the upper bits of the register, so we can use a 256-bit
vpaddd instruction and let the upper elements keep going around the
loop undisturbed in case they weren't initialized to 0. But
this still means the vpmaddwd instruction is doing half the amount of work
the CPU is capable of if we had been able to use a 256-bit vpmaddwd
instruction. Additionally, future x86 CPUs will be gaining an instruction
that can do VPMADDWD and VPADDD in one instruction, but that width mismatch
makes that instruction difficult to utilize.
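
As a scalar model of what vpmaddwd computes (for illustration only, not
the backend's representation): each 32-bit output lane is the sum of two
adjacent 16-bit products, which is why an 8-element 16-bit input yields
only a 4-element 32-bit result:

```c
#include <stdint.h>

/* Model of one 128-bit vpmaddwd: 8 x i16 per input, 4 x i32 out.
   out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], with the products
   computed at 32 bits so they cannot overflow at 16 bits. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i + 1] * b[2*i + 1];
}
```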

In order for the backend to handle this better it would be great if we
could have something like two v16i16 loads, two shufflevectors to extract
the even elements and the odd elements to create four v8i16 pieces. Sign
extend each of those pieces to v8i32. Multiply the two even pieces and the
two odd pieces separately, and sum those results with a v8i32 add. Then
another v8i32 add accumulates the results from previous loop iterations.
This ensures that no piece exceeds the target vector width and that the
final operation is correctly sized to go around the loop in one register.
All but the last add can then be
pattern matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636.
And for the future CPU the whole thing can be matched to the new
instruction.
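
In scalar C terms, the even/odd decomposition would compute the same
reduction as follows (a sketch assuming a trip count that is a multiple
of 16 elements; the names and the 8-lane accumulator are illustrative,
modeling the v8i32 pieces described above):

```c
#include <stdint.h>

/* Even/odd decomposition of the dot product: after the split, every
   operation works on the same number of lanes (8 here), mirroring the
   v8i16 -> v8i32 pieces, with no width mismatch inside the loop. */
int32_t dot_product_evenodd(const int16_t *a, const int16_t *b, int n) {
    int32_t acc[8] = {0};               /* models the v8i32 accumulator phi */
    for (int i = 0; i < n; i += 16) {   /* two "v16i16 loads" per iteration */
        for (int j = 0; j < 8; j++) {
            int32_t even = (int32_t)a[i + 2*j]     * b[i + 2*j];
            int32_t odd  = (int32_t)a[i + 2*j + 1] * b[i + 2*j + 1];
            acc[j] += even + odd;       /* the two v8i32 adds */
        }
    }
    int32_t sum = 0;                    /* final reduction outside the loop */
    for (int j = 0; j < 8; j++)
        sum += acc[j];
    return sum;
}
```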

Do other targets have a similar instruction or a similar issue to this? Is
this something we can solve in the loop vectorizer? Or should we have a
separate IR transformation that can recognize this pattern and generate the
new sequence? As a separate pass we would need to pair two vector loads
together, remove a reduction step outside the loop and remove half the phis
assuming the loop was partially unrolled. Or if there was only one add/mul
inside the loop we'd have to reduce its width and the width of the phi.
