[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Thu Nov 12 16:16:45 PST 2015

Hi

When a reduction operation in a loop is vectorized, it is turned into
an instruction of the same operation type with vector operands. This
new instruction has a special property that gives us more flexibility
during instruction selection later: any lowering of it is valid as
long as the reduction of all elements of the result vector is
identical to the reduction of all elements of its operands.
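
To make the property concrete, here is a minimal sketch in Python (a
model of the semantics only, not LLVM code). The function names are
made up for illustration, and lanes are modeled as unbounded Python
integers, so wraparound is ignored. It shows two different result
vectors for the same "reduction add" that are both acceptable, because
their horizontal sums agree.

```python
def horizontal_sum(v):
    """Reduce all lanes of a vector to one scalar."""
    return sum(v)

def vector_add(a, b):
    """Ordinary lane-wise vector add."""
    return [x + y for x, y in zip(a, b)]

def folded_add(a, b):
    """An alternative lowering: put the whole sum in lane 0, zero the rest."""
    return [sum(a) + sum(b)] + [0] * (len(a) - 1)

acc = [1, 2, 3, 4]
val = [10, 20, 30, 40]

plain = vector_add(acc, val)   # [11, 22, 33, 44]
folded = folded_add(acc, val)  # [110, 0, 0, 0]

# The lanes differ, but the final reduction is the same.
print(horizontal_sum(plain), horizontal_sum(folded))  # prints "110 110"
```

Only the final reduced value is observable after the loop, which is
why either vector would be a legal result.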

One example that can benefit from this property is SAD (sum of
absolute differences) pattern detection on SSE2, which provides a
psadbw instruction whose description is shown below:

'''
psadbw: Compute the absolute differences of packed unsigned 8-bit
integers in a and b, then horizontally sum each consecutive 8
differences to produce two unsigned 16-bit integers, and pack these
unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
'''

In LLVM IR, a SAD loop will have two v4i8 values as inputs and one
v4i32 as output. However, psadbw actually produces one i32 result
for four pairs of 8-bit integers (an already reduced result), and the
result is stored in the first element of the v4i32. If we properly
zero out the other three elements of the v4i32, and we know that a
reduction add is performed on this result, then we can safely use
psadbw here for much better performance. This can be done during DAG
combine. Dot product is a similar example, and I think many other
scenarios could benefit from this property, such as eliminating
redundant shuffles.
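
The SAD case can be sketched the same way. The Python model below
follows the psadbw description quoted above (two groups of 8 bytes,
one sum per 64-bit element). The helper names and the zero padding of
four byte pairs up to a full 16-byte register are assumptions made for
illustration, not a description of what the backend would emit.

```python
def psadbw(a, b):
    """Model of SSE2 psadbw on two 16-byte inputs; returns two 64-bit lanes."""
    assert len(a) == len(b) == 16
    assert all(0 <= x <= 255 for x in list(a) + list(b))
    return [
        sum(abs(x - y) for x, y in zip(a[i:i + 8], b[i:i + 8]))
        for i in (0, 8)
    ]

def sad_lanewise(a4, b4):
    """What the vectorized IR computes: one absolute difference per lane."""
    return [abs(x - y) for x, y in zip(a4, b4)]

def sad_psadbw(a4, b4):
    """Same step via psadbw: reduced sum in lane 0, other lanes zeroed."""
    pad = [0] * 12  # assumed zero padding up to 16 bytes
    lanes = psadbw(list(a4) + pad, list(b4) + pad)
    return [lanes[0], 0, 0, 0]

a4 = [10, 200, 3, 40]
b4 = [20, 100, 5, 0]

print(sad_lanewise(a4, b4))  # [10, 100, 2, 40]
print(sad_psadbw(a4, b4))    # [152, 0, 0, 0]
```

Both vectors sum to 152, so adding either one into the accumulator
gives the same value once the accumulator is reduced after the loop.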

The question is: how do we let the DAG combiner know that a vector
operation is a reduction?

Here I propose introducing a "reduction add" instruction for vectors.
This will be a new instruction with vector operands only. Normally it
is treated as an ordinary ADD operation, but the SelectionDAG combiner
can make use of it to generate better instructions. The loop
vectorizer generates this new instruction when it vectorizes a
reduction add.
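
As a rough illustration of how a combiner could use such a marker,
here is a toy pattern matcher in Python. Everything in it is
hypothetical: the Node type, the opcode strings ("reduction_add",
"sad_lane0", and so on) and the combine function are invented for this
sketch and do not correspond to existing SelectionDAG APIs. The point
is only that the rewrite fires when the root is a reduction add and is
skipped for a plain add.

```python
from collections import namedtuple

# Toy expression node: an opcode string plus a tuple of operands.
Node = namedtuple("Node", ["op", "args"])

def match_abs_diff(n):
    """Match abs(sub(zext(a), zext(b))) and return (a, b), else None."""
    if not (isinstance(n, Node) and n.op == "abs"):
        return None
    (sub,) = n.args
    if not (isinstance(sub, Node) and sub.op == "sub"):
        return None
    lhs, rhs = sub.args
    if all(isinstance(x, Node) and x.op == "zext" for x in (lhs, rhs)):
        return lhs.args[0], rhs.args[0]
    return None

def combine(n):
    """Rewrite to a SAD node only when the root is a reduction add."""
    if n.op != "reduction_add":
        return n  # a plain add must keep its lane-wise result
    acc, val = n.args
    operands = match_abs_diff(val)
    if operands is None:
        return n
    return Node("reduction_add", (acc, Node("sad_lane0", operands)))

def sad_pattern(root_op):
    """Build acc + abs(zext(a) - zext(b)) with the given root opcode."""
    diff = Node("sub", (Node("zext", ("a",)), Node("zext", ("b",))))
    return Node(root_op, ("acc", Node("abs", (diff,))))

rewritten = combine(sad_pattern("reduction_add"))
untouched = combine(sad_pattern("add"))
```

Here rewritten has a sad_lane0 operand in place of the abs/sub/zext
chain, while untouched is identical to its input.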

I would like to hear more comments on this proposal or suggestions of
better alternative implementations.

thanks,
Cong