[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Fri Nov 13 10:03:44 PST 2015

Reductions can be implemented in two ways with SIMD
1) partial reduction in the loop followed by a full reduction outside the
loop
2) use full SIMD reduction instruction in the loop

Having a way for the vectorizer to express patterns for 2) sounds useful.

The hsum intrinsic proposed by Shahid looks like a full reduction for 'ADD'
operation. The question is 1) do we need to add new instruction for this?
2) do we need to generalize vector reduction to cover other scalar opcodes
such as min/max etc?

As regards SAD implementation: the IR idiom for SAD involves integer
promotion/widening, so it seems to me that what is missing is a widening
diff operation/intrinsic. With that, the SAD pattern can be expressed by
vectorizor as :

for (... ) {
   v16i8  v1 = vect_load (...)
   v16i8  v2 = vect_load (...)
   v16i16  vdiff = vect_widen_diff(v1, v2)
   v16i16  vabsdiff = vect_abs(vdiff)
   i16 ss = vect_hsum(vabsdiff)
   r += ss
   ..
}

David

On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com> wrote:

> Hi
>
> When a reduction instruction is vectorized in a loop, it will be
> turned into an instruction with vector operands of the same operation
> type. This new instruction has a special property that can give us
> more flexibility during instruction selection later: this operation is
> valid as long as the reduction of all elements of the result vector is
> identical to the reduction of all elements of its operands.
>
> One example that can benefit this property is SAD (sum of absolute
> differences) pattern detection in SSE2, which provides a psadbw
> instruction whose description is shown below:
>
> '''
> psadbw: Compute the absolute differences of packed unsigned 8-bit
> integers in a and b, then horizontally sum each consecutive 8
> differences to produce two unsigned 16-bit integers, and pack these
> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
> '''
>
> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one
> v4i32 as output. However, psadbw will actually produce one i32 result
> for four pairs of 8-bit integers (an already reduced result), and the
> result is stored in the first element in v4i32. If we properly zero
> out the other three elements in v4i32, and with the information that
> we have a reduction add that is performed on this result, then we can
> safely use psadbw here for much better performance. This can be done
> during DAG combine. Another similar example is dot product. And I
> think there may be many other scenarios that can benefit from this
> property like eliminating redundant shuffles.
>
> The question is, how to let DAG combiner know that a vector operation
> is a reduction one?
>
> Here I propose to introduce a "reduction add" instruction for vectors.
> This will be a new instruction with vector operands only. Normally it
> is treated as a normal ADD operation, but the selection DAG combiner
> can make use of this new operation to generate better instructions.
> This new instruction is generated when vectorizing reduction add in
> loop vectorizer.
>
> I would like to hear more comments on this proposal or suggestions of
> better alternative implementations.
>
>
> thanks,
> Cong
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151113/7870dc3e/attachment.html>