[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Cong Hou via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 16 11:23:43 PST 2015


On Fri, Nov 13, 2015 at 10:03 AM, Xinliang David Li <davidxl at google.com> wrote:
> Reductions can be implemented in two ways with SIMD:
> 1) partial reduction in the loop followed by a full reduction outside the
> loop
> 2) use a full SIMD reduction instruction in the loop
>
> Having a way for the vectorizer to express patterns for 2) sounds useful.
>
> The hsum intrinsic proposed by Shahid looks like a full reduction for the
> 'ADD' operation. The question is: 1) do we need to add a new instruction
> for this? 2) do we need to generalize vector reduction to cover other
> scalar opcodes such as min/max, etc.?
>
> As regards the SAD implementation: the IR idiom for SAD involves integer
> promotion/widening, so it seems to me that what is missing is a widening
> diff operation/intrinsic. With that, the SAD pattern can be expressed by
> the vectorizer as:
>
> for (... ) {
>    v16i8  v1 = vect_load (...)
>    v16i8  v2 = vect_load (...)
>    v16i16  vdiff = vect_widen_diff(v1, v2)
>    v16i16  vabsdiff = vect_abs(vdiff)
>    i16 ss = vect_hsum(vabsdiff)
>    r += ss
>    ..
> }

I think doing a full reduction with hadd inside the loop is unnecessary:
in most cases it is less efficient than a partial reduction in the loop
plus a full reduction outside of the loop. But SSE's SAD and dot-product
instructions do horizontal adds by nature, and if we want to combine
instructions into them, we need the information that we are performing a
reduction. That is why I am proposing a reduction add IR operation.
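
For reference, the code shape we eventually want to reach on SSE2 looks
roughly like the following hand-written intrinsics sketch (assuming the
trip count is a multiple of 16); note that the loop only maintains
partial sums, and the full reduction happens after the loop:

'''
#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

uint32_t sad_sse2(const uint8_t *a, const uint8_t *b, size_t n) {
  __m128i acc = _mm_setzero_si128();    /* partial sums live here in the loop */
  for (size_t i = 0; i < n; i += 16) {
    __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
    /* psadbw: |a - b| for 16 byte pairs, horizontally summed into the low
       16 bits of each of the two 64-bit lanes. */
    acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
  }
  /* Full reduction outside the loop: add the two 64-bit lanes. */
  uint64_t lanes[2];
  _mm_storeu_si128((__m128i *)lanes, acc);
  return (uint32_t)(lanes[0] + lanes[1]);
}
'''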

Cong

>
> David
>
>
> On Thu, Nov 12, 2015 at 4:16 PM, Cong Hou <congh at google.com> wrote:
>>
>> Hi
>>
>> When a reduction instruction is vectorized in a loop, it is turned into
>> an instruction of the same operation type but with vector operands.
>> This new instruction has a special property that gives us more
>> flexibility during instruction selection later: the operation is valid
>> as long as the reduction of all elements of the result vector is
>> identical to the reduction of all elements of its operands.
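>>
>> To make the property concrete, here is a minimal C sketch of the
>> transformation (the function names and the vectorization factor of 4
>> are illustrative only, not actual vectorizer output):
>>
>> '''
>> #include <stdint.h>
>>
>> /* Scalar reduction. */
>> int32_t sum_scalar(const int32_t *a, int n) {
>>   int32_t s = 0;
>>   for (int i = 0; i < n; ++i)
>>     s += a[i];
>>   return s;
>> }
>>
>> /* Vectorized form (VF = 4; n assumed to be a multiple of 4): the loop
>>    keeps a vector of partial sums. Replacing the "vector add" below by
>>    any other operation is legal as long as the sum of the elements of
>>    vsum is preserved. */
>> int32_t sum_vectorized(const int32_t *a, int n) {
>>   int32_t vsum[4] = {0, 0, 0, 0};
>>   for (int i = 0; i < n; i += 4)
>>     for (int j = 0; j < 4; ++j)
>>       vsum[j] += a[i + j];                      /* reduction add on a v4i32 */
>>   return vsum[0] + vsum[1] + vsum[2] + vsum[3]; /* full reduction after the loop */
>> }
>> '''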
>>
>> One example that can benefit from this property is SAD (sum of
>> absolute differences) pattern detection targeting SSE2, which provides
>> a psadbw instruction whose description is shown below:
>>
>> '''
>> psadbw: Compute the absolute differences of packed unsigned 8-bit
>> integers in a and b, then horizontally sum each consecutive 8
>> differences to produce two unsigned 16-bit integers, and pack these
>> unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
>> '''
>>
>> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and one
>> v4i32 as output. However, psadbw actually produces one i32 result for
>> four pairs of 8-bit integers (an already reduced result), and that
>> result is stored in the first element of the v4i32. If we properly
>> zero out the other three elements of the v4i32, and we know that a
>> reduction add is performed on this result, then we can safely use
>> psadbw here for much better performance. This can be done during DAG
>> combine. Another similar example is dot product. I think there may be
>> many other scenarios that can benefit from this property, such as
>> eliminating redundant shuffles.
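>>
>> To illustrate why this is safe, here is a hand-written C model of one
>> vectorized iteration (not actual generated code): the element-wise
>> contents of the accumulator differ between the two forms, but the
>> reduction over all its elements is identical.
>>
>> '''
>> #include <stdint.h>
>> #include <stdlib.h>
>>
>> /* Plain vectorized form: add the four widened absolute differences
>>    element-wise into the v4i32 accumulator. */
>> static void step_plain(int32_t acc[4], const uint8_t a[4], const uint8_t b[4]) {
>>   for (int j = 0; j < 4; ++j)
>>     acc[j] += abs((int)a[j] - (int)b[j]);
>> }
>>
>> /* psadbw-style form: the absolute differences are already horizontally
>>    summed; the reduced result goes into element 0 and the other three
>>    elements receive zero. acc[0]+acc[1]+acc[2]+acc[3] ends up the same
>>    as with step_plain, which is all a reduction add cares about. */
>> static void step_psadbw_like(int32_t acc[4], const uint8_t a[4], const uint8_t b[4]) {
>>   int32_t hsum = 0;
>>   for (int j = 0; j < 4; ++j)
>>     hsum += abs((int)a[j] - (int)b[j]);
>>   acc[0] += hsum;
>> }
>> '''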
>>
>> The question is, how to let DAG combiner know that a vector operation
>> is a reduction one?
>>
>> Here I propose to introduce a "reduction add" instruction for vectors.
>> This will be a new instruction with vector operands only. By default it
>> is treated as a normal ADD operation, but the selection DAG combiner
>> can make use of this new operation to generate better instructions.
>> This new instruction is generated by the loop vectorizer when it
>> vectorizes a reduction add.
>>
>> I would like to hear more comments on this proposal or suggestions of
>> better alternative implementations.
>>
>>
>> thanks,
>> Cong
>
>

