[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Cong Hou via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 19 22:21:05 PST 2015


On Thu, Nov 19, 2015 at 10:09 PM, Xinliang David Li <davidxl at google.com> wrote:
> This sounds like the right approach and the result is very promising. The VF
> tuning patch is already in but just not enabled by default, right?

Yes, but the cost model is not ready yet; it depends on the patch
that optimizes promotion/demotion between vectors of i8 and i32
(http://reviews.llvm.org/D14588). Once that patch is checked in, I
will update the cost model to let bigger VFs be chosen.


Cong

>
> David
>
> On Thu, Nov 19, 2015 at 1:12 PM, Cong Hou <congh at google.com> wrote:
>>
>> After some attempts to implement reduce-add in LLVM, I found an
>> easier way to detect reduce-add without introducing new IR operations.
>> The basic idea is to annotate the phi node instead of the add (which
>> makes it easier to handle other reduction operations later). In the
>> PHINode class, we can add a flag indicating whether the phi node is a
>> reduction one (the flag can be set in the loop vectorizer for
>> vectorized phi nodes). Then, when we build SDNodes for instruction
>> selection, we detect those reduction phi nodes and annotate the
>> reduction operations they feed. This requires an additional flag in
>> SDNodeFlags, which we can then check when combining instructions to
>> detect reduction operations.
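>>
>> A rough sketch of the three pieces (the flag and helper names below,
>> such as setIsReduction, setVectorReduction and combineToSAD, are
>> placeholders for illustration, not existing LLVM APIs; exact
>> signatures may differ from the final patch):
>>
>> // 1. Loop vectorizer: mark the widened reduction phi.
>> PHINode *VecPhi = PHINode::Create(VecTy, 2, "vec.phi");
>> VecPhi->setIsReduction(true);    // hypothetical new PHINode flag
>>
>> // 2. SelectionDAGBuilder: when lowering the add that feeds such a
>> //    phi, propagate the mark into the node flags.
>> SDNodeFlags Flags;
>> Flags.setVectorReduction(true);  // hypothetical new SDNodeFlags bit
>> SDValue Add = DAG.getNode(ISD::ADD, DL, VT, LHS, RHS, &Flags);
>>
>> // 3. DAGCombiner: the flag gates the SAD pattern match.
>> if (N->getFlags().hasVectorReduction())
>>   if (SDValue SAD = combineToSAD(N, DAG))  // hypothetical combine
>>     return SAD;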
>>
>> With this approach, I have managed to get LLVM to compile a SAD loop
>> into psadbw instructions.
>>
>> Source code:
>>
>>
>> const int N = 1024;
>> unsigned char a[N], b[N];
>>
>> int sad() {
>>   int s = 0;
>>   for (int i = 0; i < N; ++i) {
>>     int res = a[i] - b[i];
>>     s += (res > 0) ? res : -res;
>>   }
>>   return s;
>> }
>>
>>
>> Emitted instructions on X86:
>>
>>
>>
>> # BB#0:                                 # %entry
>> pxor %xmm0, %xmm0
>> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
>> pxor %xmm1, %xmm1
>> .align 16, 0x90
>> .LBB0_1:                                # %vector.body
>>                                         # =>This Inner Loop Header: Depth=1
>> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm0
>> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
>> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
>> psadbw %xmm2, %xmm3
>> paddd %xmm3, %xmm1
>> addq $8, %rax
>> jne .LBB0_1
>> # BB#2:                                 # %middle.block
>> paddd %xmm0, %xmm1
>> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
>> paddd %xmm1, %xmm0
>> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
>> paddd %xmm0, %xmm1
>> movd %xmm1, %eax
>> retq
>>
>>
>> Note that due to the small VF we are using now (currently 4), we
>> cannot get the full benefit of psadbw. The patch in
>> http://reviews.llvm.org/D8943 enables us to use bigger VFs based on
>> the smallest type in a loop. The follow-up work is refining the cost
>> model so that bigger VFs get lower costs. For the example above the
>> best result comes from VF >= 16, since psadbw consumes 16 byte pairs
>> at a time.
>>
>> The draft of the patch is here: http://reviews.llvm.org/D14840
>>
>> I will refine the patch later and submit it for review.
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com> wrote:
>> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
>> > <Asghar-ahmad.Shahid at amd.com> wrote:
>> >> Hi Cong,
>> >>
>> >>> -----Original Message-----
>> >>> From: Cong Hou [mailto:congh at google.com]
>> >>> Sent: Tuesday, November 17, 2015 12:47 AM
>> >>> To: Shahid, Asghar-ahmad
>> >>> Cc: David Li
>> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
>> >>> instruction.
>> >>>
>> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
>> >>> <Asghar-ahmad.Shahid at amd.com> wrote:
>> >>> > Hi Cong,
>> >>> >
>> >>> > We had proposed an intrinsic approach to do this. However, the
>> >>> > discussion reached a point where it was asked: "Why do we need
>> >>> > another approach if 'reduction add' can be pattern matched in
>> >>> > DAGCombine?" Still, I feel that if we have a strong enough
>> >>> > rationale for introducing this instruction, it would be great. The
>> >>> > first link below has the complete discussion about the intrinsic
>> >>> > approach.
>> >>>
>> >>> Yes, I think introducing such a reduction add lets us do pattern
>> >>> recognition of SAD, dot product (or more?) without introducing any
>> >>> additional intrinsics.
>> >> I agree. Another use case could be the POPCOUNT operation. Moreover,
>> >> as 'reduction add' is being adopted by more targets nowadays,
>> >> reflecting that in LLVM IR as an instruction is a good idea.
>> >> BTW, what syntax and semantics do you have in mind for this
>> >> operation?
>> >
>> > We can introduce a reduce-add for vectors only, or make it general so
>> > that it can also accept scalar operands. Normally it is identical to a
>> > normal add, but during instruction selection we can do pattern
>> > recognition based on the extra information this new operation provides.
>> > For vectors, the operation only guarantees that the sum of all
>> > elements in the result is identical to the sum of all elements of its
>> > operands; individual lanes may hold any values. This gives us enough
>> > freedom to do aggressive transformations, such as SAD or dot product.
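>> >
>> > As a concrete illustration of the relaxed semantics (a standalone
>> > sketch, not LLVM code), both results below would be legal for a
>> > vector reduce-add, because only the horizontal sum is observable:
>> >
>> > #include <array>
>> > #include <cassert>
>> > #include <numeric>
>> >
>> > int main() {
>> >   std::array<int, 4> a{1, 2, 3, 4}, b{5, 6, 7, 8};
>> >   // A lane-wise add is one legal result...
>> >   std::array<int, 4> lanewise{6, 8, 10, 12};
>> >   // ...but so is any vector with the same horizontal sum, e.g. the
>> >   // already-reduced form a psadbw-style lowering would produce.
>> >   std::array<int, 4> reduced{36, 0, 0, 0};
>> >   int want = std::accumulate(a.begin(), a.end(), 0) +
>> >              std::accumulate(b.begin(), b.end(), 0);
>> >   assert(std::accumulate(lanewise.begin(), lanewise.end(), 0) == want);
>> >   assert(std::accumulate(reduced.begin(), reduced.end(), 0) == want);
>> > }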
>> >
>> > Do you think this is enough for us to get there?
>> >
>> >
>> > Cong
>> >
>> >>
>> >>> The main concern may be the cost model: we cannot guarantee that a
>> >>> SAD loop is unrolled 16 times on SSE to get the most benefit from
>> >>> SAD. With the patch http://reviews.llvm.org/D8943 landed, I am now
>> >>> working on improving the cost models of type conversions on X86. I
>> >>> believe that even without the SAD instruction we can still get
>> >>> better performance by unrolling a SAD loop 16 times. This seems
>> >>> tricky but it works. What do you think?
>> >> I think the same: we can increase the vectorization bandwidth with
>> >> proper cost modeling.
>> >>
>> >> Regards,
>> >> Shahid
>> >>>
>> >>> Cong
>> >>>
>> >>> >
>> >>> > http://reviews.llvm.org/D10964
>> >>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>> >>> >
>> >>> > Regards,
>> >>> > Shahid
>> >>> >
>> >>> >
>> >>> >
>> >>> >> -----Original Message-----
>> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
>> >>> >> Of
>> >>> >> Cong Hou via llvm-dev
>> >>> >> Sent: Friday, November 13, 2015 5:47 AM
>> >>> >> To: llvm-dev at lists.llvm.org
>> >>> >> Cc: David Li
>> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add
>> >>> >> instruction.
>> >>> >>
>> >>> >> Hi
>> >>> >>
>> >>> >> When a reduction instruction in a loop is vectorized, it is turned
>> >>> >> into an instruction of the same operation type with vector
>> >>> >> operands. This new instruction has a special property that can
>> >>> >> give us more flexibility during instruction selection later: the
>> >>> >> operation is valid as long as the reduction of all elements of the
>> >>> >> result vector is identical to the reduction of all elements of its
>> >>> >> operands.
>> >>> >>
>> >>> >> One example that can benefit from this property is SAD (sum of
>> >>> >> absolute differences) pattern detection on SSE2, which provides a
>> >>> >> psadbw instruction whose description is shown below:
>> >>> >>
>> >>> >> '''
>> >>> >> psadbw: Compute the absolute differences of packed unsigned 8-bit
>> >>> >> integers in a and b, then horizontally sum each consecutive 8
>> >>> >> differences to produce two unsigned 16-bit integers, and pack these
>> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements in
>> >>> >> dst.
>> >>> >> '''
>> >>> >>
>> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 vectors as
>> >>> >> inputs and one v4i32 vector as output. However, psadbw actually
>> >>> >> produces one i32 result for four pairs of 8-bit integers (an
>> >>> >> already reduced result), and that result is stored in the first
>> >>> >> element of the v4i32. If we properly zero out the other three
>> >>> >> elements of the v4i32, and we know that a reduction add is
>> >>> >> performed on this result, then we can safely use psadbw here for
>> >>> >> much better performance. This can be done during DAG combine.
>> >>> >> Another similar example is dot product. And I think there may be
>> >>> >> many other scenarios that can benefit from this property, such as
>> >>> >> eliminating redundant shuffles.
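>> >>> >>
>> >>> >> For reference, here is a minimal sketch of the horizontal-sum
>> >>> >> behavior psadbw gives us, written with the SSE2 intrinsic
>> >>> >> _mm_sad_epu8 (the helper name sad16 is just for illustration):
>> >>> >>
>> >>> >> #include <emmintrin.h>
>> >>> >> #include <stdint.h>
>> >>> >>
>> >>> >> // SAD of 16 byte pairs with a single psadbw: the instruction
>> >>> >> // leaves two partial sums in the low 16 bits of the two 64-bit
>> >>> >> // lanes of the result.
>> >>> >> static inline uint32_t sad16(const uint8_t *a, const uint8_t *b) {
>> >>> >>   __m128i va  = _mm_loadu_si128((const __m128i *)a);
>> >>> >>   __m128i vb  = _mm_loadu_si128((const __m128i *)b);
>> >>> >>   __m128i sad = _mm_sad_epu8(va, vb);
>> >>> >>   // Fold the high 64-bit lane into the low one and extract.
>> >>> >>   __m128i hi  = _mm_unpackhi_epi64(sad, sad);
>> >>> >>   return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi64(sad, hi));
>> >>> >> }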
>> >>> >>
>> >>> >> The question is, how to let DAG combiner know that a vector
>> >>> >> operation
>> >>> >> is a reduction one?
>> >>> >>
>> >>> >> Here I propose to introduce a "reduction add" instruction for
>> >>> >> vectors. This will be a new instruction with vector operands only.
>> >>> >> Normally it is treated as a normal ADD operation, but the
>> >>> >> SelectionDAG combiner can make use of this new operation to
>> >>> >> generate better instructions. This new instruction is generated
>> >>> >> when the loop vectorizer vectorizes a reduction add.
>> >>> >>
>> >>> >> I would like to hear more comments on this proposal or suggestions
>> >>> >> for better alternative implementations.
>> >>> >>
>> >>> >>
>> >>> >> thanks,
>> >>> >> Cong
>> >>> >> _______________________________________________
>> >>> >> LLVM Developers mailing list
>> >>> >> llvm-dev at lists.llvm.org
>> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>