[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Thu Nov 19 21:20:00 PST 2015

Hi Cong,

You hit the right place.Seems I could not get your question regarding this earlier.
This is what we did to find reduction pattern at IR level in loop vectorizer.
Adding SDNode flag seems to be the right thing to do.

>For the example above the best result is from VF >=16.

That is right. I also observed that for constant loop bound <16, the pattern gets unrolled
To straight line code. This may be due to the cost-model inefficiency. So after the improvement
Of cost-model this case would also be realized.

Regards,
Shahid

> -----Original Message-----
> From: Cong Hou [mailto:congh at google.com]
> Sent: Friday, November 20, 2015 2:42 AM
> To: llvm-dev
> Cc: David Li; Shahid, Asghar-ahmad
> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
> 
> After some attempt to implement reduce-add in LLVM, I found out a easier
> way to detect reduce-add without introducing new IR operations.
> The basic idea is annotating phi node instead of add (so that it is easier to
> handle other reduction operations). In PHINode class, we can add a flag
> indicating if the phi node is a reduction one (the flag can be set in loop
> vectorizer for vectorized phi nodes). Then when we build SDNode for
> instruction selection, we detect those reduction phi nodes and then
> annotate reduction operations. This requires an additional flag in
> SDNodeFlags. We can then check this flag when combining instructions to
> detect reduction operations.
> 
> In this approach, I have managed to let LLVM compile a SAD loop into psadbw
> instructions.
> 
> Source code:
> 
> 
> const int N = 1024;
> unsigned char a[N], b[N];
> 
> int sad() {
>   int s = 0;
>   for (int i = 0; i < N; ++i) {
>     int res = a[i] - b[i];
>     s += (res > 0) ? res : -res;
>   }
>   return s;
> }
> 
> 
> Emitted instructions on X86:
> 
> 
> 
> # BB#0:                                 # %entry
> pxor %xmm0, %xmm0
> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> pxor %xmm1, %xmm1
> .align 16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm0
> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm1
> addq $8, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> paddd %xmm0, %xmm1
> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> paddd %xmm1, %xmm0
> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> paddd %xmm0, %xmm1
> movd %xmm1, %eax
> retq
> 
> 
> Note that due to smaller VF we are using now (currently 4), we could not
> explore the most benefit of psadbw. The patch in
> http://reviews.llvm.org/D8943 has enables us to use bigger VFs based on the
> smallest type in a loop. The follow-up work is refining the cost model to let
> bigger VFs have less cost. For the example above the best result is from VF
> >=16.
> 
> The draft of the patch is here: http://reviews.llvm.org/D14840
> 
> I will refine the patch later and submit it for review.
> 
> 
> thanks,
> Cong
> 
> 
> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com> wrote:
> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> Hi Cong,
> >>
> >>> -----Original Message-----
> >>> From: Cong Hou [mailto:congh at google.com]
> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >>> To: Shahid, Asghar-ahmad
> >>> Cc: David Li
> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
> instruction.
> >>>
> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-
> >>> ahmad.Shahid at amd.com> wrote:
> >>> > Hi Cong,
> >>> >
> >>> > We had proposed an intrinsic approach to do this. However the
> >>> > discussion reached to a point where it was asked that "Why do we
> >>> > need another approach if "reduction add" can be pattern matched in
> >>> DAGCombine?"
> >>> > However I feel if we have strong enough rationale for introduction
> >>> > of this instruction, it would be great. The 1st link below has the
> >>> > complete
> >>> discussion about the intrinsic approach.
> >>>
> >>> Yes, I think introducing such a reduction add can let us do pattern
> >>> recognition of either SAD or dot production (or more?) without
> >>> introducing any additional intrinsics.
> >> I agree. Another use case could be POPCOUNT operation. Moreover, as
> 'reduction add'
> >> Is being adopted by more targets now a days, reflecting that in LLVM
> >> IR as an instruction Is a good idea.
> >> BTW, what is the idea of the syntax and semantic of this operation you
> have?
> >
> > We can introduce a reduce-add for vectors only, or make it general so
> > that it could also accept scale operands. Normally it is identical to
> > normal add, but during instruction selection we can do pattern
> > recognition based on more information provided by this new operations.
> > For vectors, this means the result of this operation only guarantee
> > that the sum of all elements in the result is identical to the sum of
> > all elements of its operands. This gives us enough freedom to do
> > aggressive transformations, such as SAD or dot-product.
> >
> > Do you think if this is enough for us to get there?
> >
> >
> > Cong
> >
> >>
> >> The main concern may be cost
> >>> model: we could not guarantee that a SAD loop is unrolled 16 times
> >>> on SSE to make use the most benefit of SAD. After the patch
> >>> http://reviews.llvm.org/D8943 is landed, I am now working on
> >>> improving cost models of type conversions on X86. I believe even
> >>> without SAD instruction we can still get better performance with
> >>> unrolling a SAD loop 16 times. This seems tricky but it works. What do
> you think?
> >> I also think same as we can increase the bandwidth with proper cost
> modeling.
> >>
> >> Regards,
> >> Shahid
> >>>
> >>> Cong
> >>>
> >>> >
> >>> > http://reviews.llvm.org/D10964
> >>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >>> >
> >>> > Regards,
> >>> > Shahid
> >>> >
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
> >>> >> Of Cong Hou via llvm-dev
> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >>> >> To: llvm-dev at lists.llvm.org
> >>> >> Cc: David Li
> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add
> instruction.
> >>> >>
> >>> >> Hi
> >>> >>
> >>> >> When a reduction instruction is vectorized in a loop, it will be
> >>> >> turned into an instruction with vector operands of the same
> >>> >> operation type. This new instruction has a special property that
> >>> >> can give us more flexibility during instruction selection later:
> >>> >> this operation is valid as long as the reduction of all elements
> >>> >> of the result vector is identical to the reduction of all elements of its
> operands.
> >>> >>
> >>> >> One example that can benefit this property is SAD (sum of
> >>> >> absolute
> >>> >> differences) pattern detection in SSE2, which provides a psadbw
> >>> >> instruction whose description is shown below:
> >>> >>
> >>> >> '''
> >>> >> psadbw: Compute the absolute differences of packed unsigned 8-bit
> >>> >> integers in a and b, then horizontally sum each consecutive 8
> >>> >> differences to produce two unsigned 16-bit integers, and pack
> >>> >> these unsigned 16-bit integers in the low 16 bits of 64-bit elements in
> dst.
> >>> >> '''
> >>> >>
> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as inputs and
> >>> >> one
> >>> >> v4i32 as output. However, psadbw will actually produce one i32
> >>> >> result for four pairs of 8-bit integers (an already reduced
> >>> >> result), and the result is stored in the first element in v4i32.
> >>> >> If we properly zero out the other three elements in v4i32, and
> >>> >> with the information that we have a reduction add that is
> >>> >> performed on this result, then we can safely use psadbw here for
> >>> >> much better performance. This can be done during DAG combine.
> >>> >> Another similar example is dot product. And I think there may be
> >>> >> many other scenarios that can benefit from this property like
> eliminating redundant shuffles.
> >>> >>
> >>> >> The question is, how to let DAG combiner know that a vector
> >>> >> operation is a reduction one?
> >>> >>
> >>> >> Here I propose to introduce a "reduction add" instruction for vectors.
> >>> >> This will be a new instruction with vector operands only.
> >>> >> Normally it is treated as a normal ADD operation, but the
> >>> >> selection DAG combiner can make use of this new operation to
> generate better instructions.
> >>> >> This new instruction is generated when vectorizing reduction add
> >>> >> in loop vectorizer.
> >>> >>
> >>> >> I would like to hear more comments on this proposal or
> >>> >> suggestions of better alternative implementations.
> >>> >>
> >>> >>
> >>> >> thanks,
> >>> >> Cong
> >>> >> _______________________________________________
> >>> >> LLVM Developers mailing list
> >>> >> llvm-dev at lists.llvm.org
> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev