[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 25 14:32:07 PST 2015


Hi Cong,

After reading the original RFC and this update, I'm still not entirely sure I understand the semantics of the flag you're proposing to add. Does it have something to do with the ordering of the reduction operations?

Thanks again,
Hal

----- Original Message -----
> From: "Cong Hou via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "llvm-dev" <llvm-dev at lists.llvm.org>
> Cc: "David Li" <davidxl at google.com>
> Sent: Thursday, November 19, 2015 3:12:10 PM
> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
> 
> After some attempts to implement reduce-add in LLVM, I found an
> easier way to detect reduce-add without introducing new IR
> operations. The basic idea is to annotate the phi node instead of
> the add (so that it is easier to handle other reduction operations).
> In the PHINode class, we can add a flag indicating whether the phi
> node is a reduction one (the flag can be set in the loop vectorizer
> for vectorized phi nodes). Then, when we build SDNodes for
> instruction selection, we detect those reduction phi nodes and
> annotate the reduction operations. This requires an additional flag
> in SDNodeFlags. We can then check this flag when combining
> instructions to detect reduction operations.
> 
> With this approach, I have managed to get LLVM to compile a SAD loop
> into psadbw instructions.
> 
> Source code:
> 
> 
> const int N = 1024;
> unsigned char a[N], b[N];
> 
> int sad() {
>   int s = 0;
>   for (int i = 0; i < N; ++i) {
>     int res = a[i] - b[i];
>     s += (res > 0) ? res : -res;
>   }
>   return s;
> }
> 
> 
> Emitted instructions on X86:
> 
> 
> 
> # BB#0:                                 # %entry
> pxor %xmm0, %xmm0
> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> pxor %xmm1, %xmm1
> .align 16, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header:
>                                         Depth=1
> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm0
> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> psadbw %xmm2, %xmm3
> paddd %xmm3, %xmm1
> addq $8, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> paddd %xmm0, %xmm1
> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> paddd %xmm1, %xmm0
> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> paddd %xmm0, %xmm1
> movd %xmm1, %eax
> retq
> 
> 
> Note that due to the smaller VF we are using now (currently 4), we
> cannot exploit the full benefit of psadbw. The patch in
> http://reviews.llvm.org/D8943 enables us to use bigger VFs based on
> the smallest type in a loop. The follow-up work is refining the cost
> model so that bigger VFs have lower cost. For the example above, the
> best result comes from VF >= 16.
> 
> The draft of the patch is here: http://reviews.llvm.org/D14840
> 
> I will refine the patch later and submit it for review.
> 
> 
> thanks,
> Cong
> 
> 
> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com> wrote:
> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> Hi Cong,
> >>
> >>> -----Original Message-----
> >>> From: Cong Hou [mailto:congh at google.com]
> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >>> To: Shahid, Asghar-ahmad
> >>> Cc: David Li
> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
> >>> instruction.
> >>>
> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-
> >>> ahmad.Shahid at amd.com> wrote:
> >>> > Hi Cong,
> >>> >
> >>> > We had proposed an intrinsic approach to do this. However, the
> >>> > discussion reached a point where it was asked: "Why do we need
> >>> > another approach if 'reduction add' can be pattern-matched in
> >>> > DAGCombine?" However, I feel that if we have a strong enough
> >>> > rationale for introducing this instruction, it would be great.
> >>> > The first link below has the complete discussion about the
> >>> > intrinsic approach.
> >>>
> >>> Yes, I think introducing such a reduction add lets us do
> >>> pattern recognition of either SAD or dot product (or more?)
> >>> without introducing any additional intrinsics.
> >> I agree. Another use case could be the POPCOUNT operation.
> >> Moreover, as 'reduction add' is being adopted by more targets
> >> nowadays, reflecting that in LLVM IR as an instruction is a good
> >> idea. BTW, what is your idea of the syntax and semantics of this
> >> operation?
> >
> > We can introduce a reduce-add for vectors only, or make it general
> > so that it also accepts scalar operands. Normally it is identical
> > to a normal add, but during instruction selection we can do
> > pattern recognition based on the extra information provided by
> > this new operation. For vectors, this means the operation only
> > guarantees that the sum of all elements in the result is identical
> > to the sum of all elements of its operands. This gives us enough
> > freedom to do aggressive transformations, such as SAD or
> > dot-product.
> >
> > Do you think this is enough for us to get there?
> >
> >
> > Cong
> >
> >>
> >>> The main concern may be the cost model: we cannot guarantee
> >>> that a SAD loop is unrolled 16 times on SSE to make the most of
> >>> SAD. After the patch http://reviews.llvm.org/D8943 has landed, I
> >>> am now working on improving the cost models of type conversions
> >>> on X86. I believe even without the SAD instruction we can still
> >>> get better performance by unrolling a SAD loop 16 times. This
> >>> seems tricky but it works. What do you think?
> >> I think the same: we can increase the bandwidth with proper cost
> >> modeling.
> >>
> >> Regards,
> >> Shahid
> >>>
> >>> Cong
> >>>
> >>> >
> >>> > http://reviews.llvm.org/D10964
> >>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >>> >
> >>> > Regards,
> >>> > Shahid
> >>> >
> >>> >
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
> >>> >> Behalf Of
> >>> >> Cong Hou via llvm-dev
> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >>> >> To: llvm-dev at lists.llvm.org
> >>> >> Cc: David Li
> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add
> >>> >> instruction.
> >>> >>
> >>> >> Hi
> >>> >>
> >>> >> When a reduction instruction is vectorized in a loop, it is
> >>> >> turned into an instruction with vector operands of the same
> >>> >> operation type. This new instruction has a special property
> >>> >> that gives us more flexibility during instruction selection
> >>> >> later: the operation is valid as long as the reduction of all
> >>> >> elements of the result vector is identical to the reduction
> >>> >> of all elements of its operands.
> >>> >>
> >>> >> One example that can benefit from this property is SAD (sum
> >>> >> of absolute differences) pattern detection in SSE2, which
> >>> >> provides a psadbw instruction whose description is shown
> >>> >> below:
> >>> >>
> >>> >> '''
> >>> >> psadbw: Compute the absolute differences of packed unsigned
> >>> >> 8-bit
> >>> >> integers in a and b, then horizontally sum each consecutive 8
> >>> >> differences to produce two unsigned 16-bit integers, and pack
> >>> >> these
> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit elements
> >>> >> in dst.
> >>> >> '''
> >>> >>
> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 vectors
> >>> >> as inputs and one v4i32 vector as output. However, psadbw
> >>> >> actually produces one i32 result for four pairs of 8-bit
> >>> >> integers (an already-reduced result), stored in the first
> >>> >> element of the v4i32. If we properly zero out the other three
> >>> >> elements of the v4i32, and we know that a reduction add is
> >>> >> performed on this result, then we can safely use psadbw here
> >>> >> for much better performance. This can be done during DAG
> >>> >> combine. Another similar example is dot product. And I think
> >>> >> there may be many other scenarios that can benefit from this
> >>> >> property, such as eliminating redundant shuffles.
> >>> >>
> >>> >> The question is, how to let DAG combiner know that a vector
> >>> >> operation
> >>> >> is a reduction one?
> >>> >>
> >>> >> Here I propose to introduce a "reduction add" instruction for
> >>> >> vectors.
> >>> >> This will be a new instruction with vector operands only.
> >>> >> Normally it
> >>> >> is treated as a normal ADD operation, but the selection DAG
> >>> >> combiner
> >>> >> can make use of this new operation to generate better
> >>> >> instructions.
> >>> >> This new instruction is generated when vectorizing a
> >>> >> reduction add in the loop vectorizer.
> >>> >>
> >>> >> I would like to hear more comments on this proposal or
> >>> >> suggestions of
> >>> >> better alternative implementations.
> >>> >>
> >>> >>
> >>> >> thanks,
> >>> >> Cong
> >>> >> _______________________________________________
> >>> >> LLVM Developers mailing list
> >>> >> llvm-dev at lists.llvm.org
> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

