[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 25 15:17:58 PST 2015


Hal is probably not questioning the usefulness of reduction recognition or
of a way to represent it, but asking for clear semantics of the flag. You
can probably draw some ideas from the OpenMP SIMD reduction clause or
Intel's SIMD pragma reduction clause.

https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
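
For reference, the OpenMP 4.0 form pins the semantics down explicitly; a
minimal sketch:

// The clause names the reduction variable and operator, which is exactly
// the kind of contract the proposed flag would need to capture.
int sum(const int *a, int n) {
  int s = 0;
  #pragma omp simd reduction(+:s)
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}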

David

On Wed, Nov 25, 2015 at 3:07 PM, Cong Hou <congh at google.com> wrote:

> On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Hi Cong,
> >
> > After reading the original RFC and this update, I'm still not entirely
> sure I understand the semantics of the flag you're proposing to add. Does
> it having something to do with the ordering of the reduction operations?
>
> The flag is only useful for vectorized reductions for now. I'll give
> you an example of how this flag is used.
>
> Given the following reduction loop:
>
> int a[N];
> int s = 0;
> for (int i = 0; i < N; ++i)
>   s += a[i];
>
> After it is vectorized, we have a reduction add operation whose
> operands and result are vectors. Suppose it is represented as:
>
> [s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].
>
> If we know this operation is a reduction, then we have some flexibility
> in how the elements of the result are organized, as long as their sum
> stays the same (the precondition is that the only use of the reduction
> result is the reduction phi node, which is usually true as long as a
> reduction loop can be vectorized). For example, if we let the result be
> [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3, 0], the final reduction
> result won't change. This enables us to detect SAD or dot-product
> patterns and use SSE's psadbw and pmaddwd instructions. Please see my
> response to your other email for more details.
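>
> In scalar terms, the invariant looks roughly like this (just a sketch of
> the semantics, with a hypothetical hsum() helper, not code from the patch):
>
> // Scalar model of the vectorized reduction with VF = 4. Only hsum(acc)
> // is ever read after the loop, so a combine may redistribute partial
> // sums across lanes (e.g. {s0+s1, 0, s2+s3, 0}) without changing the
> // final result.
> int hsum(const int v[4]) { return v[0] + v[1] + v[2] + v[3]; }
>
> int reduce(const int *a, int n) {  // assume n is a multiple of 4
>   int acc[4] = {0, 0, 0, 0};
>   for (int i = 0; i < n; i += 4)
>     for (int j = 0; j < 4; ++j)
>       acc[j] += a[i + j];          // the vectorized "reduction add"
>   return hsum(acc);                // the reduction phi feeds only this sum
> }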
>
> Thanks!
>
> Cong
>
> >
> > Thanks again,
> > Hal
> >
> > ----- Original Message -----
> >> From: "Cong Hou via llvm-dev" <llvm-dev at lists.llvm.org>
> >> To: "llvm-dev" <llvm-dev at lists.llvm.org>
> >> Cc: "David Li" <davidxl at google.com>
> >> Sent: Thursday, November 19, 2015 3:12:10 PM
> >> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
> instruction.
> >>
> >> After some attempts to implement reduce-add in LLVM, I found an
> >> easier way to detect reduce-add without introducing new IR
> >> operations. The basic idea is to annotate the phi node instead of the
> >> add (so that it is easier to handle other reduction operations). In
> >> the PHINode class, we can add a flag indicating whether the phi node
> >> is a reduction one (the flag can be set in the loop vectorizer for
> >> vectorized phi nodes). Then, when we build SDNodes for instruction
> >> selection, we detect those reduction phi nodes and annotate the
> >> reduction operations. This requires an additional flag in
> >> SDNodeFlags. We can then check this flag when combining instructions
> >> to detect reduction operations.
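> >>
> >> Conceptually, the combine-side check looks roughly like this (a
> >> hypothetical sketch, not the actual code in the draft patch; the
> >> accessor name hasVectorReduction() is an assumption):
> >>
> >> // Only attempt the SAD rewrite when the add carries the reduction
> >> // flag that was propagated from the annotated phi node.
> >> static SDValue combineReductionAdd(SDNode *N, SelectionDAG &DAG) {
> >>   SDNodeFlags Flags = N->getFlags();   // flag bit added to SDNodeFlags
> >>   if (!Flags.hasVectorReduction())     // assumed accessor name
> >>     return SDValue();
> >>   // Match abs(sub(zext(a), zext(b))) feeding this add; if it matches,
> >>   // emit the target's psadbw node plus a widened reduction add here.
> >>   return SDValue();
> >> }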
> >>
> >> In this approach, I have managed to let LLVM compile a SAD loop into
> >> psadbw instructions.
> >>
> >> Source code:
> >>
> >>
> >> const int N = 1024;
> >> unsigned char a[N], b[N];
> >>
> >> int sad() {
> >>   int s = 0;
> >>   for (int i = 0; i < N; ++i) {
> >>     int res = a[i] - b[i];
> >>     s += (res > 0) ? res : -res;
> >>   }
> >>   return s;
> >> }
> >>
> >>
> >> Emitted instructions on X86:
> >>
> >>
> >>
> >> # BB#0:                                 # %entry
> >> pxor %xmm0, %xmm0
> >> movq $-1024, %rax            # imm = 0xFFFFFFFFFFFFFC00
> >> pxor %xmm1, %xmm1
> >> .align 16, 0x90
> >> .LBB0_1:                                # %vector.body
> >>                                         # =>This Inner Loop Header:
> >>                                         Depth=1
> >> movd b+1024(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> >> movd a+1024(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> >> psadbw %xmm2, %xmm3
> >> paddd %xmm3, %xmm0
> >> movd b+1028(%rax), %xmm2     # xmm2 = mem[0],zero,zero,zero
> >> movd a+1028(%rax), %xmm3     # xmm3 = mem[0],zero,zero,zero
> >> psadbw %xmm2, %xmm3
> >> paddd %xmm3, %xmm1
> >> addq $8, %rax
> >> jne .LBB0_1
> >> # BB#2:                                 # %middle.block
> >> paddd %xmm0, %xmm1
> >> pshufd $78, %xmm1, %xmm0       # xmm0 = xmm1[2,3,0,1]
> >> paddd %xmm1, %xmm0
> >> pshufd $229, %xmm0, %xmm1      # xmm1 = xmm0[1,1,2,3]
> >> paddd %xmm0, %xmm1
> >> movd %xmm1, %eax
> >> retq
> >>
> >>
> >> Note that due to the small VF we are using now (currently 4), we
> >> cannot exploit the full benefit of psadbw. The patch in
> >> http://reviews.llvm.org/D8943 enables us to use bigger VFs based on
> >> the smallest type in a loop. The follow-up work is refining the cost
> >> model so that bigger VFs get lower cost. For the example above, the
> >> best result comes from VF >= 16.
> >>
> >> The draft of the patch is here: http://reviews.llvm.org/D14840
> >>
> >> I will refine the patch later and submit it for review.
> >>
> >>
> >> thanks,
> >> Cong
> >>
> >>
> >> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou <congh at google.com> wrote:
> >> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
> >> > <Asghar-ahmad.Shahid at amd.com> wrote:
> >> >> Hi Cong,
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: Cong Hou [mailto:congh at google.com]
> >> >>> Sent: Tuesday, November 17, 2015 12:47 AM
> >> >>> To: Shahid, Asghar-ahmad
> >> >>> Cc: David Li
> >> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
> >> >>> instruction.
> >> >>>
> >> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad <Asghar-
> >> >>> ahmad.Shahid at amd.com> wrote:
> >> >>> > Hi Cong,
> >> >>> >
> >> >>> > We had proposed an intrinsic approach to do this. However, the
> >> >>> > discussion reached a point where it was asked: "Why do we need
> >> >>> > another approach if 'reduction add' can be pattern matched in
> >> >>> > DAGCombine?" Still, I feel that if we have a strong enough
> >> >>> > rationale for introducing this instruction, it would be great.
> >> >>> > The first link below has the complete discussion of the
> >> >>> > intrinsic approach.
> >> >>>
> >> >>> Yes, I think introducing such a reduction add can let us do
> >> >>> pattern recognition of either SAD or dot product (or more?)
> >> >>> without introducing any additional intrinsics.
> >> >> I agree. Another use case could be the POPCOUNT operation (a small
> >> >> sketch of that loop shape is below). Moreover, as 'reduction add'
> >> >> is being adopted by more targets nowadays, reflecting it in LLVM
> >> >> IR as an instruction is a good idea.
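> >> >>
> >> >> Roughly the loop shape I mean (just an illustrative sketch):
> >> >>
> >> >> // The per-element bit counts reduce with '+', so the vectorized
> >> >> // accumulator could carry the same reduction flag.
> >> >> int popcount_sum(const unsigned *a, int n) {
> >> >>   int s = 0;
> >> >>   for (int i = 0; i < n; ++i)
> >> >>     s += __builtin_popcount(a[i]);  // GCC/Clang builtin
> >> >>   return s;
> >> >> }
> >> >>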
> >> >> BTW, what is your idea of the syntax and semantics of this
> >> >> operation?
> >> >
> >> > We can introduce a reduce-add for vectors only, or make it general
> >> > so that it can also accept scalar operands. Normally it is identical
> >> > to a normal add, but during instruction selection we can do pattern
> >> > recognition based on the extra information provided by this new
> >> > operation. For vectors, this means the operation only guarantees
> >> > that the sum of all elements of the result is identical to the sum
> >> > of all elements of its operands. This gives us enough freedom to do
> >> > aggressive transformations, such as SAD or dot-product.
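> >> >
> >> > As a concrete sketch of the dot-product case (the kind of loop that
> >> > pmaddwd targets on x86; illustrative only):
> >> >
> >> > // Each vectorized add of partial products only needs to preserve
> >> > // the horizontal sum, which is exactly the guarantee described above.
> >> > short x[1024], y[1024];
> >> >
> >> > int dot() {
> >> >   int s = 0;
> >> >   for (int i = 0; i < 1024; ++i)
> >> >     s += x[i] * y[i];
> >> >   return s;
> >> > }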
> >> >
> >> > Do you think this is enough for us to get there?
> >> >
> >> >
> >> > Cong
> >> >
> >> >>
> >> >>> The main concern may be the cost model: we cannot guarantee that
> >> >>> a SAD loop is unrolled 16 times on SSE to get the most benefit
> >> >>> from SAD. After the patch http://reviews.llvm.org/D8943 is
> >> >>> landed, I am now working on improving the cost models of type
> >> >>> conversions on X86. I believe that even without the SAD
> >> >>> instruction we can still get better performance by unrolling a
> >> >>> SAD loop 16 times. This seems tricky but it works. What do you
> >> >>> think?
> >> >> I think the same: we can increase the bandwidth with proper cost
> >> >> modeling.
> >> >>
> >> >> Regards,
> >> >> Shahid
> >> >>>
> >> >>> Cong
> >> >>>
> >> >>> >
> >> >>> > http://reviews.llvm.org/D10964
> >> >>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
> >> >>> >
> >> >>> > Regards,
> >> >>> > Shahid
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >> -----Original Message-----
> >> >>> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On
> >> >>> >> Behalf Of
> >> >>> >> Cong Hou via llvm-dev
> >> >>> >> Sent: Friday, November 13, 2015 5:47 AM
> >> >>> >> To: llvm-dev at lists.llvm.org
> >> >>> >> Cc: David Li
> >> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction add
> >> >>> >> instruction.
> >> >>> >>
> >> >>> >> Hi
> >> >>> >>
> >> >>> >> When a reduction instruction in a loop is vectorized, it is
> >> >>> >> turned into an instruction of the same operation type but with
> >> >>> >> vector operands. This new instruction has a special property
> >> >>> >> that can give us more flexibility during instruction selection
> >> >>> >> later: the operation is valid as long as the reduction of all
> >> >>> >> elements of the result vector is identical to the reduction of
> >> >>> >> all elements of its operands.
> >> >>> >>
> >> >>> >> One example that can benefit from this property is SAD (sum of
> >> >>> >> absolute differences) pattern detection on SSE2, which provides
> >> >>> >> a psadbw instruction whose description is shown below:
> >> >>> >>
> >> >>> >> '''
> >> >>> >> psadbw: Compute the absolute differences of packed unsigned
> >> >>> >> 8-bit integers in a and b, then horizontally sum each
> >> >>> >> consecutive 8 differences to produce two unsigned 16-bit
> >> >>> >> integers, and pack these unsigned 16-bit integers in the low
> >> >>> >> 16 bits of 64-bit elements in dst.
> >> >>> >> '''
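> >> >>> >>
> >> >>> >> In scalar terms, each 64-bit lane of the result is computed
> >> >>> >> roughly like this (just a sketch of the semantics, not compiler
> >> >>> >> code):
> >> >>> >>
> >> >>> >> // Model of one 64-bit lane of psadbw: the sum of absolute
> >> >>> >> // differences of 8 unsigned bytes; the sum fits in 16 bits and
> >> >>> >> // the upper bits of the lane are zero.
> >> >>> >> unsigned sad8(const unsigned char *a, const unsigned char *b) {
> >> >>> >>   unsigned s = 0;
> >> >>> >>   for (int i = 0; i < 8; ++i)
> >> >>> >>     s += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
> >> >>> >>   return s;
> >> >>> >> }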
> >> >>> >>
> >> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 inputs and
> >> >>> >> one v4i32 output. However, psadbw actually produces one i32
> >> >>> >> result for four pairs of 8-bit integers (an already reduced
> >> >>> >> result), and that result is stored in the first element of the
> >> >>> >> v4i32. If we properly zero out the other three elements of the
> >> >>> >> v4i32, and we know that a reduction add is performed on this
> >> >>> >> result, then we can safely use psadbw here for much better
> >> >>> >> performance. This can be done during DAG combine. Another
> >> >>> >> similar example is dot product. And I think there may be many
> >> >>> >> other scenarios that can benefit from this property, such as
> >> >>> >> eliminating redundant shuffles.
> >> >>> >>
> >> >>> >> The question is: how do we let the DAG combiner know that a
> >> >>> >> vector operation is a reduction?
> >> >>> >>
> >> >>> >> Here I propose to introduce a "reduction add" instruction for
> >> >>> >> vectors. This will be a new instruction with vector operands
> >> >>> >> only. Normally it is treated as a normal ADD operation, but the
> >> >>> >> selection DAG combiner can make use of this new operation to
> >> >>> >> generate better instructions. This new instruction is generated
> >> >>> >> when the loop vectorizer vectorizes a reduction add.
> >> >>> >>
> >> >>> >> I would like to hear more comments on this proposal or
> >> >>> >> suggestions for better alternative implementations.
> >> >>> >>
> >> >>> >>
> >> >>> >> thanks,
> >> >>> >> Cong
> >> >>> >> _______________________________________________
> >> >>> >> LLVM Developers mailing list
> >> >>> >> llvm-dev at lists.llvm.org
> >> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
>

