[llvm-dev] [RFC] Introducing a vector reduction add instruction.

Mon Nov 30 13:47:14 PST 2015

On Thu, Nov 26, 2015 at 8:39 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Cong Hou" <congh at google.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "Xinliang David Li" <davidxl at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
>> Sent: Wednesday, November 25, 2015 6:33:04 PM
>> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add instruction.
>>
>> On Wed, Nov 25, 2015 at 3:44 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> > ----- Original Message -----
>> >> From: "Xinliang David Li" <davidxl at google.com>
>> >> To: "Cong Hou" <congh at google.com>
>> >> Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev"
>> >> <llvm-dev at lists.llvm.org>
>> >> Sent: Wednesday, November 25, 2015 5:17:58 PM
>> >> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction add
>> >> instruction.
>> >>
>> >>
>> >> Hal is probably not questioning about the usefulness of reduction
>> >> recognition and a way to represent it, but the clear semantics of
>> >> the flag. You can probably draw some ideas from OMP SIMD reduction
>> >> clause, or intel's SIMD pragma's reduction clause.
>> >
>> > True, but nevertheless, Cong's reply was useful. Here's my
>> > interpretation so far:
>> >
>> > Placing this flag on a PHI node, which is only valid for a
>> > vector-valued PHI, indicates that only the sum of vector elements
>> > is meaningful. This could easily be extended to cover any
>> > associative operation (only the product is useful, only the
>> > maximum or minimum value is useful, etc.).
>>
>> Right.
>>
>> >
>> > Now I completely understand why the flag is useful at the SDAG
>> > level. Because SDAG is basic-block local, we can't examine the
>> > loop structure when doing instruction selection for the relevant
>> > operations composing the psadbw (and friends). We also need to
>> > realize, when lowering the horizontal reduction at the end of the
>> > loop, to lower it in some more-trivial way (right?).
>>
>> The benefit of collecting the result outside of the loop is trivial
>> unless it is in an outer loop. And I am afraid if the reduction info
>> helps here as the way how the results are collected is determined
>> already after the loop is vectorized.
>
> Yes, but "unless it is in an outer loop" is an important case in practice. The underlying situation is: Given that the target has lowered the instructions producing the PHI value to ones that always leave certain vector elements zero, we should be able to use that information when simplifying/lowering the horizontal sum in the exit block.
>
> One way to handle this is to enhance the SelectionDAGISel::ComputeLiveOutVRegInfo() function, and related infrastructure, to understand what is going on (with target help), so that by the time we get to the exit block, the necessary information is known and can be used to simplify the incoming vector values. This can certainly be considered follow-up work.

Sounds great. It is also possible to move the collection of reduction
vectors from outer loop to outside of it when the collected result is
also a reduction variable of the outer loop.

>
>>
>> >
>> > Regarding the metadata at the IR level: the motivation here is
>> > that, without it, the SDAG builder would need to examine the uses
>> > of the PHI, determine that the only uses were shuffles and
>> > extracts representing a horizontal reduction (add, etc.), and then
>> > deduce that it could add the flag from that.
>>
>> The reduction pattern detection for the vectorized loop at this point
>> can be difficult for the following reasons:
>>
>>
>> 1. The vectorized code outside of the loop which collects results in
>> the vector into a scalar can be messy. For example, there may be many
>> shuffles, which is not straightforward to understand (although it is
>> still possible to detected them).
>>
>> 2. After loop unrolling, the reduction operation is duplicated
>> several
>> times but the phi node is not. We should be able to detect all those
>> unrolled reductions but this task is also challenging.
>>
>> 3. We need to make sure there is no other uses of the result of the
>> reduction operation (those operations after loop unrolling is an
>> exception, where the result can be used by another copy of the same
>> reduction operation).
>>
>>
>> However, all those information can be easily obtained during loop
>> vectorization. So why not record it in that stage?
>
> Because it adds yet-another place (PHI nodes) that we need to worry about preserving metadata, and it adds information redundant with that already contained in the IR. Generally, we try not to do that.
>
> If it were the case that re-deriving that information during SDAG building would be non-trivial (because, for example, it would require pulling in some expensive analysis), I'd consider that a valid reason. That does not seem to be the case, however.
>
> We need to be very careful not to fall into the trap of having a "magic vectorizer", meaning a vectorizer that knows how to induce special code-generation functionality not otherwise available to generic input code. It is trivial to write, using completely generic LLVM IR with vector types, the code that the vectorizer would produce. A frontend might generate this directly, for example. It is important that we treat such code as a "first-class citizen" within our framework. Thus, I'd highly prefer that we pattern match the necessary IR in the SDAG builder instead of adding metadata in the vectorizer.

I generally agree with you. We can retrieve the information that an
operation is a reduction one by pattern matching, which is a little
more expensive though. As it eliminates unnecessary or redundant
information recording in loop vectorizer, I think this idea is
reasonable. Loop unrolling makes the case more complicated but it is
still doable. I will try to implement this approach.

Thank you very much for the suggestion!

Cong

>
>>
>> >
>> > If this matching is practical (and I suspect that it is, given that
>> > we already do it in the backends to match horizontal adds), this
>> > seems better than using vectorizer-added metadata. In this way, IR
>> > generated using vector intrinsics (etc.) can receive the same good
>> > code generation as that generated by the vectorizer.
>>
>> This is a good point. But I am not sure if it is worth the effort to
>> write a reduction detector for instrinsics. But the whole idea is
>> flexible to accept any reduction detector.
>
> When I say "intrinsics", I mean using native LLVM vector types.
>
> Thanks again,
> Hal
>
>>
>> Cong
>>
>> >
>> > Thanks again,
>> > Hal
>> >
>> >>
>> >>
>> >> https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
>> >>
>> >>
>> >>
>> >> David
>> >>
>> >>
>> >> On Wed, Nov 25, 2015 at 3:07 PM, Cong Hou < congh at google.com >
>> >> wrote:
>> >>
>> >>
>> >> On Wed, Nov 25, 2015 at 2:32 PM, Hal Finkel < hfinkel at anl.gov >
>> >> wrote:
>> >> > Hi Cong,
>> >> >
>> >> > After reading the original RFC and this update, I'm still not
>> >> > entirely sure I understand the semantics of the flag you're
>> >> > proposing to add. Does it having something to do with the
>> >> > ordering
>> >> > of the reduction operations?
>> >>
>> >> The flag is only useful for vectorized reduction for now. I'll
>> >> give
>> >> you an example how this flag is used.
>> >>
>> >> Given the following reduction loop:
>> >>
>> >> int a[N];
>> >> int s = 0;
>> >> for (int i = 0; i < N; ++i)
>> >> s += a[i];
>> >>
>> >> After it is vectorized, we have a reduction operation add whose
>> >> operands and the result are vectors. Suppose it is represented as:
>> >>
>> >> [s0, s1, s2, s3] = [s0, s1, s2, s3] + [a0, a1, a2, a3].
>> >>
>> >> If we know this operation is a reduction one, then we could have
>> >> some
>> >> flexibility on how elements in the result are organized, as long
>> >> as
>> >> the sum of them stays the same (the precondition is that the only
>> >> use
>> >> of the reduction result is the reduction phi node, and this is
>> >> usually
>> >> true as long as a reduction loop can be vectorized). For example,
>> >> if
>> >> we let the result be [s0+s1, 0, s2+s3, 0] or [0, 0, s0+s1+s2+s3,
>> >> 0],
>> >> the reduction result won't change. This enable us to detect SAD or
>> >> dot-product patterns and use SSE's psadbw and pmaddwd
>> >> instructions.
>> >> Please see my respond to your another email for more details.
>> >>
>> >> Thanks!
>> >>
>> >> Cong
>> >>
>> >>
>> >>
>> >> >
>> >> > Thanks again,
>> >> > Hal
>> >> >
>> >> > ----- Original Message -----
>> >> >> From: "Cong Hou via llvm-dev" < llvm-dev at lists.llvm.org >
>> >> >> To: "llvm-dev" < llvm-dev at lists.llvm.org >
>> >> >> Cc: "David Li" < davidxl at google.com >
>> >> >> Sent: Thursday, November 19, 2015 3:12:10 PM
>> >> >> Subject: Re: [llvm-dev] [RFC] Introducing a vector reduction
>> >> >> add
>> >> >> instruction.
>> >> >>
>> >> >> After some attempt to implement reduce-add in LLVM, I found out
>> >> >> a
>> >> >> easier way to detect reduce-add without introducing new IR
>> >> >> operations.
>> >> >> The basic idea is annotating phi node instead of add (so that
>> >> >> it
>> >> >> is
>> >> >> easier to handle other reduction operations). In PHINode class,
>> >> >> we
>> >> >> can
>> >> >> add a flag indicating if the phi node is a reduction one (the
>> >> >> flag
>> >> >> can
>> >> >> be set in loop vectorizer for vectorized phi nodes). Then when
>> >> >> we
>> >> >> build SDNode for instruction selection, we detect those
>> >> >> reduction
>> >> >> phi
>> >> >> nodes and then annotate reduction operations. This requires an
>> >> >> additional flag in SDNodeFlags. We can then check this flag
>> >> >> when
>> >> >> combining instructions to detect reduction operations.
>> >> >>
>> >> >> In this approach, I have managed to let LLVM compile a SAD loop
>> >> >> into
>> >> >> psadbw instructions.
>> >> >>
>> >> >> Source code:
>> >> >>
>> >> >>
>> >> >> const int N = 1024;
>> >> >> unsigned char a[N], b[N];
>> >> >>
>> >> >> int sad() {
>> >> >> int s = 0;
>> >> >> for (int i = 0; i < N; ++i) {
>> >> >> int res = a[i] - b[i];
>> >> >> s += (res > 0) ? res : -res;
>> >> >> }
>> >> >> return s;
>> >> >> }
>> >> >>
>> >> >>
>> >> >> Emitted instructions on X86:
>> >> >>
>> >> >>
>> >> >>
>> >> >> # BB#0: # %entry
>> >> >> pxor %xmm0, %xmm0
>> >> >> movq $-1024, %rax # imm = 0xFFFFFFFFFFFFFC00
>> >> >> pxor %xmm1, %xmm1
>> >> >> .align 16, 0x90
>> >> >> .LBB0_1: # %vector.body
>> >> >> # =>This Inner Loop Header:
>> >> >> Depth=1
>> >> >> movd b+1024(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero
>> >> >> movd a+1024(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero
>> >> >> psadbw %xmm2, %xmm3
>> >> >> paddd %xmm3, %xmm0
>> >> >> movd b+1028(%rax), %xmm2 # xmm2 = mem[0],zero,zero,zero
>> >> >> movd a+1028(%rax), %xmm3 # xmm3 = mem[0],zero,zero,zero
>> >> >> psadbw %xmm2, %xmm3
>> >> >> paddd %xmm3, %xmm1
>> >> >> addq $8, %rax
>> >> >> jne .LBB0_1
>> >> >> # BB#2: # %middle.block
>> >> >> paddd %xmm0, %xmm1
>> >> >> pshufd $78, %xmm1, %xmm0 # xmm0 = xmm1[2,3,0,1]
>> >> >> paddd %xmm1, %xmm0
>> >> >> pshufd $229, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3]
>> >> >> paddd %xmm0, %xmm1
>> >> >> movd %xmm1, %eax
>> >> >> retq
>> >> >>
>> >> >>
>> >> >> Note that due to smaller VF we are using now (currently 4), we
>> >> >> could
>> >> >> not explore the most benefit of psadbw. The patch in
>> >> >> http://reviews.llvm.org/D8943 has enables us to use bigger VFs
>> >> >> based
>> >> >> on the smallest type in a loop. The follow-up work is refining
>> >> >> the
>> >> >> cost model to let bigger VFs have less cost. For the example
>> >> >> above
>> >> >> the
>> >> >> best result is from VF >=16.
>> >> >>
>> >> >> The draft of the patch is here: http://reviews.llvm.org/D14840
>> >> >>
>> >> >> I will refine the patch later and submit it for review.
>> >> >>
>> >> >>
>> >> >> thanks,
>> >> >> Cong
>> >> >>
>> >> >>
>> >> >> On Wed, Nov 18, 2015 at 2:45 PM, Cong Hou < congh at google.com >
>> >> >> wrote:
>> >> >> > On Mon, Nov 16, 2015 at 9:31 PM, Shahid, Asghar-ahmad
>> >> >> > < Asghar-ahmad.Shahid at amd.com > wrote:
>> >> >> >> Hi Cong,
>> >> >> >>
>> >> >> >>> -----Original Message-----
>> >> >> >>> From: Cong Hou [mailto: congh at google.com ]
>> >> >> >>> Sent: Tuesday, November 17, 2015 12:47 AM
>> >> >> >>> To: Shahid, Asghar-ahmad
>> >> >> >>> Cc: David Li
>> >> >> >>> Subject: Re: [llvm-dev] [RFC] Introducing a vector
>> >> >> >>> reduction
>> >> >> >>> add
>> >> >> >>> instruction.
>> >> >> >>>
>> >> >> >>> On Thu, Nov 12, 2015 at 9:37 PM, Shahid, Asghar-ahmad
>> >> >> >>> <Asghar-
>> >> >> >>> ahmad.Shahid at amd.com > wrote:
>> >> >> >>> > Hi Cong,
>> >> >> >>> >
>> >> >> >>> > We had proposed an intrinsic approach to do this. However
>> >> >> >>> > the
>> >> >> >>> > discussion reached to a point where it was asked that
>> >> >> >>> > "Why
>> >> >> >>> > do
>> >> >> >>> > we need
>> >> >> >>> > another approach if "reduction add" can be pattern
>> >> >> >>> > matched
>> >> >> >>> > in
>> >> >> >>> DAGCombine?"
>> >> >> >>> > However I feel if we have strong enough rationale for
>> >> >> >>> > introduction of
>> >> >> >>> > this instruction, it would be great. The 1st link below
>> >> >> >>> > has
>> >> >> >>> > the
>> >> >> >>> > complete
>> >> >> >>> discussion about the intrinsic approach.
>> >> >> >>>
>> >> >> >>> Yes, I think introducing such a reduction add can let us do
>> >> >> >>> pattern recognition
>> >> >> >>> of either SAD or dot production (or more?) without
>> >> >> >>> introducing
>> >> >> >>> any
>> >> >> >>> additional intrinsics.
>> >> >> >> I agree. Another use case could be POPCOUNT operation.
>> >> >> >> Moreover,
>> >> >> >> as 'reduction add'
>> >> >> >> Is being adopted by more targets now a days, reflecting that
>> >> >> >> in
>> >> >> >> LLVM IR as an instruction
>> >> >> >> Is a good idea.
>> >> >> >> BTW, what is the idea of the syntax and semantic of this
>> >> >> >> operation
>> >> >> >> you have?
>> >> >> >
>> >> >> > We can introduce a reduce-add for vectors only, or make it
>> >> >> > general
>> >> >> > so
>> >> >> > that it could also accept scale operands. Normally it is
>> >> >> > identical
>> >> >> > to
>> >> >> > normal add, but during instruction selection we can do
>> >> >> > pattern
>> >> >> > recognition based on more information provided by this new
>> >> >> > operations.
>> >> >> > For vectors, this means the result of this operation only
>> >> >> > guarantee
>> >> >> > that the sum of all elements in the result is identical to
>> >> >> > the
>> >> >> > sum
>> >> >> > of
>> >> >> > all elements of its operands. This gives us enough freedom to
>> >> >> > do
>> >> >> > aggressive transformations, such as SAD or dot-product.
>> >> >> >
>> >> >> > Do you think if this is enough for us to get there?
>> >> >> >
>> >> >> >
>> >> >> > Cong
>> >> >> >
>> >> >> >>
>> >> >> >> The main concern may be cost
>> >> >> >>> model: we could not guarantee that a SAD loop is unrolled
>> >> >> >>> 16
>> >> >> >>> times on SSE to
>> >> >> >>> make use the most benefit of SAD. After the patch
>> >> >> >>> http://reviews.llvm.org/D8943 is landed, I am now working
>> >> >> >>> on
>> >> >> >>> improving cost
>> >> >> >>> models of type conversions on X86. I believe even without
>> >> >> >>> SAD
>> >> >> >>> instruction
>> >> >> >>> we can still get better performance with unrolling a SAD
>> >> >> >>> loop
>> >> >> >>> 16
>> >> >> >>> times. This
>> >> >> >>> seems tricky but it works. What do you think?
>> >> >> >> I also think same as we can increase the bandwidth with
>> >> >> >> proper
>> >> >> >> cost modeling.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >> Shahid
>> >> >> >>>
>> >> >> >>> Cong
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> > http://reviews.llvm.org/D10964
>> >> >> >>> > http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html
>> >> >> >>> >
>> >> >> >>> > Regards,
>> >> >> >>> > Shahid
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >
>> >> >> >>> >> -----Original Message-----
>> >> >> >>> >> From: llvm-dev [mailto: llvm-dev-bounces at lists.llvm.org
>> >> >> >>> >> ]
>> >> >> >>> >> On
>> >> >> >>> >> Behalf Of
>> >> >> >>> >> Cong Hou via llvm-dev
>> >> >> >>> >> Sent: Friday, November 13, 2015 5:47 AM
>> >> >> >>> >> To: llvm-dev at lists.llvm.org
>> >> >> >>> >> Cc: David Li
>> >> >> >>> >> Subject: [llvm-dev] [RFC] Introducing a vector reduction
>> >> >> >>> >> add
>> >> >> >>> >> instruction.
>> >> >> >>> >>
>> >> >> >>> >> Hi
>> >> >> >>> >>
>> >> >> >>> >> When a reduction instruction is vectorized in a loop, it
>> >> >> >>> >> will
>> >> >> >>> >> be
>> >> >> >>> >> turned into an instruction with vector operands of the
>> >> >> >>> >> same
>> >> >> >>> >> operation
>> >> >> >>> >> type. This new instruction has a special property that
>> >> >> >>> >> can
>> >> >> >>> >> give us
>> >> >> >>> >> more flexibility during instruction selection later:
>> >> >> >>> >> this
>> >> >> >>> >> operation
>> >> >> >>> >> is valid as long as the reduction of all elements of the
>> >> >> >>> >> result
>> >> >> >>> >> vector is identical to the reduction of all elements of
>> >> >> >>> >> its
>> >> >> >>> >> operands.
>> >> >> >>> >>
>> >> >> >>> >> One example that can benefit this property is SAD (sum
>> >> >> >>> >> of
>> >> >> >>> >> absolute
>> >> >> >>> >> differences) pattern detection in SSE2, which provides a
>> >> >> >>> >> psadbw
>> >> >> >>> >> instruction whose description is shown below:
>> >> >> >>> >>
>> >> >> >>> >> '''
>> >> >> >>> >> psadbw: Compute the absolute differences of packed
>> >> >> >>> >> unsigned
>> >> >> >>> >> 8-bit
>> >> >> >>> >> integers in a and b, then horizontally sum each
>> >> >> >>> >> consecutive
>> >> >> >>> >> 8
>> >> >> >>> >> differences to produce two unsigned 16-bit integers, and
>> >> >> >>> >> pack
>> >> >> >>> >> these
>> >> >> >>> >> unsigned 16-bit integers in the low 16 bits of 64-bit
>> >> >> >>> >> elements
>> >> >> >>> >> in dst.
>> >> >> >>> >> '''
>> >> >> >>> >>
>> >> >> >>> >> In LLVM's IR, for a SAD loop we will have two v4i8 as
>> >> >> >>> >> inputs
>> >> >> >>> >> and one
>> >> >> >>> >> v4i32 as output. However, psadbw will actually produce
>> >> >> >>> >> one
>> >> >> >>> >> i32
>> >> >> >>> >> result
>> >> >> >>> >> for four pairs of 8-bit integers (an already reduced
>> >> >> >>> >> result),
>> >> >> >>> >> and the
>> >> >> >>> >> result is stored in the first element in v4i32. If we
>> >> >> >>> >> properly
>> >> >> >>> >> zero
>> >> >> >>> >> out the other three elements in v4i32, and with the
>> >> >> >>> >> information that
>> >> >> >>> >> we have a reduction add that is performed on this
>> >> >> >>> >> result,
>> >> >> >>> >> then
>> >> >> >>> >> we can
>> >> >> >>> >> safely use psadbw here for much better performance. This
>> >> >> >>> >> can
>> >> >> >>> >> be done
>> >> >> >>> >> during DAG combine. Another similar example is dot
>> >> >> >>> >> product.
>> >> >> >>> >> And I
>> >> >> >>> >> think there may be many other scenarios that can benefit
>> >> >> >>> >> from
>> >> >> >>> >> this
>> >> >> >>> >> property like eliminating redundant shuffles.
>> >> >> >>> >>
>> >> >> >>> >> The question is, how to let DAG combiner know that a
>> >> >> >>> >> vector
>> >> >> >>> >> operation
>> >> >> >>> >> is a reduction one?
>> >> >> >>> >>
>> >> >> >>> >> Here I propose to introduce a "reduction add"
>> >> >> >>> >> instruction
>> >> >> >>> >> for
>> >> >> >>> >> vectors.
>> >> >> >>> >> This will be a new instruction with vector operands
>> >> >> >>> >> only.
>> >> >> >>> >> Normally it
>> >> >> >>> >> is treated as a normal ADD operation, but the selection
>> >> >> >>> >> DAG
>> >> >> >>> >> combiner
>> >> >> >>> >> can make use of this new operation to generate better
>> >> >> >>> >> instructions.
>> >> >> >>> >> This new instruction is generated when vectorizing
>> >> >> >>> >> reduction
>> >> >> >>> >> add in
>> >> >> >>> >> loop vectorizer.
>> >> >> >>> >>
>> >> >> >>> >> I would like to hear more comments on this proposal or
>> >> >> >>> >> suggestions of
>> >> >> >>> >> better alternative implementations.
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>> >> thanks,
>> >> >> >>> >> Cong
>> >> >> >>> >> _______________________________________________
>> >> >> >>> >> LLVM Developers mailing list
>> >> >> >>> >> llvm-dev at lists.llvm.org
>> >> >> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >> >> _______________________________________________
>> >> >> LLVM Developers mailing list
>> >> >> llvm-dev at lists.llvm.org
>> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >> >>
>> >> >
>> >> > --
>> >> > Hal Finkel
>> >> > Assistant Computational Scientist
>> >> > Leadership Computing Facility
>> >> > Argonne National Laboratory
>> >>
>> >>
>> >
>> > --
>> > Hal Finkel
>> > Assistant Computational Scientist
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory