[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Hal Finkel via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 1 05:27:54 PDT 2015


----- Original Message -----
> From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at lists.llvm.org, reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato golin"
> <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> Sent: Thursday, October 1, 2015 3:32:30 AM
> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
> 
> Hi Hal,
> 
> Reply inlined.
> 
> > -----Original Message-----
> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Thursday, October 01, 2015 4:14 AM
> > To: Shahid, Asghar-ahmad
> > Cc: llvm-commits at lists.llvm.org;
> > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > golin;
> > james molloy
> > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > corresponding SDNodes for horizontal sum operation.
> > 
> > ----- Original Message -----
> > > From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: llvm-commits at lists.llvm.org,
> > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato
> > golin"
> > > <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> > > Sent: Wednesday, September 30, 2015 1:19:06 AM
> > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > corresponding SDNodes for horizontal sum operation.
> > >
> > > Hi Hal,
> > >
> > > Let me present the full context. For below example:
> > > ===============================================
> > > unsigned sad_fun(unsigned char *a1, unsigned char *a2) {
> > >   unsigned sum = 0;
> > >   for (int i = 0; i < 16; i++) {
> > >     sum += abs(a1[i] - a2[i]);
> > >   }
> > >   return sum;
> > > }
> > > ===============================================
> > >
> > > LLVM IR:
> > >
> > > ; ModuleID = 'sad_1.c'
> > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > > target triple = "x86_64-unknown-linux-gnu"
> > >
> > > ; Function Attrs: nounwind readonly uwtable
> > > define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture readonly %a2) #0 {
> > > entry:
> > >   br label %vector.body
> > >
> > > vector.body:                                      ; preds = %entry
> > >   %0 = bitcast i8* %a1 to <4 x i8>*
> > >   %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
> > >   %1 = zext <4 x i8> %wide.load to <4 x i32>
> > >   %2 = bitcast i8* %a2 to <4 x i8>*
> > >   %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
> > >   %3 = zext <4 x i8> %wide.load12 to <4 x i32>
> > >   %4 = sub nsw <4 x i32> %1, %3
> > >   %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %6 = sub nsw <4 x i32> zeroinitializer, %4
> > >   %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
> > >   %8 = getelementptr inbounds i8, i8* %a1, i64 4
> > >   %9 = bitcast i8* %8 to <4 x i8>*
> > >   %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
> > >   %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> > >   %11 = getelementptr inbounds i8, i8* %a2, i64 4
> > >   %12 = bitcast i8* %11 to <4 x i8>*
> > >   %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
> > >   %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
> > >   %14 = sub nsw <4 x i32> %10, %13
> > >   %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %16 = sub nsw <4 x i32> zeroinitializer, %14
> > >   %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> > >   %18 = add nsw <4 x i32> %17, %7
> > >   %19 = getelementptr inbounds i8, i8* %a1, i64 8
> > >   %20 = bitcast i8* %19 to <4 x i8>*
> > >   %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
> > >   %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
> > >   %22 = getelementptr inbounds i8, i8* %a2, i64 8
> > >   %23 = bitcast i8* %22 to <4 x i8>*
> > >   %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
> > >   %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
> > >   %25 = sub nsw <4 x i32> %21, %24
> > >   %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %27 = sub nsw <4 x i32> zeroinitializer, %25
> > >   %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
> > >   %29 = add nsw <4 x i32> %28, %18
> > >   %30 = getelementptr inbounds i8, i8* %a1, i64 12
> > >   %31 = bitcast i8* %30 to <4 x i8>*
> > >   %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
> > >   %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
> > >   %33 = getelementptr inbounds i8, i8* %a2, i64 12
> > >   %34 = bitcast i8* %33 to <4 x i8>*
> > >   %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
> > >   %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
> > >   %36 = sub nsw <4 x i32> %32, %35
> > >   %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %38 = sub nsw <4 x i32> zeroinitializer, %36
> > >   %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
> > >   %40 = add nsw <4 x i32> %39, %29
> > >   %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
> > >   %bin.rdx = add <4 x i32> %40, %rdx.shuf
> > >   %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
> > >   %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
> > >   %41 = extractelement <4 x i32> %bin.rdx14, i32 0
> > >   ret i32 %41
> > > }
> > >
> > > In the above LLVM IR, the vectorizer has vectorized the SAD
> > > pattern with vector type v4i8.
> > > For example, on X86 we want to generate the "v16i8" version of
> > > the *PSAD* instruction. However, to do this we need to catch
> > > the full chain of the vectorized absolute difference and the
> > > reduction sum.
> > > We found this difficult to pattern match in codegen and thought
> > > to go the intrinsic way instead.
> > 
> > But you need the vectorizer to vectorize in terms of v16i8
> > regardless of how you match the sum. You say that the pattern is
> > difficult to match, but do you mean matching it with the right or
> > the wrong vectorization factor?
> 
> I mean matching with the right vectorization factor for a given
> target.

Good.

This pattern does not look hard to match in the backend, albeit a bit tedious. Did you try and run into problems?

> 
> > 
> > > We also realized that with an intrinsic we can use SLP to
> > > vectorize the unrolled pattern if required. Then we simplified
> > > the SAD intrinsic operation into two intrinsics which can be
> > > utilized to generate the "SAD" operation as well as be used in
> > > different contexts. For example "hsum", as mentioned by Bruno,
> > > can be used in the context of the POPCNT operation.
> > >
> > > > No; vectorization is not special in that regard, and needs to
> > > > be
> > > > semantics preserving. We have 'fast math' flags for
> > > > floating-point
> > > > operations, which help there, but there are no such flags for
> > > > integer operations.
> > > Do we need an ordering definition for the integer ADD operation?
> > >
> > > > Do the underlying hardware operations on x86 (etc.) really not
> > > > define an ordering for the horizontal adds?
> > > X86 *PSAD*, which operates on integer types and does the
> > > horizontal add on an intermediate result, does not define an
> > > ordering.
> > 
> > Do you mean that the intermediate accumulator has a larger number
> > of bits
> > than each individual vector element?
> 
> No, by "intermediate result" I meant the result of the "Absolute
> Difference" operation, whose width is the same as that of an
> individual vector element.

Alright, looking at the documentation for PSADBW, I believe I see the source of our confusion. PSADBW's documentation reads: "Computes the absolute value of the difference of 8 unsigned byte integers from the source operand (first operand) and from the destination operand (second operand). These 8 differences are then summed to produce an unsigned word integer result that is stored in the destination operand."

So, this, fundamentally, is an *unsigned* i8 operation. If this is what you're trying to model, we should remove from our discussion the topic of nsw-related semantics here. nuw also does not apply. Wrapping of all kinds is perfectly well defined. Given that wrapping is well defined, and that unsigned integer addition is associative (when we don't have wrapping restrictions), we also don't need to discuss the order of the additions either (they're irrelevant; all orderings would be permitted under 'as if' regardless of what we specified).

 -Hal

> 
> > 
> >  -Hal
> > 
> > >
> > > Regards,
> > > Shahid
> > >
> > > > -----Original Message-----
> > > > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > > > Sent: Wednesday, September 30, 2015 3:42 AM
> > > > To: Shahid, Asghar-ahmad
> > > > Cc: llvm-commits at lists.llvm.org;
> > > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > > > golin;
> > > > james molloy
> > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > and
> > > > corresponding SDNodes for horizontal sum operation.
> > > >
> > > > ----- Original Message -----
> > > > > From: "Asghar-ahmad Shahid via llvm-commits"
> > > > > <llvm-commits at lists.llvm.org>
> > > > > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> > > > > "renato
> > > > golin" <renato.golin at linaro.org>, "james molloy"
> > > > > <james.molloy at arm.com>
> > > > > Cc: llvm-commits at lists.llvm.org
> > > > > Sent: Sunday, September 27, 2015 11:42:46 PM
> > > > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > > and
> > > > corresponding SDNodes for horizontal sum operation.
> > > > >
> > > > > Hi Hal,
> > > > >
> > > > > Response inlined.
> > > > >
> > > > > Regards,
> > > > > Shahid
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > > > > > Sent: Saturday, September 26, 2015 2:58 AM
> > > > > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> > > > > > hfinkel at anl.gov; james.molloy at arm.com
> > > > > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > > > > > commits at lists.llvm.org
> > > > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics
> > > > > > 'hsum*'
> > > > > > and
> > > > > > corresponding SDNodes for horizontal sum operation.
> > > > > >
> > > > > > hfinkel added a comment.
> > > > > >
> > > > > > I need to take a step back here; why are we doing this
> > > > > > again?
> > > > > >
> > > > > > I read again the RFC thread
> > > > > > (http://lists.llvm.org/pipermail/llvm-dev/2015-
> > > > > > May/085078.html), and it ended with the following (from
> > > > > > Renato):
> > > > > >
> > > > > > > > BTW, now my plan is to just add the two intrinsics
> > > > > > > > for 'absolute difference' and 'horizontal add'.
> > > > > >
> > > > > > > That's ok, as long as they're impossible to represent
> > > > > > > in plain IR.
> > > > > >
> > > > > > and I think that we were all in agreement on this point.
> > > > > > But now
> > > > > > I'm not sure you've demonstrated the prerequisite.
> > > > > Prior to this RFC there was a discussion around a patch
> > > > > (http://reviews.llvm.org/D9029) regarding the llvm.sad
> > > > > intrinsic which highlighted the need for llvm.hsum and
> > > > > friend intrinsics. This RFC was a result of that
> > > > > discussion.
> > > > >
> > > > > > The underlying operations here (and in
> > > > > > http://reviews.llvm.org/D10867) seem like they are
> > > > > > representable
> > > > > > using IR (as demonstrated by the fact that you provide
> > > > > > potentially-equivalent IR sequences in the documentation),
> > > > > > except for the ordering freedom here.
> > > > >
> > > > > Of course the logical expansion of the underlying operation
> > > > > is trivially representable, but the pattern matching in
> > > > > codegen is not. In fact, pattern matching is complex and
> > > > > may well be fragile in codegen due to the long chain of
> > > > > generic IR to be matched. Moreover, movement of
> > > > > truncate/extend ops in the DAG increases the complexity of
> > > > > matching further.
> > > >
> > > > It might, but it is not clearly bad enough to require an
> > > > alternative
> > > > solution.
> > > >
> > > > In my experience, the patterns that get difficult to match at
> > > > the
> > > > SDAG level are those that consist, in whole or in part, of
> > > > operations that are illegal (either involving illegal types or
> > > > illegal operations). This becomes difficult because you either
> > > > need
> > > > to match the post-legalized nodes (which might involve undoing
> > > > scalarization, etc.) or you need to match the nodes
> > > > pre-legalization
> > > > (where you cannot depend on the canonicalization that restricts
> > > > the
> > > > state space later). Here, however, we seem to be dealing with
> > > > patterns of otherwise legal nodes, on otherwise legal vector
> > > > types,
> > > > and while the matching might not be trivial, it is not
> > > > obviously
> > > > difficult in *ISelLowering.cpp either.
> > > >
> > > > The other issue that I've run into with SDAG pattern matching
> > > > is the
> > > > fact that it is basic-block local, and some of the operations
> > > > you
> > > > want to match may have been CSE'd or otherwise hoisted into
> > > > dominating blocks. We generally put code in CGP to deal with
> > > > these
> > > > kinds of things.
> > > >
> > > > One advantage of putting the code into SDAG pattern matching is
> > > > that
> > > > the patterns get used regardless of where the code originates
> > > > (vectorizers or manually-written vector code). This is
> > > > important.
> > > >
> > > > >
> > > > > >
> > > > > > And this, I fear, is where we run into trouble. The thing
> > > > > > that is not representable in the IR is that the order of
> > > > > > operations in the horizontal sum is undefined, and at the
> > > > > > same time, signed overflow is undefined. This cannot be
> > > > > > represented in the IR because nsw adds don't reassociate,
> > > > > > and thus, there's no way to represent the sequence of nsw
> > > > > > adds such that they can be reassociated while retaining
> > > > > > their nsw property. But the problem is that, because this
> > > > > > freedom cannot be represented in the IR, we also can't
> > > > > > generate it from IR in a semantics-preserving way; and,
> > > > > > thus, it would not be legal to generate it in the
> > > > > > vectorizers.
> > > > >
> > > > > As proposed, we are not going to generate it unconditionally
> > > > > in the vectorizer, but only based on target checks; can't we
> > > > > also check for ordering relaxation there? Or fix the
> > > > > ordering constraint by some other means?
> > > >
> > > > No; vectorization is not special in that regard, and needs to
> > > > be
> > > > semantics
> > > > preserving. We have 'fast math' flags for floating-point
> > > > operations, which
> > > > help there, but there are no such flags for integer operations.
> > > >
> > > > Do the underlying hardware operations on x86 (etc.) really not
> > > > define an
> > > > ordering for the horizontal adds?
> > > >
> > > > >
> > > > > >
> > > > > > Thus, this change does not seem right, and approving
> > > > > > http://reviews.llvm.org/D10867 seems like a mistake as
> > > > > > well. We
> > > > > > could certainly fix the definition here to make it exactly
> > > > > > representable in the IR, but then what's the point of doing
> > > > > > so?
> > > > > >
> > > > > > In the RFC, you mentioned cost modeling as a major
> > > > > > motivation
> > > > > > for
> > > > > > adding intrinsics, but that seems like an unrelated problem
> > > > > > (at
> > > > > > least in part).
> > > > >
> > > > > In the RFC, I had also proposed an llvm.sad intrinsic which
> > > > > I thought would make cost modeling easier for those targets
> > > > > which support the sad operation natively; hence the
> > > > > motivation. But Renato's explanation clarified the things
> > > > > related to cost modeling. However, the complexity of
> > > > > pattern matching was always the bigger issue.
> > > >
> > > > Perhaps, but it is localized. And matching adds and subtracts
> > > > plus
> > > > shuffles
> > > > does not seem like it should be unusually painful.
> > > >
> > > > >
> > > > > > During
> > > > > > vectorization, we can use special interfaces to estimate
> > > > > > the
> > > > > > cost of
> > > > > > complex patterns. In fact, we already have an interface for
> > > > > > reductions:
> > > > > > TTI.getReductionCost. There is a second relevant code
> > > > > > model:
> > > > > > That
> > > > > > used by the unroller and inliner. Vectorization happens
> > > > > > after
> > > > > > inlining, so that interaction is not really relevant, but
> > > > > > partial
> > > > > > unrolling happens after vectorization, and so the cost
> > > > > > model
> > > > > > there
> > > > > > might want to understand that a complex sequence of
> > > > > > shuffles,
> > > > > > extracts and adds has a disproportionately low cost. The
> > > > > > same is true of the inliner if the input IR uses
> > > > > > vector types and initially contains these operations, but
> > > > > > even
> > > > > > in
> > > > > > that case, you'd not want to canonicalize on the intrinsics
> > > > > > too
> > > > > > early in case other optimizations remove the need for most
> > > > > > of
> > > > > > the
> > > > > > operations. Thus, in the end, you need pattern-matching
> > > > > > code near the end of the pipeline anyway to
> > > > > > account
> > > > > > for
> > > > > > input IR directly containing relevant operations on vector
> > > > > > types.
> > > > > >
> > > > > > In short, I don't understand why we're going in this
> > > > > > direction.
> > > > > > You
> > > > > > can match these operations in the backend, and you can
> > > > > > match
> > > > > > them in
> > > > > > the cost-model code. If we can't do the latter, then we
> > > > > > should
> > > > > > fix
> > > > > > that. And we already have special interfaces for
> > > > > > vectorization
> > > > > > costs
> > > > > > for complex operations such as this.
> > > > > >
> > > > > >
> > > > > > http://reviews.llvm.org/D10964
> > > > > >
> > > > > >
> > > > >
> > > > > _______________________________________________
> > > > > llvm-commits mailing list
> > > > > llvm-commits at lists.llvm.org
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> > > > >
> > > >
> > > > --
> > > > Hal Finkel
> > > > Assistant Computational Scientist
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > >
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

