[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
Shahid, Asghar-ahmad via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 1 01:32:30 PDT 2015
Hi Hal,
Reply inlined.
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Thursday, October 01, 2015 4:14 AM
> To: Shahid, Asghar-ahmad
> Cc: llvm-commits at lists.llvm.org;
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato golin;
> james molloy
> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
>
> ----- Original Message -----
> > From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-commits at lists.llvm.org,
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato
> golin"
> > <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> > Sent: Wednesday, September 30, 2015 1:19:06 AM
> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
> >
> > Hi Hal,
> >
> > Let me present the full context. For the example below:
> > ===============================================
> > #include <stdlib.h>
> >
> > unsigned sad_fun(unsigned char *a1, unsigned char *a2) {
> >   unsigned sum = 0;
> >   for (int i = 0; i < 16; i++) {
> >     sum += abs(a1[i] - a2[i]);
> >   }
> >   return sum;
> > }
> > ===============================================
> >
> > LLVM IR:
> >
> > ; ModuleID = 'sad_1.c'
> > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > target triple = "x86_64-unknown-linux-gnu"
> >
> > ; Function Attrs: nounwind readonly uwtable
> > define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture readonly %a2) #0 {
> > entry:
> > br label %vector.body
> >
> > vector.body: ; preds = %entry
> > %0 = bitcast i8* %a1 to <4 x i8>*
> > %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
> > %1 = zext <4 x i8> %wide.load to <4 x i32>
> > %2 = bitcast i8* %a2 to <4 x i8>*
> > %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
> > %3 = zext <4 x i8> %wide.load12 to <4 x i32>
> > %4 = sub nsw <4 x i32> %1, %3
> > %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %6 = sub nsw <4 x i32> zeroinitializer, %4
> > %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
> > %8 = getelementptr inbounds i8, i8* %a1, i64 4
> > %9 = bitcast i8* %8 to <4 x i8>*
> > %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
> > %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> > %11 = getelementptr inbounds i8, i8* %a2, i64 4
> > %12 = bitcast i8* %11 to <4 x i8>*
> > %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
> > %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
> > %14 = sub nsw <4 x i32> %10, %13
> > %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %16 = sub nsw <4 x i32> zeroinitializer, %14
> > %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> > %18 = add nsw <4 x i32> %17, %7
> > %19 = getelementptr inbounds i8, i8* %a1, i64 8
> > %20 = bitcast i8* %19 to <4 x i8>*
> > %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
> > %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
> > %22 = getelementptr inbounds i8, i8* %a2, i64 8
> > %23 = bitcast i8* %22 to <4 x i8>*
> > %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
> > %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
> > %25 = sub nsw <4 x i32> %21, %24
> > %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %27 = sub nsw <4 x i32> zeroinitializer, %25
> > %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
> > %29 = add nsw <4 x i32> %28, %18
> > %30 = getelementptr inbounds i8, i8* %a1, i64 12
> > %31 = bitcast i8* %30 to <4 x i8>*
> > %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
> > %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
> > %33 = getelementptr inbounds i8, i8* %a2, i64 12
> > %34 = bitcast i8* %33 to <4 x i8>*
> > %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
> > %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
> > %36 = sub nsw <4 x i32> %32, %35
> > %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %38 = sub nsw <4 x i32> zeroinitializer, %36
> > %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
> > %40 = add nsw <4 x i32> %39, %29
> > %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
> > %bin.rdx = add <4 x i32> %40, %rdx.shuf
> > %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
> > %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
> > %41 = extractelement <4 x i32> %bin.rdx14, i32 0
> > ret i32 %41
> > }
> >
> > In the above LLVM IR, the vectorizer has vectorized the SAD pattern
> > with vector type v4xi8.
> > On X86, for example, we want to generate the "v16xi8" version of the
> > *PSAD* instruction. However, to do this we need to catch the full
> > chain of the vectorized absolute difference and the reduction sum.
> > We found this difficult to pattern match in codegen and decided to
> > go the intrinsic way.
>
> But you need the vectorizer to vectorize in terms of v16xi8 regardless of how
> you match the sum. You say that the pattern is difficult to match, but do you
> mean matching it with the right or the wrong vectorization factor?
I mean matching with the right vectorization factor for a given target.
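For concreteness, this is roughly what we would like the vectorizer to emit
instead of the long chain above. The intrinsic names and signatures here are
only illustrative placeholders for the ones proposed in D10867 (absolute
difference) and D10964 (horizontal sum), not the final definitions:

; Illustrative sketch only; assumes the proposed intrinsics exist.
define i32 @sad_fun_intrinsics(<16 x i8>* %p1, <16 x i8>* %p2) {
entry:
  %v1 = load <16 x i8>, <16 x i8>* %p1, align 1
  %v2 = load <16 x i8>, <16 x i8>* %p2, align 1
  ; element-wise absolute difference; result width equals element width
  %ad = call <16 x i8> @llvm.uabsdiff.v16i8(<16 x i8> %v1, <16 x i8> %v2)
  ; unordered horizontal sum of all lanes, widened to the scalar result
  %sum = call i32 @llvm.uhsum.i32.v16i8(<16 x i8> %ad)
  ret i32 %sum
}

declare <16 x i8> @llvm.uabsdiff.v16i8(<16 x i8>, <16 x i8>)
declare i32 @llvm.uhsum.i32.v16i8(<16 x i8>)

A pair of calls like this gives ISel a single target-independent anchor to
match, instead of the load/zext/sub/select/shuffle chain.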
>
> > We also realized that with an intrinsic we can use SLP to vectorize
> > the unrolled pattern as well, if required. We then split the SAD
> > operation into two intrinsics which can be used to generate the "SAD"
> > operation as well as in other contexts. For example "hsum", as
> > mentioned by Bruno, can be used in the context of a POPCNT operation.
> >
> > > No; vectorization is not special in that regard, and needs to be
> > > semantics preserving. We have 'fast math' flags for floating-point
> > > operations, which help there, but there are no such flags for
> > > integer operations.
> > Do we need an ordering definition for the integer ADD operation?
> >
> > > Do the underlying hardware operations on x86 (etc.) really not
> > > define an ordering for the horizontal adds?
> > X86 *PSAD*, which operates on integer types and does the horizontal
> > add on an intermediate result, does not define an ordering.
>
> Do you mean that the intermediate accumulator has a larger number of bits
> than each individual vector element?
No, by "intermediate result" I meant the result of the "Absolute Difference"
operation, whose width is the same as that of an individual vector element.
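To make that concrete, one 64-bit lane of PSADBW computes something like the
IR below (a hand-written sketch of the documented semantics, not code from
this patch). The absolute differences never exceed the i8 value range, and
the eight-way sum is at most 8 * 255 = 2040, so the adds cannot overflow and
no association of them can change the result:

; Sketch of one 64-bit PSADBW lane; illustrative only.
define i64 @psadbw_lane(<8 x i8> %a, <8 x i8> %b) {
entry:
  ; widen first so the subtraction cannot wrap
  %az = zext <8 x i8> %a to <8 x i16>
  %bz = zext <8 x i8> %b to <8 x i16>
  %d = sub <8 x i16> %az, %bz
  ; absolute value; each lane still fits in the original element width
  %neg = sub <8 x i16> zeroinitializer, %d
  %isneg = icmp slt <8 x i16> %d, zeroinitializer
  %ad = select <8 x i1> %isneg, <8 x i16> %neg, <8 x i16> %d
  ; horizontal sum; max 2040, so the i16 adds cannot overflow in any order
  %s1 = shufflevector <8 x i16> %ad, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %a1 = add <8 x i16> %ad, %s1
  %s2 = shufflevector <8 x i16> %a1, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %a2 = add <8 x i16> %a1, %s2
  %s3 = shufflevector <8 x i16> %a2, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %a3 = add <8 x i16> %a2, %s3
  %lo = extractelement <8 x i16> %a3, i32 0
  %r = zext i16 %lo to i64
  ret i64 %r
}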
>
> -Hal
>
> >
> > Regards,
> > Shahid
> >
> > > -----Original Message-----
> > > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > > Sent: Wednesday, September 30, 2015 3:42 AM
> > > To: Shahid, Asghar-ahmad
> > > Cc: llvm-commits at lists.llvm.org;
> > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > > golin;
> > > james molloy
> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > >
> > > ----- Original Message -----
> > > > From: "Asghar-ahmad via llvm-commits Shahid"
> > > > <llvm-commits at lists.llvm.org>
> > > > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> > > > "renato
> > > golin" <renato.golin at linaro.org>, "james molloy"
> > > > <james.molloy at arm.com>
> > > > Cc: llvm-commits at lists.llvm.org
> > > > Sent: Sunday, September 27, 2015 11:42:46 PM
> > > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > > >
> > > > Hi Hal,
> > > >
> > > > Response inlined.
> > > >
> > > > Regards,
> > > > Shahid
> > > >
> > > > > -----Original Message-----
> > > > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > > > > Sent: Saturday, September 26, 2015 2:58 AM
> > > > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> > > > > hfinkel at anl.gov; james.molloy at arm.com
> > > > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > > > > commits at lists.llvm.org
> > > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > > and
> > > > > corresponding SDNodes for horizontal sum operation.
> > > > >
> > > > > hfinkel added a comment.
> > > > >
> > > > > I need to take a step back here; why are we doing this again?
> > > > >
> > > > > I read again the RFC thread
> > > > > (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html),
> > > > > and it ended with the following (from Renato):
> > > > >
> > > > > > > BTW, now my plan is to just add the two intrinsics for
> > > > > > > 'absolute difference' and 'horizontal add'.
> > > > >
> > > > > > That's ok, as long as they're impossible to represent in
> > > > > > plain IR.
> > > > >
> > > > >
> > > > > and I think that we were all in agreement on this point. But now
> > > > > I'm not sure you've demonstrated the prerequisite.
> > > > Prior to this RFC there was a discussion around a patch
> > > > (http://reviews.llvm.org/D9029) regarding an llvm.sad intrinsic,
> > > > which highlighted the need for llvm.hsum and related intrinsics.
> > > > This RFC was a result of that discussion.
> > > >
> > > > > The underlying operations here (and in
> > > > > http://reviews.llvm.org/D10867) seem like they are representable
> > > > > using IR (as demonstrated by the fact that you provide
> > > > > potentially-equivalent IR sequences in the documentation),
> > > > > except for the ordering freedom here.
> > > >
> > > > Of course the logical expansion of the underlying operation is
> > > > trivially representable; the pattern matching in codegen is not.
> > > > The pattern matching is complex and may well be fragile due to
> > > > the long chain of generic IR that has to be matched. Moreover,
> > > > movement of the truncate/extend ops in the DAG increases the
> > > > complexity of matching further.
> > >
> > > It might, but it is not clearly bad enough to require an alternative
> > > solution.
> > >
> > > In my experience, the patterns that get difficult to match at the
> > > SDAG level are those that consist, in whole or in part, of
> > > operations that are illegal (either involving illegal types or
> > > illegal operations). This becomes difficult because you either need
> > > to match the post-legalized nodes (which might involve undoing
> > > scalarization, etc.) or you need to match the nodes pre-legalization
> > > (where you cannot depend on the canonicalization that restricts the
> > > state space later). Here, however, we seem to be dealing with
> > > patterns of otherwise legal nodes, on otherwise legal vector types,
> > > and while the matching might not be trivial, it is not obviously
> > > difficult in *ISelLowering.cpp either.
> > >
> > > The other issue that I've run into with SDAG pattern matching is the
> > > fact that it is basic-block local, and some of the operations you
> > > want to match may have been CSE'd or otherwise hoisted into
> > > dominating blocks. We generally put code in CGP to deal with these
> > > kinds of things.
> > >
> > > One advantage of putting the code into SDAG pattern matching is that
> > > the patterns get used regardless of where the code originates
> > > (vectorizers or manually-written vector code). This is important.
> > >
> > > >
> > > > >
> > > > > And this, I fear, is where we run into trouble. The thing that
> > > > > is not representable in the IR is that the order of operations
> > > > > in the horizontal sum is undefined, and at the same time, signed
> > > > > overflow is undefined. This cannot be represented in the IR
> > > > > because nsw adds don't reassociate, and thus, there's no way to
> > > > > represent the sequence of nsw adds such that they can be
> > > > > reassociated while retaining their nsw property. But the problem
> > > > > is that, because this freedom cannot be represented in the IR,
> > > > > we also can't generate it from IR in a semantics-preserving way;
> > > > > and, thus, it would not be legal to generate it in the
> > > > > vectorizers.
> > > >
> > > > As proposed, we are not going to generate it unconditionally in
> > > > the vectorizer but only based on target checks; can't we also
> > > > check for ordering relaxation there, or fix the ordering
> > > > constraint by some other means?
> > >
> > > No; vectorization is not special in that regard, and needs to be
> > > semantics preserving. We have 'fast math' flags for floating-point
> > > operations, which help there, but there are no such flags for
> > > integer operations.
> > >
> > > Do the underlying hardware operations on x86 (etc.) really not
> > > define an ordering for the horizontal adds?
> > >
> > > >
> > > > >
> > > > > Thus, this change does not seem right, and approving
> > > > > http://reviews.llvm.org/D10867 seems like a mistake as well. We
> > > > > could certainly fix the definition here to make it exactly
> > > > > representable in the IR, but then what's the point of doing so?
> > > > >
> > > > > In the RFC, you mentioned cost modeling as a major motivation
> > > > > for adding intrinsics, but that seems like an unrelated problem
> > > > > (at least in part).
> > > >
> > > > In the RFC, I had also proposed an llvm.sad intrinsic, which I
> > > > thought would make cost modeling easier for those targets which
> > > > support a SAD operation natively; hence the motivation. Renato's
> > > > explanation clarified the issues related to cost modeling.
> > > > However, the complexity of pattern matching was always the bigger
> > > > issue.
> > >
> > > Perhaps, but it is localized. And matching adds and subtracts plus
> > > shuffles does not seem like it should be unusually painful.
> > >
> > > >
> > > > > During vectorization, we can use special interfaces to estimate
> > > > > the cost of complex patterns. In fact, we already have an
> > > > > interface for reductions: TTI.getReductionCost. There is a
> > > > > second relevant cost model: that used by the unroller and
> > > > > inliner. Vectorization happens after inlining, so that
> > > > > interaction is not really relevant, but partial unrolling
> > > > > happens after vectorization, and so the cost model there might
> > > > > want to understand that a complex sequence of shuffles, extracts
> > > > > and adds has a disproportionately low cost. The same is true of
> > > > > the inliner if the input IR uses vector types and initially
> > > > > contains these operations, but even in that case, you'd not want
> > > > > to canonicalize on the intrinsics too early in case other
> > > > > optimizations remove the need for most of the operations. Thus,
> > > > > in the end, you need pattern-matching code near the end of the
> > > > > pipeline anyway to account for input IR directly containing
> > > > > relevant operations on vector types.
> > > > >
> > > > > In short, I don't understand why we're going in this direction.
> > > > > You can match these operations in the backend, and you can
> > > > > match them in the cost-model code. If we can't do the latter,
> > > > > then we should fix that. And we already have special interfaces
> > > > > for vectorization costs for complex operations such as this.
> > > > >
> > > > >
> > > > > http://reviews.llvm.org/D10964
> > > > >
> > > > >
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at lists.llvm.org
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory