[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
Shahid, Asghar-ahmad via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 1 01:32:30 PDT 2015
Hi Hal,
Reply inlined.
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Thursday, October 01, 2015 4:14 AM
> To: Shahid, Asghar-ahmad
> Cc: llvm-commits at lists.llvm.org;
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato golin;
> james molloy
> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
>
> ----- Original Message -----
> > From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-commits at lists.llvm.org,
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato
> golin"
> > <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> > Sent: Wednesday, September 30, 2015 1:19:06 AM
> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
> >
> > Hi Hal,
> >
> > Let me present the full context. For the example below:
> > ===============================================
> > #include <stdlib.h>
> >
> > unsigned sad_fun(unsigned char *a1, unsigned char *a2) {
> >   unsigned sum = 0;
> >   for (int i = 0; i < 16; i++) {
> >     sum += abs(a1[i] - a2[i]);
> >   }
> >   return sum;
> > }
> > ===============================================
> >
> > LLVM IR:
> >
> > ; ModuleID = 'sad_1.c'
> > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > target triple = "x86_64-unknown-linux-gnu"
> >
> > ; Function Attrs: nounwind readonly uwtable
> > define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture readonly %a2) #0 {
> > entry:
> > br label %vector.body
> >
> > vector.body: ; preds = %entry
> > %0 = bitcast i8* %a1 to <4 x i8>*
> > %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
> > %1 = zext <4 x i8> %wide.load to <4 x i32>
> > %2 = bitcast i8* %a2 to <4 x i8>*
> > %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
> > %3 = zext <4 x i8> %wide.load12 to <4 x i32>
> > %4 = sub nsw <4 x i32> %1, %3
> > %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %6 = sub nsw <4 x i32> zeroinitializer, %4
> > %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
> > %8 = getelementptr inbounds i8, i8* %a1, i64 4
> > %9 = bitcast i8* %8 to <4 x i8>*
> > %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
> > %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> > %11 = getelementptr inbounds i8, i8* %a2, i64 4
> > %12 = bitcast i8* %11 to <4 x i8>*
> > %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
> > %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
> > %14 = sub nsw <4 x i32> %10, %13
> > %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %16 = sub nsw <4 x i32> zeroinitializer, %14
> > %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> > %18 = add nsw <4 x i32> %17, %7
> > %19 = getelementptr inbounds i8, i8* %a1, i64 8
> > %20 = bitcast i8* %19 to <4 x i8>*
> > %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
> > %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
> > %22 = getelementptr inbounds i8, i8* %a2, i64 8
> > %23 = bitcast i8* %22 to <4 x i8>*
> > %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
> > %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
> > %25 = sub nsw <4 x i32> %21, %24
> > %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %27 = sub nsw <4 x i32> zeroinitializer, %25
> > %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
> > %29 = add nsw <4 x i32> %28, %18
> > %30 = getelementptr inbounds i8, i8* %a1, i64 12
> > %31 = bitcast i8* %30 to <4 x i8>*
> > %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
> > %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
> > %33 = getelementptr inbounds i8, i8* %a2, i64 12
> > %34 = bitcast i8* %33 to <4 x i8>*
> > %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
> > %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
> > %36 = sub nsw <4 x i32> %32, %35
> > %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
> > %38 = sub nsw <4 x i32> zeroinitializer, %36
> > %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
> > %40 = add nsw <4 x i32> %39, %29
> > %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
> > %bin.rdx = add <4 x i32> %40, %rdx.shuf
> > %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
> > %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
> > %41 = extractelement <4 x i32> %bin.rdx14, i32 0
> > ret i32 %41
> > }
> >
> > In the above LLVM IR, the vectorizer has vectorized the SAD pattern
> > with vector type v4xi8.
> > On X86, for example, we want to generate the "v16xi8" version of the
> > *PSAD* instruction. However, to do this we need to catch the full
> > chain of the vectorized absolute difference and the reduction sum.
> > We found this difficult to pattern match in codegen and decided to
> > go the intrinsic way.
>
> But you need the vectorizer to vectorize in terms of v16xi8 regardless of how
> you match the sum. You say that the pattern is difficult to match, but do you
> mean matching it with the right or the wrong vectorization factor?
I mean matching with the right vectorization factor for a given target.
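For concreteness, this is roughly what we would like the vectorizer to emit
instead of the long chain above. The intrinsic names and signatures here are
only illustrative placeholders for the ones proposed in D10867 (absolute
difference) and D10964 (horizontal sum), not the final definitions:

; Illustrative sketch only; assumes the proposed intrinsics exist.
define i32 @sad_fun_intrinsics(<16 x i8>* %p1, <16 x i8>* %p2) {
entry:
  %v1 = load <16 x i8>, <16 x i8>* %p1, align 1
  %v2 = load <16 x i8>, <16 x i8>* %p2, align 1
  ; element-wise absolute difference; result width equals element width
  %ad = call <16 x i8> @llvm.uabsdiff.v16i8(<16 x i8> %v1, <16 x i8> %v2)
  ; unordered horizontal sum of all lanes, widened to the scalar result
  %sum = call i32 @llvm.uhsum.i32.v16i8(<16 x i8> %ad)
  ret i32 %sum
}

declare <16 x i8> @llvm.uabsdiff.v16i8(<16 x i8>, <16 x i8>)
declare i32 @llvm.uhsum.i32.v16i8(<16 x i8>)

A pair of calls like this gives ISel a single target-independent anchor to
match, instead of the load/zext/sub/select/shuffle chain.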
>
> > We also realized that with an intrinsic we can use SLP to vectorize
> > the unrolled pattern as well, if required. We then split the SAD
> > operation into two intrinsics which can be used to generate the "SAD"
> > operation as well as in other contexts. For example "hsum", as
> > mentioned by Bruno, can be used in the context of a POPCNT operation.
> >
> > > No; vectorization is not special in that regard, and needs to be
> > > semantics preserving. We have 'fast math' flags for floating-point
> > > operations, which help there, but there are no such flags for
> > > integer operations.
> > Do we need an ordering definition for the integer ADD operation?
> >
> > > Do the underlying hardware operations on x86 (etc.) really not
> > > define an ordering for the horizontal adds?
> > X86 *PSAD*, which operates on integer types and does the horizontal
> > add on an intermediate result, does not define an ordering.
>
> Do you mean that the intermediate accumulator has a larger number of bits
> than each individual vector element?
No, by "intermediate result" I meant the result of the "Absolute Difference"
operation, whose width is the same as that of an individual vector element.
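To make that concrete, one 64-bit lane of PSADBW computes something like the
IR below (a hand-written sketch of the documented semantics, not code from
this patch). The absolute differences never exceed the i8 value range, and
the eight-way sum is at most 8 * 255 = 2040, so the adds cannot overflow and
no association of them can change the result:

; Sketch of one 64-bit PSADBW lane; illustrative only.
define i64 @psadbw_lane(<8 x i8> %a, <8 x i8> %b) {
entry:
  ; widen first so the subtraction cannot wrap
  %az = zext <8 x i8> %a to <8 x i16>
  %bz = zext <8 x i8> %b to <8 x i16>
  %d = sub <8 x i16> %az, %bz
  ; absolute value; each lane still fits in the original element width
  %neg = sub <8 x i16> zeroinitializer, %d
  %isneg = icmp slt <8 x i16> %d, zeroinitializer
  %ad = select <8 x i1> %isneg, <8 x i16> %neg, <8 x i16> %d
  ; horizontal sum; max 2040, so the i16 adds cannot overflow in any order
  %s1 = shufflevector <8 x i16> %ad, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %a1 = add <8 x i16> %ad, %s1
  %s2 = shufflevector <8 x i16> %a1, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %a2 = add <8 x i16> %a1, %s2
  %s3 = shufflevector <8 x i16> %a2, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %a3 = add <8 x i16> %a2, %s3
  %lo = extractelement <8 x i16> %a3, i32 0
  %r = zext i16 %lo to i64
  ret i64 %r
}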
>
> -Hal
>
> >
> > Regards,
> > Shahid
> >
> > > -----Original Message-----
> > > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > > Sent: Wednesday, September 30, 2015 3:42 AM
> > > To: Shahid, Asghar-ahmad
> > > Cc: llvm-commits at lists.llvm.org;
> > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > > golin;
> > > james molloy
> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > >
> > > ----- Original Message -----
> > > > From: "Asghar-ahmad via llvm-commits Shahid"
> > > > <llvm-commits at lists.llvm.org>
> > > > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> > > > "renato
> > > golin" <renato.golin at linaro.org>, "james molloy"
> > > > <james.molloy at arm.com>
> > > > Cc: llvm-commits at lists.llvm.org
> > > > Sent: Sunday, September 27, 2015 11:42:46 PM
> > > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > > >
> > > > Hi Hal,
> > > >
> > > > Response inlined.
> > > >
> > > > Regards,
> > > > Shahid
> > > >
> > > > > -----Original Message-----
> > > > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > > > > Sent: Saturday, September 26, 2015 2:58 AM
> > > > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> > > > > hfinkel at anl.gov; james.molloy at arm.com
> > > > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > > > > commits at lists.llvm.org
> > > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > > and
> > > > > corresponding SDNodes for horizontal sum operation.
> > > > >
> > > > > hfinkel added a comment.
> > > > >
> > > > > I need to take a step back here; why are we doing this again?
> > > > >
> > > > > I read again the RFC thread
> > > > > (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html),
> > > > > and it ended with the following (from Renato):
> > > > >
> > > > > > > BTW, now my plan is to just add the two intrinsics for
> > > > > > > 'absolute difference' and 'horizontal add'.
> > > > >
> > > > > > That's ok, as long as they're impossible to represent in
> > > > > > plain IR.
> > > > >
> > > > >
> > > > > and I think that we were all in agreement on this point. But now
> > > > > I'm not sure you've demonstrated the prerequisite.
> > > > Prior to this RFC there was a discussion around a patch
> > > > (http://reviews.llvm.org/D9029) regarding an llvm.sad intrinsic,
> > > > which highlighted the need for llvm.hsum and related intrinsics.
> > > > This RFC was a result of that discussion.
> > > >
> > > > > The underlying operations here (and in
> > > > > http://reviews.llvm.org/D10867) seem like they are representable
> > > > > using IR (as demonstrated by the fact that you provide
> > > > > potentially-equivalent IR sequences in the documentation),
> > > > > except for the ordering freedom here.
> > > >
> > > > Of course the logical expansion of the underlying operation is
> > > > trivially representable; the pattern matching in codegen is not.
> > > > The pattern matching is complex and may well be fragile due to
> > > > the long chain of generic IR that has to be matched. Moreover,
> > > > movement of the truncate/extend ops in the DAG increases the
> > > > complexity of matching further.
> > >
> > > It might, but it is not clearly bad enough to require an alternative
> > > solution.
> > >
> > > In my experience, the patterns that get difficult to match at the
> > > SDAG level are those that consist, in whole or in part, of
> > > operations that are illegal (either involving illegal types or
> > > illegal operations). This becomes difficult because you either need
> > > to match the post-legalized nodes (which might involve undoing
> > > scalarization, etc.) or you need to match the nodes pre-legalization
> > > (where you cannot depend on the canonicalization that restricts the
> > > state space later). Here, however, we seem to be dealing with
> > > patterns of otherwise legal nodes, on otherwise legal vector types,
> > > and while the matching might not be trivial, it is not obviously
> > > difficult in *ISelLowering.cpp either.
> > >
> > > The other issue that I've run into with SDAG pattern matching is the
> > > fact that it is basic-block local, and some of the operations you
> > > want to match may have been CSE'd or otherwise hoisted into
> > > dominating blocks. We generally put code in CGP to deal with these
> > > kinds of things.
> > >
> > > One advantage of putting the code into SDAG pattern matching is that
> > > the patterns get used regardless of where the code originates
> > > (vectorizers or manually-written vector code). This is important.
> > >
> > > >
> > > > >
> > > > > And this, I fear, is where we run into trouble. The thing that
> > > > > is not representable in the IR is that the order of operations
> > > > > in the horizontal sum is undefined, and at the same time, signed
> > > > > overflow is undefined. This cannot be represented in the IR
> > > > > because nsw adds don't reassociate, and thus, there's no way to
> > > > > represent the sequence of nsw adds such that they can be
> > > > > reassociated while retaining their nsw property. But the problem
> > > > > is that, because this freedom cannot be represented in the IR,
> > > > > we also can't generate it from IR in a semantics-preserving way;
> > > > > and, thus, it would not be legal to generate it in the
> > > > > vectorizers.
> > > >
> > > > As proposed, we are not going to generate it unconditionally in
> > > > the vectorizer but only based on target checks; can't we also
> > > > check for ordering relaxation there, or fix the ordering
> > > > constraint by some other means?
> > >
> > > No; vectorization is not special in that regard, and needs to be
> > > semantics preserving. We have 'fast math' flags for floating-point
> > > operations, which help there, but there are no such flags for
> > > integer operations.
> > >
> > > Do the underlying hardware operations on x86 (etc.) really not
> > > define an ordering for the horizontal adds?
> > >
> > > >
> > > > >
> > > > > Thus, this change does not seem right, and approving
> > > > > http://reviews.llvm.org/D10867 seems like a mistake as well. We
> > > > > could certainly fix the definition here to make it exactly
> > > > > representable in the IR, but then what's the point of doing so?
> > > > >
> > > > > In the RFC, you mentioned cost modeling as a major motivation
> > > > > for adding intrinsics, but that seems like an unrelated problem
> > > > > (at least in part).
> > > >
> > > > In the RFC, I had also proposed an llvm.sad intrinsic, which I
> > > > thought would make cost modeling easier for those targets which
> > > > support a SAD operation natively; hence the motivation. Renato's
> > > > explanation clarified the issues related to cost modeling.
> > > > However, the complexity of pattern matching was always the bigger
> > > > issue.
> > >
> > > Perhaps, but it is localized. And matching adds and subtracts plus
> > > shuffles does not seem like it should be unusually painful.
> > >
> > > >
> > > > > During vectorization, we can use special interfaces to estimate
> > > > > the cost of complex patterns. In fact, we already have an
> > > > > interface for reductions: TTI.getReductionCost. There is a
> > > > > second relevant cost model: that used by the unroller and
> > > > > inliner. Vectorization happens after inlining, so that
> > > > > interaction is not really relevant, but partial unrolling
> > > > > happens after vectorization, and so the cost model there might
> > > > > want to understand that a complex sequence of shuffles, extracts
> > > > > and adds has a disproportionately low cost. The same is true of
> > > > > the inliner if the input IR uses vector types and initially
> > > > > contains these operations, but even in that case, you'd not want
> > > > > to canonicalize on the intrinsics too early in case other
> > > > > optimizations remove the need for most of the operations. Thus,
> > > > > in the end, you need pattern-matching code near the end of the
> > > > > pipeline anyway to account for input IR directly containing
> > > > > relevant operations on vector types.
> > > > >
> > > > > In short, I don't understand why we're going in this direction.
> > > > > You can match these operations in the backend, and you can
> > > > > match them in the cost-model code. If we can't do the latter,
> > > > > then we should fix that. And we already have special interfaces
> > > > > for vectorization costs for complex operations such as this.
> > > > >
> > > > >
> > > > > http://reviews.llvm.org/D10964
> > > > >
> > > > >
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at lists.llvm.org
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory