[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Shahid, Asghar-ahmad via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 1 10:16:56 PDT 2015


Hi Hal,

> This pattern does not look hard to match in the backend, albeit a bit tedious.
> Did you try and run into problems?

No, I did not try that.

Regards,
Shahid

> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Thursday, October 01, 2015 5:58 PM
> To: Shahid, Asghar-ahmad
> Cc: llvm-commits at lists.llvm.org;
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato golin;
> james molloy
> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
> 
> ----- Original Message -----
> > From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-commits at lists.llvm.org,
> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato
> golin"
> > <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> > Sent: Thursday, October 1, 2015 3:32:30 AM
> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
> >
> > Hi Hal,
> >
> > Reply inlined.
> >
> > > -----Original Message-----
> > > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > > Sent: Thursday, October 01, 2015 4:14 AM
> > > To: Shahid, Asghar-ahmad
> > > Cc: llvm-commits at lists.llvm.org;
> > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > > golin;
> > > james molloy
> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > >
> > > ----- Original Message -----
> > > > From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> > > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > > Cc: llvm-commits at lists.llvm.org,
> > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato
> > > golin"
> > > > <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> > > > Sent: Wednesday, September 30, 2015 1:19:06 AM
> > > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > > corresponding SDNodes for horizontal sum operation.
> > > >
> > > > Hi Hal,
> > > >
> > > > Let me present the full context. For below example:
> > > > ===============================================
> > > > unsigned sad_fun(unsigned char *a1, unsigned char *a2) {
> > > >   unsigned sum = 0;
> > > >   for (int i = 0; i < 16; i++) {
> > > >     sum += abs(a1[i] - a2[i]);
> > > >   }
> > > >   return sum;
> > > > }
> > > > ===============================================
> > > >
> > > > LLVM IR:
> > > >
> > > > ; ModuleID = 'sad_1.c'
> > > > target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> > > > target triple = "x86_64-unknown-linux-gnu"
> > > >
> > > > ; Function Attrs: nounwind readonly uwtable
> > > > define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture readonly %a2) #0 {
> > > > entry:
> > > >   br label %vector.body
> > > >
> > > > vector.body:                                      ; preds = %entry
> > > >   %0 = bitcast i8* %a1 to <4 x i8>*
> > > >   %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
> > > >   %1 = zext <4 x i8> %wide.load to <4 x i32>
> > > >   %2 = bitcast i8* %a2 to <4 x i8>*
> > > >   %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
> > > >   %3 = zext <4 x i8> %wide.load12 to <4 x i32>
> > > >   %4 = sub nsw <4 x i32> %1, %3
> > > >   %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %6 = sub nsw <4 x i32> zeroinitializer, %4
> > > >   %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
> > > >   %8 = getelementptr inbounds i8, i8* %a1, i64 4
> > > >   %9 = bitcast i8* %8 to <4 x i8>*
> > > >   %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
> > > >   %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
> > > >   %11 = getelementptr inbounds i8, i8* %a2, i64 4
> > > >   %12 = bitcast i8* %11 to <4 x i8>*
> > > >   %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
> > > >   %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
> > > >   %14 = sub nsw <4 x i32> %10, %13
> > > >   %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %16 = sub nsw <4 x i32> zeroinitializer, %14
> > > >   %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
> > > >   %18 = add nsw <4 x i32> %17, %7
> > > >   %19 = getelementptr inbounds i8, i8* %a1, i64 8
> > > >   %20 = bitcast i8* %19 to <4 x i8>*
> > > >   %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
> > > >   %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
> > > >   %22 = getelementptr inbounds i8, i8* %a2, i64 8
> > > >   %23 = bitcast i8* %22 to <4 x i8>*
> > > >   %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
> > > >   %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
> > > >   %25 = sub nsw <4 x i32> %21, %24
> > > >   %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %27 = sub nsw <4 x i32> zeroinitializer, %25
> > > >   %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
> > > >   %29 = add nsw <4 x i32> %28, %18
> > > >   %30 = getelementptr inbounds i8, i8* %a1, i64 12
> > > >   %31 = bitcast i8* %30 to <4 x i8>*
> > > >   %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
> > > >   %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
> > > >   %33 = getelementptr inbounds i8, i8* %a2, i64 12
> > > >   %34 = bitcast i8* %33 to <4 x i8>*
> > > >   %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
> > > >   %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
> > > >   %36 = sub nsw <4 x i32> %32, %35
> > > >   %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %38 = sub nsw <4 x i32> zeroinitializer, %36
> > > >   %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
> > > >   %40 = add nsw <4 x i32> %39, %29
> > > >   %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
> > > >   %bin.rdx = add <4 x i32> %40, %rdx.shuf
> > > >   %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
> > > >   %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
> > > >   %41 = extractelement <4 x i32> %bin.rdx14, i32 0
> > > >   ret i32 %41
> > > > }
> > > >
> > > > In the above LLVM IR, the vectorizer has vectorized the SAD
> > > > pattern with vector type v4i8.
> > > > For example, on X86 we want to generate the "v16i8" version of
> > > > the *PSAD* instruction. However, to do this we need to catch the
> > > > full chain of the vectorized absolute difference and the
> > > > reduction sum. We found this difficult to pattern match in
> > > > codegen and thought to go the intrinsic way instead.
> > >
> > > But you need the vectorizer to vectorize in terms of v16i8
> > > regardless of how you match the sum. You say that the pattern is
> > > difficult to match, but do you mean matching it with the right or
> > > the wrong vectorization factor?
> >
> > I mean matching with the right vectorization factor for a given
> > target.
> 
> Good.
> 
> This pattern does not look hard to match in the backend, albeit a bit tedious.
> Did you try and run into problems?
> 
> >
> > >
> > > > We also realized
> > > > that with intrinsics we can also use SLP to vectorize the
> > > > unrolled pattern if required. We then simplified the SAD
> > > > intrinsic operation into two intrinsics which can be utilized to
> > > > generate the "SAD" operation as well as used in different
> > > > contexts. For example "hsum", as mentioned by Bruno, can be used
> > > > in the context of the POPCNT operation.
> > > >
> > > > > No; vectorization is not special in that regard, and needs to be
> > > > > semantics preserving. We have 'fast math' flags for
> > > > > floating-point operations, which help there, but there are no
> > > > > such flags for integer operations.
> > > > Do we need an ordering definition for the integer ADD operation?
> > > >
> > > > > Do the underlying hardware operations on x86 (etc.) really not
> > > > > define an ordering for the horizontal adds?
> > > > X86 *PSAD*, which operates on integer types and does the
> > > > horizontal add on the intermediate result, does not define an
> > > > ordering.
> > >
> > > Do you mean that the intermediate accumulator has a larger number of
> > > bits than each individual vector element?
> >
> > No, by "intermediate result" I meant the result of the "Absolute
> > Difference" operation, whose width is the same as an individual
> > vector element's.
> 
> Alright, looking at the documentation for PSADBW, I believe I see the source
> of our confusion. PSADBW is defined to, "Computes the absolute value of
> the difference of 8 unsigned byte integers from the source operand (first
> operand) and from the destination operand (second operand). These 8
> differences are then summed to produce an unsigned word integer result
> that is stored in the destination operand."
> 
> So, this, fundamentally, is an *unsigned* i8 operation. If this is what you're
> trying to model, we should remove from our discussion the topic of
> nsw-related semantics here. nuw also does not apply. Wrapping of all kinds is
> perfectly well defined. Given that wrapping is well defined, and that
> unsigned integer addition is associative (when we don't have wrapping
> restrictions), we also don't need to discuss the order of the additions either
> (they're irrelevant; all orderings would be permitted under 'as if' regardless
> of what we specified).
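
[Editor's note: the quoted PSADBW semantics can be sketched as a scalar C reference model. This is an illustrative sketch written for this archive, not code from the patch or the thread; the function name `psadbw_lane` is made up. It models one 64-bit lane as the documentation describes: absolute differences of 8 unsigned bytes, summed into an unsigned word, where all arithmetic is unsigned and the order of the additions cannot affect the result.]

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 64-bit lane of x86 PSADBW as quoted above:
 * absolute differences of 8 unsigned bytes, summed into an unsigned
 * word. The maximum possible sum is 8 * 255 = 2040, which fits in a
 * uint16_t, and unsigned addition is associative, so no ordering of
 * the adds needs to be specified. */
static uint16_t psadbw_lane(const uint8_t a[8], const uint8_t b[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (uint16_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    return sum;
}
```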
> 
>  -Hal
> 
> >
> > >
> > >  -Hal
> > >
> > > >
> > > > Regards,
> > > > Shahid
> > > >
> > > > > -----Original Message-----
> > > > > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > > > > Sent: Wednesday, September 30, 2015 3:42 AM
> > > > > To: Shahid, Asghar-ahmad
> > > > > Cc: llvm-commits at lists.llvm.org;
> > > > > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org;
> renato
> > > > > golin;
> > > > > james molloy
> > > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > > and
> > > > > corresponding SDNodes for horizontal sum operation.
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Asghar-ahmad via llvm-commits Shahid"
> > > > > > <llvm-commits at lists.llvm.org>
> > > > > > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> > > > > > "renato
> > > > > golin" <renato.golin at linaro.org>, "james molloy"
> > > > > > <james.molloy at arm.com>
> > > > > > Cc: llvm-commits at lists.llvm.org
> > > > > > Sent: Sunday, September 27, 2015 11:42:46 PM
> > > > > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > > > and
> > > > > corresponding SDNodes for horizontal sum operation.
> > > > > >
> > > > > > Hi Hal,
> > > > > >
> > > > > > Response inlined.
> > > > > >
> > > > > > Regards,
> > > > > > Shahid
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > > > > > > Sent: Saturday, September 26, 2015 2:58 AM
> > > > > > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> > > > > > > hfinkel at anl.gov; james.molloy at arm.com
> > > > > > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > > > > > > commits at lists.llvm.org
> > > > > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics
> > > > > > > 'hsum*'
> > > > > > > and
> > > > > > > corresponding SDNodes for horizontal sum operation.
> > > > > > >
> > > > > > > hfinkel added a comment.
> > > > > > >
> > > > > > > I need to take a step back here; why are we doing this
> > > > > > > again?
> > > > > > >
> > > > > > > I read again the RFC thread
> > > > > > > (http://lists.llvm.org/pipermail/llvm-dev/2015-
> > > > > > > May/085078.html), and it ended with the following (from
> > > > > > > Renato):
> > > > > > >
> > > > > > > > > BTW, now my plan is to just add the two intrinsics for
> > > > > > > > > 'absolute difference' and 'horizontal add'.
> > > > > > > >
> > > > > > > > That's ok, as long as they're impossible to represent in
> > > > > > > > plain IR.
> > > > > > >
> > > > > > > and I think that we were all in agreement on this point.
> > > > > > > But now
> > > > > > > I'm not sure you've demonstrated the prerequisite.
> > > > > > Prior to this RFC there was a discussion around a patch
> > > > > > (http://reviews.llvm.org/D9029) regarding an llvm.sad
> > > > > > intrinsic which highlighted the need for llvm.hsum and
> > > > > > friend intrinsics. This RFC was a result of that discussion.
> > > > > >
> > > > > > > The underlying operations here (and in
> > > > > > > http://reviews.llvm.org/D10867) seem like they are
> > > > > > > representable using IR (as demonstrated by the fact that you
> > > > > > > provide potentially-equivalent IR sequences in the
> > > > > > > documentation), except for the ordering freedom here.
> > > > > >
> > > > > > Of course the logical expansion of the underlying operation
> > > > > > is trivially representable, but the pattern matching in
> > > > > > codegen is not. In fact, the pattern matching is complex and
> > > > > > may well be fragile in codegen due to the long chain of
> > > > > > generic IR to be matched. Moreover, movement of
> > > > > > truncate/extend ops in the DAG increases the complexity of
> > > > > > matching further.
> > > > >
> > > > > It might, but it is not clearly bad enough to require an
> > > > > alternative solution.
> > > > >
> > > > > In my experience, the patterns that get difficult to match at
> > > > > the SDAG level are those that consist, in whole or in part, of
> > > > > operations that are illegal (either involving illegal types or
> > > > > illegal operations). This becomes difficult because you either
> > > > > need to match the post-legalized nodes (which might involve
> > > > > undoing scalarization, etc.) or you need to match the nodes
> > > > > pre-legalization (where you cannot depend on the
> > > > > canonicalization that restricts the state space later). Here,
> > > > > however, we seem to be dealing with patterns of otherwise legal
> > > > > nodes, on otherwise legal vector types, and while the matching
> > > > > might not be trivial, it is not obviously difficult in
> > > > > *ISelLowering.cpp either.
> > > > >
> > > > > The other issue that I've run into with SDAG pattern matching is
> > > > > the fact that it is basic-block local, and some of the
> > > > > operations you want to match may have been CSE'd or otherwise
> > > > > hoisted into dominating blocks. We generally put code in CGP to
> > > > > deal with these kinds of things.
> > > > >
> > > > > One advantage of putting the code into SDAG pattern matching is
> > > > > that the patterns get used regardless of where the code
> > > > > originates (vectorizers or manually-written vector code). This
> > > > > is important.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > And this, I fear, is where we run into trouble. The thing
> > > > > > > that is not representable in the IR is that the order of
> > > > > > > operations in the horizontal sum is undefined, and at the
> > > > > > > same time, signed overflow is undefined. This cannot be
> > > > > > > represented in the IR because nsw adds don't reassociate,
> > > > > > > and thus, there's no way to represent the sequence of nsw
> > > > > > > adds such that they can be reassociated while retaining
> > > > > > > their nsw property. But the problem is that, because this
> > > > > > > freedom cannot be represented in the IR, we also can't
> > > > > > > generate it from IR in a semantics-preserving way; and,
> > > > > > > thus, it would not be legal to generate it in the
> > > > > > > vectorizers.
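
[Editor's note: the reassociation point above can be illustrated with a small C sketch, written for this archive rather than taken from the thread. For unsigned arithmetic, wrapping is defined and every association order gives the same result, so no ordering needs to be specified; the analogous reassociation of signed (nsw-style) adds is invalid because an intermediate sum may overflow, which is undefined behavior in C.]

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit addition wraps modulo 2^32, so both association
 * orders always agree, even when an intermediate sum wraps. This is
 * why reassociating a chain of plain (wrapping) adds is always legal,
 * while reassociating nsw adds is not. */
static uint32_t sum_left(uint32_t a, uint32_t b, uint32_t c)  { return (a + b) + c; }
static uint32_t sum_right(uint32_t a, uint32_t b, uint32_t c) { return a + (b + c); }
```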
> > > > > >
> > > > > > As proposed, we are not going to generate it unconditionally
> > > > > > in the vectorizer, but only based on target checks; can't we
> > > > > > also check for ordering relaxation there, or fix the ordering
> > > > > > constraint by some other means?
> > > > >
> > > > > No; vectorization is not special in that regard, and needs to be
> > > > > semantics preserving. We have 'fast math' flags for
> > > > > floating-point operations, which help there, but there are no
> > > > > such flags for integer operations.
> > > > >
> > > > > Do the underlying hardware operations on x86 (etc.) really not
> > > > > define an ordering for the horizontal adds?
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Thus, this change does not seem right, and approving
> > > > > > > http://reviews.llvm.org/D10867 seems like a mistake as well.
> > > > > > > We could certainly fix the definition here to make it
> > > > > > > exactly representable in the IR, but then what's the point
> > > > > > > of doing so?
> > > > > > >
> > > > > > > In the RFC, you mentioned cost modeling as a major
> > > > > > > motivation for adding intrinsics, but that seems like an
> > > > > > > unrelated problem (at least in part).
> > > > > >
> > > > > > In the RFC, I had also proposed an llvm.sad intrinsic, which
> > > > > > I thought would make cost modeling easier for those targets
> > > > > > which support the SAD operation natively; hence the
> > > > > > motivation. But Renato's explanation clarified the issues
> > > > > > related to cost modeling. However, the complexity of pattern
> > > > > > matching was always the bigger issue.
> > > > >
> > > > > Perhaps, but it is localized. And matching adds and subtracts
> > > > > plus shuffles does not seem like it should be unusually painful.
> > > > >
> > > > > >
> > > > > > > During
> > > > > > > vectorization, we can use special interfaces to estimate the
> > > > > > > cost of complex patterns. In fact, we already have an
> > > > > > > interface for
> > > > > > > reductions:
> > > > > > > TTI.getReductionCost. There is a second relevant cost
> > > > > > > model: that used by the unroller and inliner.
> > > > > > > Vectorization happens
> > > > > > > after inlining, so that interaction is not really relevant,
> > > > > > > but partial unrolling happens after vectorization, and so
> > > > > > > the cost model there might want to understand that a complex
> > > > > > > sequence of shuffles, extracts and adds has a
> > > > > > > disproportionately-low cost. The same is true of the
> > > > > > > inliner if the input IR
> > > > > > > uses vector types and initially contains these operations,
> > > > > > > but even in that case, you'd not want to canonicalize on the
> > > > > > > intrinsics too early in case other optimizations remove the
> > > > > > > need for most of the operations. Thus, in the end, you need
> > > > > > > pattern-matching code near the end of the pipeline anyway
> > > > > > > to account
> > > > > > > for input IR directly containing relevant operations on
> > > > > > > vector types.
> > > > > > >
> > > > > > > In short, I don't understand why we're going in this
> > > > > > > direction. You can match these operations in the backend,
> > > > > > > and you can match
> > > > > > > them in the cost-model code. If we can't do the latter, then
> > > > > > > we should fix that. And we already have special interfaces
> > > > > > > for vectorization costs for complex operations such as this.
> > > > > > >
> > > > > > >
> > > > > > > http://reviews.llvm.org/D10964
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > llvm-commits mailing list
> > > > > > llvm-commits at lists.llvm.org
> > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> > > > > >
> > > > >
> > > > > --
> > > > > Hal Finkel
> > > > > Assistant Computational Scientist
> > > > > Leadership Computing Facility
> > > > > Argonne National Laboratory
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> >
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

