[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Wed Sep 30 15:43:32 PDT 2015

----- Original Message -----
> From: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at lists.llvm.org, reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato golin"
> <renato.golin at linaro.org>, "james molloy" <james.molloy at arm.com>
> Sent: Wednesday, September 30, 2015 1:19:06 AM
> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
> 
> Hi Hal,
> 
> Let me present the full context. For below example:
> ===============================================
> unsigned sad_fun(unsigned char *a1, unsigned char *a2)  {
> unsigned sum =0;
>     for(int i=0; i<16; i++) {
>       sum += abs(a1[i] - a2[i]);
>     }
>   return sum;
> }
> ===============================================
> 
> LLVM IR:
> 
> ; ModuleID = 'sad_1.c'
> target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> target triple = "x86_64-unknown-linux-gnu"
> 
> ; Function Attrs: nounwind readonly uwtable
> define i32 @sad_fun(i8* nocapture readonly %a1, i8* nocapture
> readonly %a2) #0 {
> entry:
>   br label %vector.body
> 
> vector.body:                                      ; preds = %entry
>   %0 = bitcast i8* %a1 to <4 x i8>*
>   %wide.load = load <4 x i8>, <4 x i8>* %0, align 1, !tbaa !1
>   %1 = zext <4 x i8> %wide.load to <4 x i32>
>   %2 = bitcast i8* %a2 to <4 x i8>*
>   %wide.load12 = load <4 x i8>, <4 x i8>* %2, align 1, !tbaa !1
>   %3 = zext <4 x i8> %wide.load12 to <4 x i32>
>   %4 = sub nsw <4 x i32> %1, %3
>   %5 = icmp sgt <4 x i32> %4, <i32 -1, i32 -1, i32 -1, i32 -1>
>   %6 = sub nsw <4 x i32> zeroinitializer, %4
>   %7 = select <4 x i1> %5, <4 x i32> %4, <4 x i32> %6
>   %8 = getelementptr inbounds i8, i8* %a1, i64 4
>   %9 = bitcast i8* %8 to <4 x i8>*
>   %wide.load.1 = load <4 x i8>, <4 x i8>* %9, align 1, !tbaa !1
>   %10 = zext <4 x i8> %wide.load.1 to <4 x i32>
>   %11 = getelementptr inbounds i8, i8* %a2, i64 4
>   %12 = bitcast i8* %11 to <4 x i8>*
>   %wide.load12.1 = load <4 x i8>, <4 x i8>* %12, align 1, !tbaa !1
>   %13 = zext <4 x i8> %wide.load12.1 to <4 x i32>
>   %14 = sub nsw <4 x i32> %10, %13
>   %15 = icmp sgt <4 x i32> %14, <i32 -1, i32 -1, i32 -1, i32 -1>
>   %16 = sub nsw <4 x i32> zeroinitializer, %14
>   %17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16
>   %18 = add nsw <4 x i32> %17, %7
>   %19 = getelementptr inbounds i8, i8* %a1, i64 8
>   %20 = bitcast i8* %19 to <4 x i8>*
>   %wide.load.2 = load <4 x i8>, <4 x i8>* %20, align 1, !tbaa !1
>   %21 = zext <4 x i8> %wide.load.2 to <4 x i32>
>   %22 = getelementptr inbounds i8, i8* %a2, i64 8
>   %23 = bitcast i8* %22 to <4 x i8>*
>   %wide.load12.2 = load <4 x i8>, <4 x i8>* %23, align 1, !tbaa !1
>   %24 = zext <4 x i8> %wide.load12.2 to <4 x i32>
>   %25 = sub nsw <4 x i32> %21, %24
>   %26 = icmp sgt <4 x i32> %25, <i32 -1, i32 -1, i32 -1, i32 -1>
>   %27 = sub nsw <4 x i32> zeroinitializer, %25
>   %28 = select <4 x i1> %26, <4 x i32> %25, <4 x i32> %27
>   %29 = add nsw <4 x i32> %28, %18
>   %30 = getelementptr inbounds i8, i8* %a1, i64 12
>   %31 = bitcast i8* %30 to <4 x i8>*
>   %wide.load.3 = load <4 x i8>, <4 x i8>* %31, align 1, !tbaa !1
>   %32 = zext <4 x i8> %wide.load.3 to <4 x i32>
>   %33 = getelementptr inbounds i8, i8* %a2, i64 12
>   %34 = bitcast i8* %33 to <4 x i8>*
>   %wide.load12.3 = load <4 x i8>, <4 x i8>* %34, align 1, !tbaa !1
>   %35 = zext <4 x i8> %wide.load12.3 to <4 x i32>
>   %36 = sub nsw <4 x i32> %32, %35
>   %37 = icmp sgt <4 x i32> %36, <i32 -1, i32 -1, i32 -1, i32 -1>
>   %38 = sub nsw <4 x i32> zeroinitializer, %36
>   %39 = select <4 x i1> %37, <4 x i32> %36, <4 x i32> %38
>   %40 = add nsw <4 x i32> %39, %29
>   %rdx.shuf = shufflevector <4 x i32> %40, <4 x i32> undef, <4 x i32>
>   <i32 2, i32 3, i32 undef, i32 undef>
>   %bin.rdx = add <4 x i32> %40, %rdx.shuf
>   %rdx.shuf13 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4
>   x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>   %bin.rdx14 = add <4 x i32> %bin.rdx, %rdx.shuf13
>   %41 = extractelement <4 x i32> %bin.rdx14, i32 0
>   ret i32 %41
> }
> 
> In the above LLVM IR, vectorizer has vectorized the SAD pattern with
> vector type v4xi8.
> For example on X86, we want to generate "v16xi8" version of *PSAD*
> instruction. However to do this we
> need to catch the full chain of vectorized absolute diffrence and the
> reduction sum.
> We found this is difficult to pattern match in codegen and thought to
> go the intrinsic way.

Put you need the vectorizer to vectorize in terms of v16xi8 regardless of how you match the sum. You say that the pattern is difficult to match, but do you mean matching it with the right or the wrong vectorization factor?

> We also realized
> That with intrinsic we can also use SLP to vectorize the unrolled
> pattern if required. Then we simplified the
> SAD intrinsic operation into two intrinsic which can be utilized to
> generate "SAD" operation as well as in
> different context. For example "hsum",  as mentioned by Bruno, can be
> used in context of POPCNT operation.
> 
> > No; vectorization is not special in that regard, and needs to be
> > semantics
> > preserving. We have 'fast math' flags for floating-point
> > operations, which
> > help there, but there are no such flags for integer operations.
> Do we need ordering definition for integer ADD operation?
> 
> > Do the underlying hardware operations on x86 (etc.) really not
> > define an
> > ordering for the horizontal adds?
>  X86 *PSAD*,  which is for integer type does the horizontal add on
>  intermediate result, does not define
> ordering.

Do you mean that the intermediate accumulator has a larger number of bits than each individual vector element?

 -Hal

> 
> Regards,
> Shahid
> 
> > -----Original Message-----
> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Wednesday, September 30, 2015 3:42 AM
> > To: Shahid, Asghar-ahmad
> > Cc: llvm-commits at lists.llvm.org;
> > reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org; renato
> > golin;
> > james molloy
> > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > corresponding SDNodes for horizontal sum operation.
> > 
> > ----- Original Message -----
> > > From: "Asghar-ahmad via llvm-commits Shahid"
> > > <llvm-commits at lists.llvm.org>
> > > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> > > "renato
> > golin" <renato.golin at linaro.org>, "james molloy"
> > > <james.molloy at arm.com>
> > > Cc: llvm-commits at lists.llvm.org
> > > Sent: Sunday, September 27, 2015 11:42:46 PM
> > > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > corresponding SDNodes for horizontal sum operation.
> > >
> > > Hi Hal,
> > >
> > > Response inlined.
> > >
> > > Regards,
> > > Shahid
> > >
> > > > -----Original Message-----
> > > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > > > Sent: Saturday, September 26, 2015 2:58 AM
> > > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> > > > hfinkel at anl.gov;
> > > > james.molloy at arm.com
> > > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > > > commits at lists.llvm.org
> > > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> > > > and
> > > > corresponding SDNodes for horizontal sum operation.
> > > >
> > > > hfinkel added a comment.
> > > >
> > > > I need to take a step back here; why are we doing this again?
> > > >
> > > > I read again the RFC thread
> > > > (http://lists.llvm.org/pipermail/llvm-dev/2015-
> > > > May/085078.html), and it ended with the following (from
> > > > Renato):
> > > >
> > > > > > BTW, now my plan is to just add the two intrinsics for
> > > > > > 'absolute
> > > > difference'
> > > >
> > > > >
> > > >
> > > > > >  and 'horizontal add'.
> > > >
> > > > >
> > > >
> > > > >
> > > >
> > > > > That's ok, as long as they're impossible to represent in
> > > > > plain IR.
> > > >
> > > >
> > > > and I think that we were all in agreement on this point. But
> > > > now I'm
> > > > not sure you've demonstrated the prerequisite.
> > > Earlier to this RFC there was a discussion around a patch
> > > (http://reviews.llvm.org/D9029)
> > > regarding llvm.sad intrinsic which highlights the need to have
> > > llvm.hsum and friend intrinsics. This RFC was a result of that
> > > discussion.
> > >
> > > > The underlying operations here (and
> > > > in http://reviews.llvm.org/D10867) seem like they are
> > > > representable
> > > > using IR (as demonstrated by the fact that you provide
> > > > potentially-equivalent IR sequences in the documentation),
> > > > except
> > > > for the ordering freedom here.
> > >
> > > Of course the logical expansion of the underlying operation is
> > > trivially representable but not the pattern matching in codegen.
> > > Actually pattern matching is complex and may well be fragile in
> > > codegen due to the long chain of generic IR to be matched.
> > > Moreover movements of truncate/extends ops in DAG increase the
> > > complexity of matching further.
> > 
> > It might, but it is not clearly bad enough to require an
> > alternative solution.
> > 
> > In my experience, the patterns that get difficult to match at the
> > SDAG level
> > are those that consist, in whole or in part, of operations that are
> > illegal (either
> > involving illegal types or illegal operations). This becomes
> > difficult because
> > you either need to match the post-legalized nodes (which might
> > involve
> > undoing scalarization, etc.) or you need to match the nodes
> > pre-legalization
> > (where you cannot depend on the canonicalization that restricts the
> > state
> > space later). Here, however, we seem to be dealing with patterns of
> > otherwise legal nodes, on otherwise legal vector types, and while
> > the
> > matching might not be trivial, it is not obviously difficult in
> > *ISelLowering.cpp
> > either.
> > 
> > The other issue that I've run into with SDAG pattern matching is
> > the fact that
> > it is basic-block local, and some of the operations you want to
> > match may
> > have been CSE'd or otherwise hoisted into dominating blocks. We
> > generally
> > put code in CGP to deal with these kinds of things.
> > 
> > One advantage of putting the code into SDAG pattern matching is
> > that the
> > patterns get used regardless of where the code originates
> > (vectorizers or
> > manually-written vector code). This is important.
> > 
> > >
> > > >
> > > > And this, I fear, is where we run into trouble. The thing that
> > > > is
> > > > not representable in the IR is that the order of operations in
> > > > the
> > > > horizontal sum is undefined, and at the same time, signed
> > > > overflow
> > > > is undefined. This cannot be represented in the IR because nsw
> > > > adds
> > > > don't reassociate, and thus, there's no way to represent the
> > > > sequence of nsw adds such that they can be reassociated while
> > > > retaining their nsw property. But the problem is that, because
> > > > this
> > > > freedom cannot be represented in the IR, we also can't generate
> > > > it
> > > > from IR in a semantics-preserving way; and, thus, it would not
> > > > be
> > > > legal to generate it in the vectorizers.
> > >
> > > As proposed, we are not going to generate it always in vectorizer
> > > but
> > > only based on target checks, can't we also check for ordering
> > > relaxation there?
> > > Or fix the ordering
> > > constraint by some other means.
> > 
> > No; vectorization is not special in that regard, and needs to be
> > semantics
> > preserving. We have 'fast math' flags for floating-point
> > operations, which
> > help there, but there are no such flags for integer operations.
> > 
> > Do the underlying hardware operations on x86 (etc.) really not
> > define an
> > ordering for the horizontal adds?
> > 
> > >
> > > >
> > > > Thus, this change does not seem right, and approving
> > > > http://reviews.llvm.org/D10867 seems like a mistake as well. We
> > > > could certainly fix the definition here to make it exactly
> > > > representable in the IR, but then what's the point of doing so?
> > > >
> > > > In the RFC, you mentioned cost modeling as a major motivation
> > > > for
> > > > adding intrinsics, but that seems like an unrelated problem (at
> > > > least in part).
> > >
> > > In RFC, I had also proposed llvm.sad intrinsic which I thought
> > > would
> > > make cost modelling easier for those targets which support sad
> > > operation natively. Hence The motivation. But Renato's
> > > explanation
> > > clarified the things related to cost modeling. However the
> > > complexity
> > > of pattern matching was always a bigger issue.
> > 
> > Perhaps, but it is localized. And matching adds and subtracts plus
> > shuffles
> > does not seem like it should be unusually painful.
> > 
> > >
> > > > During
> > > > vectorization, we can use special interfaces to estimate the
> > > > cost of
> > > > complex patterns. In fact, we already have an interface for
> > > > reductions:
> > > > TTI.getReductionCost. There is a second relevant code model:
> > > > That
> > > > used by the unroller and inliner. Vectorization happens after
> > > > inlining, so that interaction is not really relevant, but
> > > > partial
> > > > unrolling happens after vectorization, and so the cost model
> > > > there
> > > > might want to understand that a complex sequence of shuffles,
> > > > extracts and adds has a
> > > > disproportionately-
> > > > low cost. The same is true of the inliner if the input IR uses
> > > > vector types and initially contains these operations, but even
> > > > in
> > > > that case, you'd not want to canonicalize on the intrinsics too
> > > > early in case other optimizations remove the need for most of
> > > > the
> > > > operations. Thus, in the end, you need
> > > > pattern-
> > > > matching code near the end of the pipeline anyway to account
> > > > for
> > > > input IR directly containing relevant operations on vector
> > > > types.
> > > >
> > > > In short, I don't understand why we're going in this direction.
> > > > You
> > > > can match these operations in the backend, and you can match
> > > > them in
> > > > the cost-model code. If we can't do the latter, then we should
> > > > fix
> > > > that. And we already have special interfaces for vectorization
> > > > costs
> > > > for complex operations such as this.
> > > >
> > > >
> > > > http://reviews.llvm.org/D10964
> > > >
> > > >
> > >
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> > >
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory