[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Tue Sep 29 15:11:40 PDT 2015

----- Original Message -----
> From: "Asghar-ahmad via llvm-commits Shahid" <llvm-commits at lists.llvm.org>
> To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "renato golin" <renato.golin at linaro.org>, "james molloy"
> <james.molloy at arm.com>
> Cc: llvm-commits at lists.llvm.org
> Sent: Sunday, September 27, 2015 11:42:46 PM
> Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
> 
> Hi Hal,
> 
> Response inlined.
> 
> Regards,
> Shahid
> 
> > -----Original Message-----
> > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> > Sent: Saturday, September 26, 2015 2:58 AM
> > To: Shahid, Asghar-ahmad; renato.golin at linaro.org; hfinkel at anl.gov;
> > james.molloy at arm.com
> > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> > commits at lists.llvm.org
> > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> > corresponding SDNodes for horizontal sum operation.
> > 
> > hfinkel added a comment.
> > 
> > I need to take a step back here; why are we doing this again?
> > 
> > I read again the RFC thread
> > (http://lists.llvm.org/pipermail/llvm-dev/2015-
> > May/085078.html), and it ended with the following (from Renato):
> > 
> > > > BTW, now my plan is to just add the two intrinsics for
> > > > 'absolute
> > difference'
> > 
> > >
> > 
> > > >  and 'horizontal add'.
> > 
> > >
> > 
> > >
> > 
> > > That's ok, as long as they're impossible to represent in plain
> > > IR.
> > 
> > 
> > and I think that we were all in agreement on this point. But now
> > I'm not sure
> > you've demonstrated the prerequisite.
> Earlier to this RFC there was a discussion around a patch
> (http://reviews.llvm.org/D9029)
> regarding llvm.sad intrinsic which highlights the need to have
> llvm.hsum and friend
> intrinsics. This RFC was a result of that discussion.
> 
> > The underlying operations here (and
> > in http://reviews.llvm.org/D10867) seem like they are representable
> > using IR
> > (as demonstrated by the fact that you provide
> > potentially-equivalent IR
> > sequences in the documentation), except for the ordering freedom
> > here.
> 
> Of course the logical expansion of the underlying operation is
> trivially representable
> but not the pattern matching in codegen. Actually pattern matching is
> complex and
> may well be fragile in codegen due to the long chain of generic IR to
> be matched.
> Moreover movements of truncate/extends ops in DAG increase the
> complexity of
> matching further.

It might, but it is not clearly bad enough to require an alternative solution.

In my experience, the patterns that get difficult to match at the SDAG level are those that consist, in whole or in part, of operations that are illegal (either involving illegal types or illegal operations). This becomes difficult because you either need to match the post-legalized nodes (which might involve undoing scalarization, etc.) or you need to match the nodes pre-legalization (where you cannot depend on the canonicalization that restricts the state space later). Here, however, we seem to be dealing with patterns of otherwise legal nodes, on otherwise legal vector types, and while the matching might not be trivial, it is not obviously difficult in *ISelLowering.cpp either.

The other issue that I've run into with SDAG pattern matching is the fact that it is basic-block local, and some of the operations you want to match may have been CSE'd or otherwise hoisted into dominating blocks. We generally put code in CGP to deal with these kinds of things.

One advantage of putting the code into SDAG pattern matching is that the patterns get used regardless of where the code originates (vectorizers or manually-written vector code). This is important.

> 
> > 
> > And this, I fear, is where we run into trouble. The thing that is
> > not
> > representable in the IR is that the order of operations in the
> > horizontal sum is
> > undefined, and at the same time, signed overflow is undefined. This
> > cannot
> > be represented in the IR because nsw adds don't reassociate, and
> > thus,
> > there's no way to represent the sequence of nsw adds such that they
> > can be
> > reassociated while retaining their nsw property. But the problem is
> > that,
> > because this freedom cannot be represented in the IR, we also can't
> > generate it from IR in a semantics-preserving way; and, thus, it
> > would not be
> > legal to generate it in the vectorizers.
> 
> As proposed, we are not going to generate it always in vectorizer but
> only based
> on target checks, can't we also check for ordering relaxation there?
> Or fix the ordering
> constraint by some other means.

No; vectorization is not special in that regard, and needs to be semantics preserving. We have 'fast math' flags for floating-point operations, which help there, but there are no such flags for integer operations.

Do the underlying hardware operations on x86 (etc.) really not define an ordering for the horizontal adds?

> 
> > 
> > Thus, this change does not seem right, and approving
> > http://reviews.llvm.org/D10867 seems like a mistake as well. We
> > could
> > certainly fix the definition here to make it exactly representable
> > in the IR, but
> > then what's the point of doing so?
> > 
> > In the RFC, you mentioned cost modeling as a major motivation for
> > adding
> > intrinsics, but that seems like an unrelated problem (at least in
> > part).
> 
> In RFC, I had also proposed llvm.sad intrinsic which I thought would
> make cost
> modelling easier for those targets which support sad operation
> natively. Hence
> The motivation. But Renato's explanation clarified the things related
> to cost
> modeling. However the complexity of pattern matching was always a
> bigger
> issue.

Perhaps, but it is localized. And matching adds and subtracts plus shuffles does not seem like it should be unusually painful.

> 
> > During
> > vectorization, we can use special interfaces to estimate the cost
> > of complex
> > patterns. In fact, we already have an interface for reductions:
> > TTI.getReductionCost. There is a second relevant code model: That
> > used by
> > the unroller and inliner. Vectorization happens after inlining, so
> > that
> > interaction is not really relevant, but partial unrolling happens
> > after
> > vectorization, and so the cost model there might want to understand
> > that a
> > complex sequence of shuffles, extracts and adds has a
> > disproportionately-
> > low cost. The same is true of the inliner if the input IR uses
> > vector types and
> > initially contains these operations, but even in that case, you'd
> > not want to
> > canonicalize on the intrinsics too early in case other
> > optimizations remove the
> > need for most of the operations. Thus, in the end, you need
> > pattern-
> > matching code near the end of the pipeline anyway to account for
> > input IR
> > directly containing relevant operations on vector types.
> > 
> > In short, I don't understand why we're going in this direction. You
> > can match
> > these operations in the backend, and you can match them in the
> > cost-model
> > code. If we can't do the latter, then we should fix that. And we
> > already have
> > special interfaces for vectorization costs for complex operations
> > such as this.
> > 
> > 
> > http://reviews.llvm.org/D10964
> > 
> > 
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory