[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Sun Sep 27 21:42:46 PDT 2015

Hi Hal,

Response inlined.

Regards,
Shahid

> -----Original Message-----
> From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> Sent: Saturday, September 26, 2015 2:58 AM
> To: Shahid, Asghar-ahmad; renato.golin at linaro.org; hfinkel at anl.gov;
> james.molloy at arm.com
> Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> commits at lists.llvm.org
> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> corresponding SDNodes for horizontal sum operation.
> 
> hfinkel added a comment.
> 
> I need to take a step back here; why are we doing this again?
> 
> I read again the RFC thread (http://lists.llvm.org/pipermail/llvm-dev/2015-
> May/085078.html), and it ended with the following (from Renato):
> 
> > > BTW, now my plan is to just add the two intrinsics for 'absolute
> difference'
> 
> >
> 
> > >  and 'horizontal add'.
> 
> >
> 
> >
> 
> > That's ok, as long as they're impossible to represent in plain IR.
> 
> 
> and I think that we were all in agreement on this point. But now I'm not sure
> you've demonstrated the prerequisite.
Earlier to this RFC there was a discussion around a patch (http://reviews.llvm.org/D9029)
regarding llvm.sad intrinsic which highlights the need to have llvm.hsum and friend
intrinsics. This RFC was a result of that discussion.

> The underlying operations here (and
> in http://reviews.llvm.org/D10867) seem like they are representable using IR
> (as demonstrated by the fact that you provide potentially-equivalent IR
> sequences in the documentation), except for the ordering freedom here.

Of course the logical expansion of the underlying operation is trivially representable
but not the pattern matching in codegen. Actually pattern matching is complex and
may well be fragile in codegen due to the long chain of generic IR to be matched.
Moreover movements of truncate/extends ops in DAG increase the complexity of
matching further.

> 
> And this, I fear, is where we run into trouble. The thing that is not
> representable in the IR is that the order of operations in the horizontal sum is
> undefined, and at the same time, signed overflow is undefined. This cannot
> be represented in the IR because nsw adds don't reassociate, and thus,
> there's no way to represent the sequence of nsw adds such that they can be
> reassociated while retaining their nsw property. But the problem is that,
> because this freedom cannot be represented in the IR, we also can't
> generate it from IR in a semantics-preserving way; and, thus, it would not be
> legal to generate it in the vectorizers.

As proposed, we are not going to generate it always in vectorizer but only based
on target checks, can't we also check for ordering relaxation there? Or fix the ordering
constraint by some other means.

> 
> Thus, this change does not seem right, and approving
> http://reviews.llvm.org/D10867 seems like a mistake as well. We could
> certainly fix the definition here to make it exactly representable in the IR, but
> then what's the point of doing so?
> 
> In the RFC, you mentioned cost modeling as a major motivation for adding
> intrinsics, but that seems like an unrelated problem (at least in part). 

In RFC, I had also proposed llvm.sad intrinsic which I thought would make cost
modelling easier for those targets which support sad operation natively. Hence
The motivation. But Renato's explanation clarified the things related to cost
modeling. However the complexity of pattern matching was always a bigger
issue.

> During
> vectorization, we can use special interfaces to estimate the cost of complex
> patterns. In fact, we already have an interface for reductions:
> TTI.getReductionCost. There is a second relevant code model: That used by
> the unroller and inliner. Vectorization happens after inlining, so that
> interaction is not really relevant, but partial unrolling happens after
> vectorization, and so the cost model there might want to understand that a
> complex sequence of shuffles, extracts and adds has a disproportionately-
> low cost. The same is true of the inliner if the input IR uses vector types and
> initially contains these operations, but even in that case, you'd not want to
> canonicalize on the intrinsics too early in case other optimizations remove the
> need for most of the operations. Thus, in the end, you need pattern-
> matching code near the end of the pipeline anyway to account for input IR
> directly containing relevant operations on vector types.
> 
> In short, I don't understand why we're going in this direction. You can match
> these operations in the backend, and you can match them in the cost-model
> code. If we can't do the latter, then we should fix that. And we already have
> special interfaces for vectorization costs for complex operations such as this.
> 
> 
> http://reviews.llvm.org/D10964
> 
>