[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
Hal Finkel via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 25 15:50:37 PST 2015
Hi Cong,
Thanks!
Based on this, do we need the @llvm.[us]absdiff intrinsics, or should we back them out?
-Hal
----- Original Message -----
> From: "Cong Hou" <congh at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>, "llvm-commits" <llvm-commits at lists.llvm.org>
> Cc: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>,
> "chandlerc" <chandlerc at gmail.com>, "David Li" <davidxl at google.com>
> Sent: Wednesday, November 25, 2015 4:54:08 PM
> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
>
> I think this patch is one of several aimed at detecting the SAD
> pattern in the loop vectorizer. The patch in
> http://reviews.llvm.org/D9029 tries to introduce a new operation and
> an intrinsic for it, and I think this patch is an intermediate step.
>
> However, instead of detecting SAD during loop vectorization, I think
> it is feasible to do it during instruction selection. The only
> missing information is which operation in the loop is a reduction.
> After a reduction instruction is vectorized, we actually don't care
> how elements are organized in the reduction result vector; we only
> need to make sure that reducing the elements of the result gives the
> same value as reducing all elements of both operands. This gives us
> enough freedom to select better instructions, since some instructions
> already do partial reduction, like psadbw and pmaddwd on X86.
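>
> For reference, a minimal sketch of the kind of scalar SAD kernel in
> question (illustrative code only, not taken from any benchmark or
> patch):
>
>   #include <cstdint>
>   #include <cstdlib>
>
>   // Scalar SAD kernel: each iteration widens two i8 loads, subtracts,
>   // takes the absolute value, and accumulates into an i32 reduction.
>   int sad(const uint8_t *a, const uint8_t *b, int n) {
>     int sum = 0;
>     for (int i = 0; i < n; ++i)
>       sum += std::abs(int(a[i]) - int(b[i]));
>     return sum;
>   }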
>
> The problem is how to detect reduction instructions during
> instruction selection. My opinion is that we can annotate reduction
> phis during loop vectorization using metadata. Then, just before
> lowering, we use those annotated reduction phis to find the real
> reduction operations (we need to make sure it is safe to recover
> reduction instructions from reduction phis, given that there are many
> transformations after loop vectorization) and mark them in
> SDNodeFlags, which can be checked during DAG combining.
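>
> A minimal sketch of what the annotation side might look like (the
> metadata kind string below is hypothetical and purely for
> illustration; setMetadata and MDNode::get are the standard APIs):
>
>   #include "llvm/IR/Instructions.h"
>   #include "llvm/IR/Metadata.h"
>   using namespace llvm;
>
>   // Hypothetical: tag a reduction phi produced by the loop vectorizer
>   // so the reduction can be recognized again just before lowering.
>   static void markReductionPhi(PHINode *Phi) {
>     Phi->setMetadata("llvm.vectorizer.reduction",
>                      MDNode::get(Phi->getContext(), {}));
>   }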
>
> I've made a patch based on this idea and managed to detect the SAD
> pattern from either source code or IR:
>
> http://reviews.llvm.org/D14840
>
> The cost model is another concern when detecting the SAD pattern. On
> SSE2, we get the most benefit from psadbw when the loop is vectorized
> with a factor of at least 16. However, this is not possible today for
> two reasons:
>
> 1. The vectorization factor is currently determined by the widest
> type in the loop by default. For the SAD loop this means VF=4 on
> SSE2. With the patch in http://reviews.llvm.org/D8943 we now have a
> switch to determine VF using the smallest type instead.
>
> 2. The cost model may prevent VF=16 from being used. This is due to
> the large costs of promoting/demoting vectors of integers. I am
> working on this issue now, and the patch under review is
> http://reviews.llvm.org/D14588. Once it is approved, I will go ahead
> and update X86's cost table for vector conversions. I think this will
> let the loop vectorizer choose VF=16 in our case above (see the
> lane-count sketch below). We can also consider turning on the switch
> mentioned above by default once the cost model is improved.
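>
> For concreteness, the lane-count arithmetic behind those numbers on
> 128-bit SSE2 vectors (a simple illustration, not code from any of the
> patches):
>
>   // The SAD kernel loads i8 elements but accumulates into an i32, so:
>   //   widest type   (i32): 128 / 32 = 4 lanes  -> VF = 4
>   //   smallest type (i8):  128 /  8 = 16 lanes -> VF = 16
>   // psadbw consumes 16 i8 elements per instruction, which is why
>   // VF = 16 is where it pays off most.
>   constexpr int VectorBits = 128;
>   constexpr int VFWidest   = VectorBits / 32; // 4
>   constexpr int VFSmallest = VectorBits / 8;  // 16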
>
> With the same method we can add more reduction pattern recognition to
> LLVM.
>
> Comments are welcome!
>
>
> thanks,
> Cong
>
>
> On Wed, Nov 25, 2015 at 2:29 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Cong,
> >
> > Could you please comment on what you think we should do here? It
> > seems like we've now had several threads on this topic, and I'm
> > not sure what the consensus is right now.
> >
> > -Hal
> >
> > ----- Original Message -----
> >> From: "Hal Finkel via llvm-commits" <llvm-commits at lists.llvm.org>
> >> To: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
> >> Cc: llvm-commits at lists.llvm.org,
> >> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org
> >> Sent: Tuesday, October 27, 2015 11:59:53 PM
> >> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
> >> corresponding SDNodes for horizontal sum operation.
> >>
> >> ----- Original Message -----
> >> > From: "Asghar-ahmad via llvm-commits Shahid"
> >> > <llvm-commits at lists.llvm.org>
> >> > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
> >> > "renato golin" <renato.golin at linaro.org>, "james molloy"
> >> > <james.molloy at arm.com>
> >> > Cc: llvm-commits at lists.llvm.org
> >> > Sent: Sunday, September 27, 2015 11:42:46 PM
> >> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> >> > and
> >> > corresponding SDNodes for horizontal sum operation.
> >> >
> >> > Hi Hal,
> >> >
> >> > Response inlined.
> >> >
> >> > Regards,
> >> > Shahid
> >> >
> >> > > -----Original Message-----
> >> > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
> >> > > Sent: Saturday, September 26, 2015 2:58 AM
> >> > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
> >> > > hfinkel at anl.gov;
> >> > > james.molloy at arm.com
> >> > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
> >> > > commits at lists.llvm.org
> >> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
> >> > > and
> >> > > corresponding SDNodes for horizontal sum operation.
> >> > >
> >> > > hfinkel added a comment.
> >> > >
> >> > > I need to take a step back here; why are we doing this again?
> >> > >
> >> > > I read again the RFC thread
> >> > > (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html),
> >> > > and it ended with the following (from Renato):
> >> > >
> >> > > > > BTW, now my plan is to just add the two intrinsics for
> >> > > > > 'absolute difference' and 'horizontal add'.
> >> > > >
> >> > > > That's ok, as long as they're impossible to represent in
> >> > > > plain IR.
> >> > >
> >> > > and I think that we were all in agreement on this point. But
> >> > > now I'm not sure you've demonstrated the prerequisite.
> >> > Prior to this RFC there was a discussion around a patch
> >> > (http://reviews.llvm.org/D9029) regarding an llvm.sad intrinsic,
> >> > which highlighted the need for llvm.hsum and friends. This RFC
> >> > was a result of that discussion.
> >> >
> >> > > The underlying operations here (and in
> >> > > http://reviews.llvm.org/D10867) seem like they are representable
> >> > > using IR (as demonstrated by the fact that you provide
> >> > > potentially-equivalent IR sequences in the documentation), except
> >> > > for the ordering freedom here.
> >> >
> >> > Of course the logical expansion of the underlying operation is
> >> > trivially representable, but the pattern matching in codegen is
> >> > not. The pattern matching is complex and may well be fragile in
> >> > codegen due to the long chain of generic IR that has to be
> >> > matched. Moreover, movement of truncate/extend ops in the DAG
> >> > increases the complexity of matching further.
> >>
> >> It might be fragile, but I suggest trying it so that the issues are
> >> clear. I don't see how you'd end up with a lot of truncates/extends
> >> in the middle of a horizontal add that we'd otherwise be able to
> >> match. It also makes the compiler more robust, because we'll be able
> >> to generate good code from hand-written patterns, not just from
> >> auto-vectorized code.
> >>
> >> >
> >> > >
> >> > > And this, I fear, is where we run into trouble. The thing that
> >> > > is not representable in the IR is that the order of operations
> >> > > in the horizontal sum is undefined and, at the same time, signed
> >> > > overflow is undefined. This cannot be represented in the IR
> >> > > because nsw adds don't reassociate, and thus there's no way to
> >> > > represent the sequence of nsw adds such that they can be
> >> > > reassociated while retaining their nsw property. But the problem
> >> > > is that, because this freedom cannot be represented in the IR,
> >> > > we also can't generate it from IR in a semantics-preserving way;
> >> > > and, thus, it would not be legal to generate it in the
> >> > > vectorizers.
> >> >
> >> > As proposed, we are not going to generate it unconditionally in
> >> > the vectorizer but only based on target checks; can't we also
> >> > check for ordering relaxation there, or fix the ordering
> >> > constraint by some other means?
> >>
> >> No; it is a semantics problem. The vectorizer would need to prove
> >> something it seems unlikely to be able to prove.
> >>
> >> -Hal
> >>
> >> >
> >> > >
> >> > > Thus, this change does not seem right, and approving
> >> > > http://reviews.llvm.org/D10867 seems like a mistake as well. We
> >> > > could certainly fix the definition here to make it exactly
> >> > > representable in the IR, but then what's the point of doing so?
> >> > >
> >> > > In the RFC, you mentioned cost modeling as a major motivation
> >> > > for adding intrinsics, but that seems like an unrelated problem
> >> > > (at least in part).
> >> >
> >> > In the RFC, I had also proposed an llvm.sad intrinsic, which I
> >> > thought would make cost modeling easier for those targets that
> >> > support the SAD operation natively; hence the motivation. But
> >> > Renato's explanation clarified the things related to cost
> >> > modeling. However, the complexity of pattern matching was always
> >> > the bigger issue.
> >> >
> >> > > During vectorization, we can use special interfaces to estimate
> >> > > the cost of complex patterns. In fact, we already have an
> >> > > interface for reductions: TTI.getReductionCost. There is a
> >> > > second relevant cost model: that used by the unroller and
> >> > > inliner. Vectorization happens after inlining, so that
> >> > > interaction is not really relevant, but partial unrolling
> >> > > happens after vectorization, and so the cost model there might
> >> > > want to understand that a complex sequence of shuffles, extracts
> >> > > and adds has a disproportionately-low cost. The same is true of
> >> > > the inliner if the input IR uses vector types and initially
> >> > > contains these operations, but even in that case, you'd not want
> >> > > to canonicalize on the intrinsics too early in case other
> >> > > optimizations remove the need for most of the operations. Thus,
> >> > > in the end, you need pattern-matching code near the end of the
> >> > > pipeline anyway to account for input IR directly containing
> >> > > relevant operations on vector types.
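> >> > >
> >> > > (As a rough sketch of the vectorizer-side query; the exact
> >> > > getReductionCost signature may differ between LLVM revisions, so
> >> > > treat this as illustrative only:)
> >> > >
> >> > >   #include "llvm/Analysis/TargetTransformInfo.h"
> >> > >   #include "llvm/IR/Instruction.h"
> >> > >   using namespace llvm;
> >> > >
> >> > >   // Cost of a non-pairwise (shuffle/split) integer-add reduction
> >> > >   // over the vector type VecTy, as the cost model sees it.
> >> > >   static unsigned addReductionCost(const TargetTransformInfo &TTI,
> >> > >                                    Type *VecTy) {
> >> > >     return TTI.getReductionCost(Instruction::Add, VecTy,
> >> > >                                 /*IsPairwiseForm=*/false);
> >> > >   }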
> >> > >
> >> > > In short, I don't understand why we're going in this direction.
> >> > > You can match these operations in the backend, and you can match
> >> > > them in the cost-model code. If we can't do the latter, then we
> >> > > should fix that. And we already have special interfaces for
> >> > > vectorization costs for complex operations such as this.
> >> > >
> >> > >
> >> > > http://reviews.llvm.org/D10964
> >> > >
> >> > >
> >> >
> >> > _______________________________________________
> >> > llvm-commits mailing list
> >> > llvm-commits at lists.llvm.org
> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> >> >
> >>
> >> --
> >> Hal Finkel
> >> Assistant Computational Scientist
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >> _______________________________________________
> >> llvm-commits mailing list
> >> llvm-commits at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> >>
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory