[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Cong Hou via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 25 16:34:07 PST 2015


On Wed, Nov 25, 2015 at 4:16 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Cong Hou" <congh at google.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>,
>> "chandlerc" <chandlerc at gmail.com>, "David Li" <davidxl at google.com>, "llvm-commits" <llvm-commits at lists.llvm.org>
>> Sent: Wednesday, November 25, 2015 6:10:48 PM
>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
>>
>> On Wed, Nov 25, 2015 at 3:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> > Hi Cong,
>> >
>> > Thanks!
>> >
>> > Based on this, do we need the @llvm.[us]absdiff intrinsics, or
>> > should we back them out?
>>
>> If we don't do pattern recognition at the IR level (in the loop
>> vectorizer), I don't think those intrinsics (absdiff and hsum) are
>> useful. But I am not sure whether they are useful in other cases.
>
> Given that they were added specifically for this use case, and it seems that was premature, let's back them out (we can always add them back should other use cases arise).

Sounds good. Thanks!

Cong

>
>  -Hal
>
>>
>> Cong
>>
>> >
>> >  -Hal
>> >
>> > ----- Original Message -----
>> >> From: "Cong Hou" <congh at google.com>
>> >> To: "Hal Finkel" <hfinkel at anl.gov>, "llvm-commits"
>> >> <llvm-commits at lists.llvm.org>
>> >> Cc: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>> >> "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>,
>> >> "chandlerc" <chandlerc at gmail.com>, "David Li" <davidxl at google.com>
>> >> Sent: Wednesday, November 25, 2015 4:54:08 PM
>> >> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>> >> corresponding SDNodes for horizontal sum operation.
>> >>
>> >> I think this patch is one of several aimed at detecting the SAD
>> >> pattern in the loop vectorizer. The patch in
>> >> http://reviews.llvm.org/D9029
>> >> tries to introduce a new operation and an intrinsic for it, and I
>> >> think this patch is an intermediate step.
>> >>
>> >> However, instead of detecting SAD during loop vectorization, I think
>> >> it is feasible to do it during instruction selection. The only
>> >> missing information is which operation in the loop is a reduction.
>> >> After a reduction instruction is vectorized, we don't actually care
>> >> how elements are organized in the reduction result vector; we only
>> >> need to make sure that reducing the elements of the result is
>> >> identical to reducing all elements of both operands. This gives us
>> >> enough freedom to select better instructions, as some instructions
>> >> already do partial reduction, like psadbw and pmaddwd on X86.
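>> >>
>> >> For illustration, here is a minimal IR sketch (illustrative only,
>> >> not taken from any of the patches) of the partial reduction psadbw
>> >> gives us on SSE2: the intrinsic already reduces each group of eight
>> >> absolute differences, so only the two 64-bit partial sums are left,
>> >> and their order does not matter:
>> >>
>> >> declare <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8>, <16 x i8>)
>> >>
>> >> define i64 @sad16(<16 x i8> %a, <16 x i8> %b) {
>> >>   ; Each i64 lane holds the sum of |a[i] - b[i]| over 8 bytes.
>> >>   %p = call <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8> %a, <16 x i8> %b)
>> >>   %lo = extractelement <2 x i64> %p, i32 0
>> >>   %hi = extractelement <2 x i64> %p, i32 1
>> >>   ; The final reduction does not depend on which lane each byte fed.
>> >>   %sum = add i64 %lo, %hi
>> >>   ret i64 %sum
>> >> }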
>> >>
>> >> The problem is how to detect reduction instructions during
>> >> instruction selection. My opinion is that we can annotate reduction
>> >> phis during loop vectorization using metadata. Then, just before
>> >> lowering, we use those annotated reduction phis to find the real
>> >> reduction operations (we need to make sure it is safe to detect
>> >> reduction instructions from reduction phis, given that there are
>> >> many transformations after loop vectorization), and mark them in
>> >> SDNodeFlags, which can be checked during instruction combining.
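>> >>
>> >> As a rough sketch of the idea (the !reduction metadata kind below is
>> >> hypothetical, used only for illustration; n is assumed to be a
>> >> positive multiple of 4), the vectorizer would tag the reduction phi,
>> >> and SelectionDAG building would then mark the add chain rooted at it:
>> >>
>> >> define <4 x i32> @partial_sums(i32* %x, i32 %n) {
>> >> entry:
>> >>   br label %vector.body
>> >>
>> >> vector.body:
>> >>   %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
>> >>   ; Reduction phi annotated by the loop vectorizer.
>> >>   %vec.phi = phi <4 x i32> [ zeroinitializer, %entry ],
>> >>                            [ %red, %vector.body ], !reduction !0
>> >>   %gep = getelementptr inbounds i32, i32* %x, i32 %index
>> >>   %vptr = bitcast i32* %gep to <4 x i32>*
>> >>   %wide = load <4 x i32>, <4 x i32>* %vptr, align 4
>> >>   %red = add <4 x i32> %vec.phi, %wide
>> >>   %index.next = add i32 %index, 4
>> >>   %done = icmp eq i32 %index.next, %n
>> >>   br i1 %done, label %exit, label %vector.body
>> >>
>> >> exit:
>> >>   ; The caller still reduces the four lanes; any order is fine.
>> >>   ret <4 x i32> %red
>> >> }
>> >>
>> >> !0 = !{!"add"}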
>> >>
>> >> I've made a patch based on this idea and managed to detect the SAD
>> >> pattern from either source code or IR:
>> >>
>> >> http://reviews.llvm.org/D14840
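>> >>
>> >> For reference, the scalar pattern being targeted is roughly the
>> >> following (a hand-written sketch of sum += abs(a[i] - b[i]) with i8
>> >> inputs widened to i32, assuming n > 0; it is not code taken from
>> >> D14840):
>> >>
>> >> define i32 @sad_loop(i8* %a, i8* %b, i32 %n) {
>> >> entry:
>> >>   br label %loop
>> >>
>> >> loop:
>> >>   %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
>> >>   %sum = phi i32 [ 0, %entry ], [ %sum.next, %loop ]  ; reduction phi
>> >>   %pa = getelementptr inbounds i8, i8* %a, i32 %i
>> >>   %pb = getelementptr inbounds i8, i8* %b, i32 %i
>> >>   %va = load i8, i8* %pa
>> >>   %vb = load i8, i8* %pb
>> >>   %za = zext i8 %va to i32
>> >>   %zb = zext i8 %vb to i32
>> >>   %d = sub i32 %za, %zb
>> >>   %neg = sub i32 0, %d
>> >>   %pos = icmp sgt i32 %d, -1
>> >>   %abs = select i1 %pos, i32 %d, i32 %neg
>> >>   %sum.next = add i32 %sum, %abs                      ; reduction add
>> >>   %i.next = add i32 %i, 1
>> >>   %exit = icmp eq i32 %i.next, %n
>> >>   br i1 %exit, label %done, label %loop
>> >>
>> >> done:
>> >>   ret i32 %sum.next
>> >> }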
>> >>
>> >> The cost model is another concern when detecting the SAD pattern. On
>> >> SSE2, we can get the most benefit from psadbw with the loop being
>> >> unrolled at least 16 times. However, this is impossible now for two
>> >> reasons:
>> >>
>> >> 1. The vectorization factor is currently determined by the widest
>> >> type in the loop by default. For the SAD loop this means VF=4 on
>> >> SSE2. With the patch in http://reviews.llvm.org/D8943 we now have a
>> >> switch to determine the VF using the smallest type instead.
>> >>
>> >> 2. The cost model may prevent VF=16 from being chosen. This is due
>> >> to the large cost of promoting/demoting vectors of integers. I am
>> >> working on this issue now, and the patch under review is
>> >> http://reviews.llvm.org/D14588. Once it is approved, I will go ahead
>> >> and update X86's cost table for vector conversions. I think this
>> >> will let the loop vectorizer choose 16 in the case above. We can
>> >> also consider turning on the switch mentioned above by default once
>> >> the cost model is improved.
>> >>
>> >> With the same method we can add more reduction pattern recognition
>> >> in LLVM.
>> >>
>> >> Comments are welcome!
>> >>
>> >>
>> >> thanks,
>> >> Cong
>> >>
>> >>
>> >> On Wed, Nov 25, 2015 at 2:29 AM, Hal Finkel <hfinkel at anl.gov>
>> >> wrote:
>> >> > Cong,
>> >> >
>> >> > Could you please comment on what you think we should do here? It
>> >> > seems like we've now had several threads on this topic, and I'm
>> >> > not sure what the consensus is right now.
>> >> >
>> >> >  -Hal
>> >> >
>> >> > ----- Original Message -----
>> >> >> From: "Hal Finkel via llvm-commits"
>> >> >> <llvm-commits at lists.llvm.org>
>> >> >> To: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>> >> >> Cc: llvm-commits at lists.llvm.org,
>> >> >> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org
>> >> >> Sent: Tuesday, October 27, 2015 11:59:53 PM
>> >> >> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>> >> >> and
>> >> >> corresponding SDNodes for horizontal sum operation.
>> >> >>
>> >> >> ----- Original Message -----
>> >> >> > From: "Asghar-ahmad via llvm-commits Shahid"
>> >> >> > <llvm-commits at lists.llvm.org>
>> >> >> > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>> >> >> > "renato golin" <renato.golin at linaro.org>, "james molloy"
>> >> >> > <james.molloy at arm.com>
>> >> >> > Cc: llvm-commits at lists.llvm.org
>> >> >> > Sent: Sunday, September 27, 2015 11:42:46 PM
>> >> >> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>> >> >> > and
>> >> >> > corresponding SDNodes for horizontal sum operation.
>> >> >> >
>> >> >> > Hi Hal,
>> >> >> >
>> >> >> > Response inlined.
>> >> >> >
>> >> >> > Regards,
>> >> >> > Shahid
>> >> >> >
>> >> >> > > -----Original Message-----
>> >> >> > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
>> >> >> > > Sent: Saturday, September 26, 2015 2:58 AM
>> >> >> > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
>> >> >> > > hfinkel at anl.gov;
>> >> >> > > james.molloy at arm.com
>> >> >> > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
>> >> >> > > commits at lists.llvm.org
>> >> >> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics
>> >> >> > > 'hsum*'
>> >> >> > > and
>> >> >> > > corresponding SDNodes for horizontal sum operation.
>> >> >> > >
>> >> >> > > hfinkel added a comment.
>> >> >> > >
>> >> >> > > I need to take a step back here; why are we doing this again?
>> >> >> > >
>> >> >> > > I read the RFC thread again
>> >> >> > > (http://lists.llvm.org/pipermail/llvm-dev/2015-May/085078.html),
>> >> >> > > and it ended with the following (from Renato):
>> >> >> > >
>> >> >> > > > > BTW, now my plan is to just add the two intrinsics for
>> >> >> > > > > 'absolute difference' and 'horizontal add'.
>> >> >> > > >
>> >> >> > > > That's ok, as long as they're impossible to represent in
>> >> >> > > > plain IR.
>> >> >> > >
>> >> >> > >
>> >> >> > > and I think that we were all in agreement on this point. But
>> >> >> > > now I'm not sure you've demonstrated the prerequisite.
>> >> >> > Prior to this RFC there was a discussion around a patch
>> >> >> > (http://reviews.llvm.org/D9029) regarding the llvm.sad
>> >> >> > intrinsic which highlighted the need to have llvm.hsum and
>> >> >> > related intrinsics. This RFC was a result of that discussion.
>> >> >> >
>> >> >> > > The underlying operations here (and in
>> >> >> > > http://reviews.llvm.org/D10867) seem like they are
>> >> >> > > representable using IR (as demonstrated by the fact that you
>> >> >> > > provide potentially-equivalent IR sequences in the
>> >> >> > > documentation), except for the ordering freedom here.
>> >> >> >
>> >> >> > Of course the logical expansion of the underlying operation is
>> >> >> > trivially representable, but not the pattern matching in
>> >> >> > codegen. Actually, the pattern matching is complex and may well
>> >> >> > be fragile in codegen due to the long chain of generic IR to be
>> >> >> > matched. Moreover, movements of truncate/extend ops in the DAG
>> >> >> > increase the complexity of matching further.
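>> >> >> >
>> >> >> > For example, the plain-IR expansion of a horizontal sum of a
>> >> >> > <4 x i32> is roughly the shuffle/add ladder below (an
>> >> >> > illustrative sketch of the kind of "equivalent IR" the
>> >> >> > documentation shows); this whole chain, possibly interleaved
>> >> >> > with extends/truncates, is what codegen would have to re-match:
>> >> >> >
>> >> >> > define i32 @hsum_v4i32(<4 x i32> %v) {
>> >> >> >   %s1 = shufflevector <4 x i32> %v, <4 x i32> undef,
>> >> >> >                       <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
>> >> >> >   %a1 = add <4 x i32> %v, %s1
>> >> >> >   %s2 = shufflevector <4 x i32> %a1, <4 x i32> undef,
>> >> >> >                       <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>> >> >> >   %a2 = add <4 x i32> %a1, %s2
>> >> >> >   %r  = extractelement <4 x i32> %a2, i32 0
>> >> >> >   ret i32 %r
>> >> >> > }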
>> >> >>
>> >> >> It might be fragile, but I suggest trying so that the issues are
>> >> >> clear. I don't see how you'd end up with a lot of
>> >> >> truncate/extends in the middle of a horizontal add that we'd
>> >> >> otherwise be able to match. It also makes the compiler more
>> >> >> robust because we'll be able to generate good code from
>> >> >> hand-written patterns, not just from auto-vectorized code.
>> >> >>
>> >> >> >
>> >> >> > >
>> >> >> > > And this, I fear, is where we run into trouble. The thing
>> >> >> > > that is not representable in the IR is that the order of
>> >> >> > > operations in the horizontal sum is undefined, and at the
>> >> >> > > same time, signed overflow is undefined. This cannot be
>> >> >> > > represented in the IR because nsw adds don't reassociate,
>> >> >> > > and thus, there's no way to represent the sequence of nsw
>> >> >> > > adds such that they can be reassociated while retaining their
>> >> >> > > nsw property. But the problem is that, because this freedom
>> >> >> > > cannot be represented in the IR, we also can't generate it
>> >> >> > > from IR in a semantics-preserving way; and, thus, it would
>> >> >> > > not be legal to generate it in the vectorizers.
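>> >> >> > >
>> >> >> > > A concrete illustration: with i8 values 100, 50 and -60 the
>> >> >> > > total is 90 either way, but the left-to-right order has an
>> >> >> > > intermediate of 150, which overflows i8, so that nsw chain is
>> >> >> > > poison while the reassociated one is not. Since the IR must
>> >> >> > > spell out one fixed order, it cannot express "any order, and
>> >> >> > > every order keeps nsw":
>> >> >> > >
>> >> >> > > define i8 @left_to_right() {
>> >> >> > >   %t1 = add nsw i8 100, 50   ; 150 does not fit in i8: poison
>> >> >> > >   %t2 = add nsw i8 %t1, -60
>> >> >> > >   ret i8 %t2
>> >> >> > > }
>> >> >> > >
>> >> >> > > define i8 @reassociated() {
>> >> >> > >   %u1 = add nsw i8 50, -60   ; -10, no overflow
>> >> >> > >   %u2 = add nsw i8 100, %u1  ; 90, no overflow
>> >> >> > >   ret i8 %u2
>> >> >> > > }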
>> >> >> >
>> >> >> > As proposed, we are not going to generate it in the vectorizer
>> >> >> > unconditionally, but only based on target checks; can't we also
>> >> >> > check for ordering relaxation there? Or fix the ordering
>> >> >> > constraint by some other means?
>> >> >>
>> >> >> No; it is a semantics problem. The vectorizer would need to prove
>> >> >> something it seems unlikely to be able to prove.
>> >> >>
>> >> >>  -Hal
>> >> >>
>> >> >> >
>> >> >> > >
>> >> >> > > Thus, this change does not seem right, and approving
>> >> >> > > http://reviews.llvm.org/D10867 seems like a mistake as well.
>> >> >> > > We could certainly fix the definition here to make it exactly
>> >> >> > > representable in the IR, but then what's the point of doing
>> >> >> > > so?
>> >> >> > >
>> >> >> > > In the RFC, you mentioned cost modeling as a major motivation
>> >> >> > > for adding intrinsics, but that seems like an unrelated
>> >> >> > > problem (at least in part).
>> >> >> >
>> >> >> > In the RFC, I had also proposed the llvm.sad intrinsic, which I
>> >> >> > thought would make cost modelling easier for those targets that
>> >> >> > support the SAD operation natively; hence the motivation. But
>> >> >> > Renato's explanation clarified the things related to cost
>> >> >> > modeling. However, the complexity of pattern matching was
>> >> >> > always a bigger issue.
>> >> >> >
>> >> >> > > During vectorization, we can use special interfaces to
>> >> >> > > estimate the cost of complex patterns. In fact, we already
>> >> >> > > have an interface for reductions: TTI.getReductionCost. There
>> >> >> > > is a second relevant cost model: that used by the unroller
>> >> >> > > and inliner. Vectorization happens after inlining, so that
>> >> >> > > interaction is not really relevant, but partial unrolling
>> >> >> > > happens after vectorization, and so the cost model there
>> >> >> > > might want to understand that a complex sequence of shuffles,
>> >> >> > > extracts and adds has a disproportionately-low cost. The same
>> >> >> > > is true of the inliner if the input IR uses vector types and
>> >> >> > > initially contains these operations, but even in that case,
>> >> >> > > you'd not want to canonicalize on the intrinsics too early in
>> >> >> > > case other optimizations remove the need for most of the
>> >> >> > > operations. Thus, in the end, you need pattern-matching code
>> >> >> > > near the end of the pipeline anyway to account for input IR
>> >> >> > > directly containing relevant operations on vector types.
>> >> >> > >
>> >> >> > > In short, I don't understand why we're going in this
>> >> >> > > direction. You can match these operations in the backend, and
>> >> >> > > you can match them in the cost-model code. If we can't do the
>> >> >> > > latter, then we should fix that. And we already have special
>> >> >> > > interfaces for vectorization costs for complex operations
>> >> >> > > such as this.
>> >> >> > >
>> >> >> > >
>> >> >> > > http://reviews.llvm.org/D10964
>> >> >> > >
>> >> >> > >
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > llvm-commits mailing list
>> >> >> > llvm-commits at lists.llvm.org
>> >> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> >> >> >
>> >> >>
>> >> >> --
>> >> >> Hal Finkel
>> >> >> Assistant Computational Scientist
>> >> >> Leadership Computing Facility
>> >> >> Argonne National Laboratory
>> >> >> _______________________________________________
>> >> >> llvm-commits mailing list
>> >> >> llvm-commits at lists.llvm.org
>> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> >> >>
>> >> >
>> >> > --
>> >> > Hal Finkel
>> >> > Assistant Computational Scientist
>> >> > Leadership Computing Facility
>> >> > Argonne National Laboratory
>> >>
>> >
>> > --
>> > Hal Finkel
>> > Assistant Computational Scientist
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

