[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Cong Hou via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 25 16:10:48 PST 2015


On Wed, Nov 25, 2015 at 3:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi Cong,
>
> Thanks!
>
> Based on this, do we need the @llvm.[us]absdiff intrinsics, or should we back them out?

If we don't do pattern recognition at the IR level (in the loop
vectorizer), I don't think those intrinsics (absdiff and hsum) are
useful. But I am not sure whether they are useful in other cases.

Cong

>
>  -Hal
>
> ----- Original Message -----
>> From: "Cong Hou" <congh at google.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>, "llvm-commits" <llvm-commits at lists.llvm.org>
>> Cc: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org, "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>,
>> "chandlerc" <chandlerc at gmail.com>, "David Li" <davidxl at google.com>
>> Sent: Wednesday, November 25, 2015 4:54:08 PM
>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
>>
>> I think this patch is one of several intended to detect the SAD
>> pattern in the loop vectorizer. The patch in
>> http://reviews.llvm.org/D9029 tries to introduce a new operation and
>> an intrinsic for it, and I think this patch is an intermediate step.
>>
>> However, instead of detecting SAD during loop vectorization, I think
>> it is feasible to do it during instruction selection. The only
>> missing information is which operation in the loop is a reduction.
>> After a reduction instruction is vectorized, we actually don't care
>> how elements are organized in the reduction result vector; we only
>> need to make sure the reduction of the elements in the result is
>> identical to the reduction of all elements of both operands. This
>> gives us enough freedom to select better instructions, as some
>> instructions, like psadbw and pmaddwd on X86, already do partial
>> reduction.
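For concreteness, the kind of source loop this SAD recognition targets can be sketched in plain C as follows (function and variable names are illustrative, not taken from the patch):

```c
#include <stdlib.h>

/* A scalar sum-of-absolute-differences loop of the kind the
 * vectorizer is expected to recognize. On X86/SSE2, the whole body
 * can be lowered to psadbw, which already performs a partial
 * reduction internally across groups of bytes. */
int sad(const unsigned char *a, const unsigned char *b, size_t n) {
  int sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += abs((int)a[i] - (int)b[i]);
  return sum;
}
```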
>>
>> The problem is how to detect reduction instructions during
>> instruction selection. My opinion is that we can annotate reduction
>> phis during loop vectorization using metadata. Then, just before
>> lowering, we use those annotated reduction phis to find the real
>> reduction operations (we need to make sure it is safe to detect
>> reduction instructions from reduction phis, given that there are
>> many transformations after loop vectorization), and mark them in
>> SDNodeFlags, which can be checked during instruction combining.
>>
>> I've made a patch based on this idea and managed to detect the SAD
>> pattern from either source code or IR:
>>
>> http://reviews.llvm.org/D14840
>>
>> Cost model is another concern when detecting the SAD pattern. On
>> SSE2, we get the most benefit from psadbw when the loop is unrolled
>> at least 16 times. However, this is currently impossible for two
>> reasons:
>>
>> 1. The vectorization factor is determined by the widest type in the
>> loop by default. For the SAD loop this means VF=4 on SSE2. With the
>> patch in http://reviews.llvm.org/D8943 we now have a switch to
>> determine VF using the smallest type instead.
>>
>> 2. The cost model may prevent VF=16 from being used. This is due to
>> the large costs of promoting/demoting vectors of integers. I am
>> working on this issue now, and the patch under review is
>> http://reviews.llvm.org/D14588. Once it is approved, I will go ahead
>> and update X86's cost table on vector conversions. I think this will
>> let the loop vectorizer choose VF=16 in our case above. We could
>> also consider turning on the switch mentioned above by default,
>> given the improved cost model.
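The reassociation freedom a reduction provides can also be illustrated in plain C: keeping 16 independent partial accumulators (as a VF=16 vectorization effectively does) yields the same total as the strictly sequential loop, as long as the sum stays in range. A sketch, with made-up names:

```c
#include <stdlib.h>

/* Illustrates the reassociation freedom of a reduction: accumulating
 * into 16 independent partial sums (mimicking VF=16) and then doing
 * a final horizontal reduction gives the same result as the strictly
 * sequential loop, provided the total does not overflow. */
int sad_partial16(const unsigned char *a, const unsigned char *b, size_t n) {
  int part[16] = {0};
  for (size_t i = 0; i < n; i++)
    part[i % 16] += abs((int)a[i] - (int)b[i]);
  int sum = 0;
  for (int j = 0; j < 16; j++) /* final horizontal reduction */
    sum += part[j];
  return sum;
}
```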
>>
>> With the same method we can add more reduction pattern recognition in
>> LLVM.
>>
>> Comments are welcome!
>>
>>
>> thanks,
>> Cong
>>
>>
>> On Wed, Nov 25, 2015 at 2:29 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>> > Cong,
>> >
>> > Could you please comment on what you think we should do here? It
>> > seems like we've now had several threads on this topic, and I'm
>> > not sure what the consensus is right now.
>> >
>> >  -Hal
>> >
>> > ----- Original Message -----
>> >> From: "Hal Finkel via llvm-commits" <llvm-commits at lists.llvm.org>
>> >> To: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>> >> Cc: llvm-commits at lists.llvm.org,
>> >> reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org
>> >> Sent: Tuesday, October 27, 2015 11:59:53 PM
>> >> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>> >> corresponding SDNodes for horizontal sum operation.
>> >>
>> >> ----- Original Message -----
>> >> > From: "Asghar-ahmad Shahid via llvm-commits"
>> >> > <llvm-commits at lists.llvm.org>
>> >> > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>> >> > "renato golin" <renato.golin at linaro.org>, "james molloy"
>> >> > <james.molloy at arm.com>
>> >> > Cc: llvm-commits at lists.llvm.org
>> >> > Sent: Sunday, September 27, 2015 11:42:46 PM
>> >> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>> >> > and
>> >> > corresponding SDNodes for horizontal sum operation.
>> >> >
>> >> > Hi Hal,
>> >> >
>> >> > Response inlined.
>> >> >
>> >> > Regards,
>> >> > Shahid
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
>> >> > > Sent: Saturday, September 26, 2015 2:58 AM
>> >> > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
>> >> > > hfinkel at anl.gov;
>> >> > > james.molloy at arm.com
>> >> > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
>> >> > > commits at lists.llvm.org
>> >> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*'
>> >> > > and
>> >> > > corresponding SDNodes for horizontal sum operation.
>> >> > >
>> >> > > hfinkel added a comment.
>> >> > >
>> >> > > I need to take a step back here; why are we doing this again?
>> >> > >
>> >> > > I read again the RFC thread
>> >> > > (http://lists.llvm.org/pipermail/llvm-dev/2015-
>> >> > > May/085078.html), and it ended with the following (from
>> >> > > Renato):
>> >> > >
>> >> > > > > BTW, now my plan is to just add the two intrinsics for
>> >> > > > > 'absolute difference' and 'horizontal add'.
>> >> > >
>> >> > > > That's ok, as long as they're impossible to represent in
>> >> > > > plain IR.
>> >> > >
>> >> > >
>> >> > > and I think that we were all in agreement on this point. But
>> >> > > now
>> >> > > I'm not sure
>> >> > > you've demonstrated the prerequisite.
>> >> > Prior to this RFC there was a discussion around a patch
>> >> > (http://reviews.llvm.org/D9029) regarding the llvm.sad
>> >> > intrinsic, which highlighted the need for llvm.hsum and related
>> >> > intrinsics. This RFC was a result of that discussion.
>> >> >
>> >> > > The underlying operations here (and
>> >> > > in http://reviews.llvm.org/D10867) seem like they are
>> >> > > representable
>> >> > > using IR
>> >> > > (as demonstrated by the fact that you provide
>> >> > > potentially-equivalent IR
>> >> > > sequences in the documentation), except for the ordering
>> >> > > freedom
>> >> > > here.
>> >> >
>> >> > Of course the logical expansion of the underlying operation is
>> >> > trivially representable, but the pattern matching in codegen is
>> >> > not. Actually, the pattern matching is complex and may well be
>> >> > fragile in codegen due to the long chain of generic IR to be
>> >> > matched. Moreover, movements of truncate/extend ops in the DAG
>> >> > further increase the complexity of matching.
>> >>
>> >> It might be fragile, but I suggest trying it so that the issues
>> >> are clear. I don't see how you'd end up with a lot of
>> >> truncates/extends in the middle of a horizontal add that we'd
>> >> otherwise be able to match. It also makes the compiler more
>> >> robust, because we'll be able to generate good code from
>> >> hand-written patterns, not just from auto-vectorized code.
>> >>
>> >> >
>> >> > >
>> >> > > And this, I fear, is where we run into trouble. The thing
>> >> > > that is not representable in the IR is that the order of
>> >> > > operations in the horizontal sum is undefined while, at the
>> >> > > same time, signed overflow is undefined. This cannot be
>> >> > > represented in the IR because nsw adds don't reassociate, and
>> >> > > thus there's no way to represent the sequence of nsw adds
>> >> > > such that they can be reassociated while retaining their nsw
>> >> > > property. But the problem is that, because this freedom
>> >> > > cannot be represented in the IR, we also can't generate it
>> >> > > from IR in a semantics-preserving way; and, thus, it would
>> >> > > not be legal to generate it in the vectorizers.
>> >> >
>> >> > As proposed, we are not going to generate it unconditionally in
>> >> > the vectorizer, but only based on target checks; can't we also
>> >> > check for ordering relaxation there? Or fix the ordering
>> >> > constraint by some other means.
>> >>
>> >> No; it is a semantics problem. The vectorizer would need to prove
>> >> something it seems unlikely to be able to prove.
>> >>
>> >>  -Hal
>> >>
>> >> >
>> >> > >
>> >> > > Thus, this change does not seem right, and approving
>> >> > > http://reviews.llvm.org/D10867 seems like a mistake as well.
>> >> > > We
>> >> > > could
>> >> > > certainly fix the definition here to make it exactly
>> >> > > representable
>> >> > > in the IR, but
>> >> > > then what's the point of doing so?
>> >> > >
>> >> > > In the RFC, you mentioned cost modeling as a major motivation
>> >> > > for
>> >> > > adding
>> >> > > intrinsics, but that seems like an unrelated problem (at least
>> >> > > in
>> >> > > part).
>> >> >
>> >> > In the RFC, I had also proposed the llvm.sad intrinsic, which I
>> >> > thought would make cost modeling easier for those targets that
>> >> > support the sad operation natively; hence the motivation. But
>> >> > Renato's explanation clarified the things related to cost
>> >> > modeling. However, the complexity of pattern matching was
>> >> > always the bigger issue.
>> >> >
>> >> > > During
>> >> > > vectorization, we can use special interfaces to estimate the
>> >> > > cost
>> >> > > of complex
>> >> > > patterns. In fact, we already have an interface for
>> >> > > reductions:
>> >> > > TTI.getReductionCost. There is a second relevant cost model:
>> >> > > that used by
>> >> > > the unroller and inliner. Vectorization happens after
>> >> > > inlining,
>> >> > > so
>> >> > > that
>> >> > > interaction is not really relevant, but partial unrolling
>> >> > > happens
>> >> > > after
>> >> > > vectorization, and so the cost model there might want to
>> >> > > understand
>> >> > > that a
>> >> > > complex sequence of shuffles, extracts and adds has a
>> >> > > disproportionately-
>> >> > > low cost. The same is true of the inliner if the input IR uses
>> >> > > vector types and
>> >> > > initially contains these operations, but even in that case,
>> >> > > you'd
>> >> > > not want to
>> >> > > canonicalize on the intrinsics too early in case other
>> >> > > optimizations remove the
>> >> > > need for most of the operations. Thus, in the end, you need
>> >> > > pattern-
>> >> > > matching code near the end of the pipeline anyway to account
>> >> > > for
>> >> > > input IR
>> >> > > directly containing relevant operations on vector types.
>> >> > >
>> >> > > In short, I don't understand why we're going in this
>> >> > > direction.
>> >> > > You
>> >> > > can match
>> >> > > these operations in the backend, and you can match them in the
>> >> > > cost-model
>> >> > > code. If we can't do the latter, then we should fix that. And
>> >> > > we
>> >> > > already have
>> >> > > special interfaces for vectorization costs for complex
>> >> > > operations
>> >> > > such as this.
>> >> > >
>> >> > >
>> >> > > http://reviews.llvm.org/D10964
>> >> > >
>> >> > >
>> >> >
>> >> > _______________________________________________
>> >> > llvm-commits mailing list
>> >> > llvm-commits at lists.llvm.org
>> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> >> >
>> >>
>> >> --
>> >> Hal Finkel
>> >> Assistant Computational Scientist
>> >> Leadership Computing Facility
>> >> Argonne National Laboratory
>> >>
>> >
>> > --
>> > Hal Finkel
>> > Assistant Computational Scientist
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
