[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Cong Hou via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 25 14:54:08 PST 2015


I think this patch is one of several aimed at detecting the SAD
pattern in the loop vectorizer. The patch in http://reviews.llvm.org/D9029
tries to introduce a new operation and an intrinsic for it, and I think
this patch is an intermediate step.

However, instead of detecting SAD during loop vectorization, I think
it is feasible to do it during instruction selection. The only missing
information is which operation in the loop is a reduction. After a
reduction instruction is vectorized, we don't actually care how the
elements are organized in the reduction result vector; we only need to
make sure that reducing the elements of the result gives the same value
as reducing all the elements of both operands. This gives us enough
freedom to select better instructions, since some instructions, such as
psadbw and pmaddwd on X86, already perform a partial reduction.
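
Here is a rough sketch (purely illustrative; %a, %b and %acc are
placeholder names) of the kind of freedom this gives us on SSE2: the
psadbw instruction, exposed as the llvm.x86.sse2.psad.bw intrinsic,
already reduces sixteen i8 absolute differences into two i64 partial
sums, and since only the final total of the reduction matters,
accumulating those partial sums is just as good as accumulating all
sixteen lanes:

  declare <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8>, <16 x i8>)

    ; %a, %b: the two <16 x i8> inputs; %acc: the reduction accumulator
    %sad      = call <2 x i64> @llvm.x86.sse2.psad.bw(<16 x i8> %a, <16 x i8> %b)
    %acc.next = add <2 x i64> %acc, %sad   ; two partial sums; their layout is irrelevant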

The problem is how to detect reduction instructions during instruction
selection. My opinion is that we can annotate reduction phis during
loop vectorization using metadata. Then, just before lowering, we use
those annotated reduction phis to find the real reduction operations
(we need to make sure it is safe to recover reduction instructions from
reduction phis, given the many transformations that run after loop
vectorization), and mark them in SDNodeFlags, which can then be checked
during instruction combining.
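
A minimal sketch of the kind of annotation I have in mind (the
!reduction metadata kind below is purely illustrative, not an existing
convention; the vectorizer would attach it and the backend would
consume it just before lowering):

  loop:
    %red      = phi <4 x i32> [ zeroinitializer, %entry ], [ %red.next, %loop ], !reduction !0
    ; ... vector loop body computes %val ...
    %red.next = add <4 x i32> %red, %val
    ; ...

  !0 = !{!"add"}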

I've made a patch based on this idea and managed to detect the SAD
pattern from either source code or IR:

http://reviews.llvm.org/D14840

The cost model is another concern when detecting the SAD pattern. On
SSE2, we get the most benefit from psadbw when the loop is vectorized
by a factor of at least 16. However, this is currently impossible for
two reasons:

1. The vectorization factor is currently determined by the widest type
in the loop by default. For a SAD loop, whose widest type is the i32
accumulator, this means VF=4 on SSE2, even though the i8 inputs would
allow VF=16 (see the sketch after this list). With the patch in
http://reviews.llvm.org/D8943 we now have a switch to determine VF
using the smallest type instead.

2. The cost model may prevent VF=16 from being used, due to the large
costs of promoting/demoting vectors of integers. I am working on this
issue now; the patch under review is http://reviews.llvm.org/D14588.
Once it is approved, I will go ahead and update X86's cost table for
vector conversions. I think this will let the loop vectorizer choose 16
in the case above. With the improved cost model we can also consider
turning on the switch mentioned above by default.
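
To make point 1 concrete, here is a small sketch (value names are
illustrative) of a SAD loop body: the widest type is the i32
accumulator, so sizing vectors by it gives 128/32 = 4 lanes on SSE2,
while sizing by the smallest type, the i8 inputs, gives 128/8 = 16
lanes, which is what psadbw wants.

    %a     = load i8, i8* %pa            ; smallest type in the loop: i8
    %b     = load i8, i8* %pb
    %a.ext = zext i8 %a to i32
    %b.ext = zext i8 %b to i32
    ; ... abs(%a.ext - %b.ext) is accumulated into an i32 sum, the
    ; widest type in the loop, which currently determines VF = 4 ...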

With the same method, we can add recognition of more reduction patterns to LLVM.

Comments are welcome!


thanks,
Cong


On Wed, Nov 25, 2015 at 2:29 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Cong,
>
> Could you please comment on what you think we should do here? It seems like we've now had several threads on this topic, and I'm not sure what the consensus is right now.
>
>  -Hal
>
> ----- Original Message -----
>> From: "Hal Finkel via llvm-commits" <llvm-commits at lists.llvm.org>
>> To: "Asghar-ahmad Shahid" <Asghar-ahmad.Shahid at amd.com>
>> Cc: llvm-commits at lists.llvm.org, reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org
>> Sent: Tuesday, October 27, 2015 11:59:53 PM
>> Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.
>>
>> ----- Original Message -----
>> > From: "Asghar-ahmad via llvm-commits Shahid"
>> > <llvm-commits at lists.llvm.org>
>> > To: reviews+D10964+public+8430e6553a631d12 at reviews.llvm.org,
>> > "renato golin" <renato.golin at linaro.org>, "james molloy"
>> > <james.molloy at arm.com>
>> > Cc: llvm-commits at lists.llvm.org
>> > Sent: Sunday, September 27, 2015 11:42:46 PM
>> > Subject: RE: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>> > corresponding SDNodes for horizontal sum operation.
>> >
>> > Hi Hal,
>> >
>> > Response inlined.
>> >
>> > Regards,
>> > Shahid
>> >
>> > > -----Original Message-----
>> > > From: hfinkel at anl.gov [mailto:hfinkel at anl.gov]
>> > > Sent: Saturday, September 26, 2015 2:58 AM
>> > > To: Shahid, Asghar-ahmad; renato.golin at linaro.org;
>> > > hfinkel at anl.gov;
>> > > james.molloy at arm.com
>> > > Cc: llvm-dev at redking.me.uk; bruno.cardoso at gmail.com; llvm-
>> > > commits at lists.llvm.org
>> > > Subject: Re: [PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and
>> > > corresponding SDNodes for horizontal sum operation.
>> > >
>> > > hfinkel added a comment.
>> > >
>> > > I need to take a step back here; why are we doing this again?
>> > >
>> > > I read again the RFC thread
>> > > (http://lists.llvm.org/pipermail/llvm-dev/2015-
>> > > May/085078.html), and it ended with the following (from Renato):
>> > >
>> > > > > BTW, now my plan is to just add the two intrinsics for
>> > > > > 'absolute difference' and 'horizontal add'.
>> > > >
>> > > > That's ok, as long as they're impossible to represent in plain
>> > > > IR.
>> > >
>> > >
>> > > and I think that we were all in agreement on this point. But now
>> > > I'm not sure
>> > > you've demonstrated the prerequisite.
>> > Prior to this RFC there was a discussion around a patch
>> > (http://reviews.llvm.org/D9029) regarding an llvm.sad intrinsic,
>> > which highlighted the need for llvm.hsum and friend intrinsics.
>> > This RFC was a result of that discussion.
>> >
>> > > The underlying operations here (and
>> > > in http://reviews.llvm.org/D10867) seem like they are
>> > > representable
>> > > using IR
>> > > (as demonstrated by the fact that you provide
>> > > potentially-equivalent IR
>> > > sequences in the documentation), except for the ordering freedom
>> > > here.
>> >
>> > Of course the logical expansion of the underlying operation is
>> > trivially representable, but the pattern matching in codegen is
>> > not. Pattern matching is complex and may well be fragile in
>> > codegen due to the long chain of generic IR that has to be
>> > matched. Moreover, movement of truncate/extend ops in the DAG
>> > increases the complexity of matching further.
>>
>> It might be fragile, but I suggest trying so that the issues are
>> clear. I don't see how you'd end up with a lot of truncate/extends
>> in the middle of a horizontal add that we'd otherwise be able to
>> match. It also makes the compiler more robust because we'll be able
>> to generate good code from hand-written patterns, not just from
>> auto-vectorized code.
>>
>> >
>> > >
>> > > And this, I fear, is where we run into trouble. The thing that is
>> > > not
>> > > representable in the IR is that the order of operations in the
>> > > horizontal sum is
>> > > undefined, and at the same time, signed overflow is undefined.
>> > > This
>> > > cannot
>> > > be represented in the IR because nsw adds don't reassociate, and
>> > > thus,
>> > > there's no way to represent the sequence of nsw adds such that
>> > > they
>> > > can be
>> > > reassociated while retaining their nsw property. But the problem
>> > > is
>> > > that,
>> > > because this freedom cannot be represented in the IR, we also
>> > > can't
>> > > generate it from IR in a semantics-preserving way; and, thus, it
>> > > would not be
>> > > legal to generate it in the vectorizers.
>> >
>> > As proposed, we are not going to generate it unconditionally in
>> > the vectorizer, but only based on target checks; can't we also
>> > check for ordering relaxation there, or fix the ordering
>> > constraint by some other means?
>>
>> No; it is a semantics problem. The vectorizer would need to prove
>> something it seems unlikely to be able to prove.
>>
>>  -Hal
>>
>> >
>> > >
>> > > Thus, this change does not seem right, and approving
>> > > http://reviews.llvm.org/D10867 seems like a mistake as well. We
>> > > could
>> > > certainly fix the definition here to make it exactly
>> > > representable
>> > > in the IR, but
>> > > then what's the point of doing so?
>> > >
>> > > In the RFC, you mentioned cost modeling as a major motivation for
>> > > adding
>> > > intrinsics, but that seems like an unrelated problem (at least in
>> > > part).
>> >
>> > In the RFC, I had also proposed an llvm.sad intrinsic, which I
>> > thought would make cost modeling easier for targets that support
>> > the sad operation natively; hence the motivation. But Renato's
>> > explanation clarified the points related to cost modeling.
>> > However, the complexity of pattern matching was always the bigger
>> > issue.
>> >
>> > > During
>> > > vectorization, we can use special interfaces to estimate the cost
>> > > of complex
>> > > patterns. In fact, we already have an interface for reductions:
>> > > TTI.getReductionCost. There is a second relevant cost model: that
>> > > used by
>> > > the unroller and inliner. Vectorization happens after inlining,
>> > > so
>> > > that
>> > > interaction is not really relevant, but partial unrolling happens
>> > > after
>> > > vectorization, and so the cost model there might want to
>> > > understand
>> > > that a
>> > > complex sequence of shuffles, extracts and adds has a
>> > > disproportionately-
>> > > low cost. The same is true of the inliner if the input IR uses
>> > > vector types and
>> > > initially contains these operations, but even in that case, you'd
>> > > not want to
>> > > canonicalize on the intrinsics too early in case other
>> > > optimizations remove the
>> > > need for most of the operations. Thus, in the end, you need
>> > > pattern-
>> > > matching code near the end of the pipeline anyway to account for
>> > > input IR
>> > > directly containing relevant operations on vector types.
>> > >
>> > > In short, I don't understand why we're going in this direction.
>> > > You
>> > > can match
>> > > these operations in the backend, and you can match them in the
>> > > cost-model
>> > > code. If we can't do the latter, then we should fix that. And we
>> > > already have
>> > > special interfaces for vectorization costs for complex operations
>> > > such as this.
>> > >
>> > >
>> > > http://reviews.llvm.org/D10964
>> > >
>> > >
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> >
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

