[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

James Molloy james at jamesmolloy.co.uk
Fri Nov 28 13:06:38 PST 2014


Hi Arnold,

Thanks for cc'ing me on this. As we discussed at the devmtg, my personal
view on this is that the reductions might be better represented as an
intrinsic - the matching code is rather complex for the system of shuffles,
is duplicated in all backends and is not particularly robust due to the
complexity of the pattern.

Intrinsics could lower to this pattern if there is no ISA support for a
target- in the meantime it keeps the semantics without allowing later
passes to muck up the matchable pattern.

I have a patch mostly implementing this but it's stuck in my copious
post-devmtg queue (notably with the LNT improvments I promised...)

What's your opinion on this?

Cheers,

James

On Fri, 28 Nov 2014 at 21:00, Arnold Schwaighofer <aschwaighofer at apple.com>
wrote:

> On 11/28/14, suyog sarda   wrote:
> > The IR will probably have something like :
> >
> >
> > 1. Extract a[0] and put it in vec1 <2 x i32>, 0
> > 2. Extract a[1] and put it in vec1 <2 x i32>, 1
> > 2. Extract a[2] and put it in vec2 <2 x i32>, 0
> > 3. Extract a[3] and put it in vec2 <2 x i32>, 1
> > 4. Add vec1 and vec2, sum in vec3 <2 x i32>
> > 5. Extract vec3[0] in sum1
> > 6. Extract vec3[1] in sum2
> > 7 add sum1 and sum2 in sum3
> > 8. return sum3
> >
> >
> > So overall instructions - 6 'extractlement', 4 'insertelement', 1 vector
> add, 1 scalar add and 1 return statement. We have vectorized add operation.
>
> Hi Suyog,
>
> Have a look at the code in HorizontalReduction::getReductionCost and
> HorizontalReduction::emitReduction.
>
> You don't need 4 extracts. This can be modeled at the IR level as a
> combination of shufflevector and vector add instruction on a <4 x i32>
> vector. TargetTransformInfo::getReductionCost can return the appropriate
> cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if
> codegen can implement this sequence of instructions more efficiently.
>
> For a <4 x i32> reduction you need only need two vector shuffles, two
> vector adds and one vector extract to get the scalar result.
>
> vadd <0, 1, 2, 3>
>          <2, 3, x, x> // shuffled
> =>
>
> <0+2, 1+3, x, x>
>
>
> vadd <0+2, 1+3, x x>
>          <1+3, x, x x> // shuffled
> =>
>
> <0+2+1+3, x, x, x>
>
> What it takes to get your example working in the SLPVectorizer is:
>
> * Get the matching code up to snuff. I think, we should replace the depth
> first search matcher by explicitly matching the trees we expect in
> HorizontalReduction::matchReduction. The code should just look for:
>
>    (+ (+ (+ v1 v2) v3) v4)
>     and maybe
>     (+ ( + v1 v2) (+ v3 v4))
>
>     explicitly for v1, .., vn identical operations.
>
> * Allow a tree of size of one (the vector loads) if the tree feeds a
> reduction.
>
> * Adjust the cost model AArch64::getReductionCost
>
> * AArch64 CodeGen would have to recognize the shuffle reduction if it does
> not do so already
>
>
>
> Best,
> Arnold
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141128/1afd60b6/attachment.html>


More information about the llvm-dev mailing list