Hi Arnold,<br><br>Thanks for cc'ing me on this. As we discussed at the devmtg, my personal view is that the reductions might be better represented as an intrinsic: the matching code for the system of shuffles is rather complex, is duplicated in all backends, and is not particularly robust because of the pattern's complexity. <br><br>Intrinsics could lower to this pattern if a target has no ISA support; in the meantime, an intrinsic keeps the semantics without allowing later passes to muck up the matchable pattern. <br><br>I have a patch mostly implementing this but it's stuck in my copious post-devmtg queue (notably with the LNT improvements I promised...)<br><br>What's your opinion on this?<br><br>Cheers,<br><br>James<br><br><div class="gmail_quote">On Fri, 28 Nov 2014 at 21:00, Arnold Schwaighofer <<a href="mailto:aschwaighofer@apple.com">aschwaighofer@apple.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 11/28/14, suyog sarda   wrote:<br>

> The IR will probably have something like :<br>

><br>

><br>

> 1. Extract a[0] and put it in vec1 <2 x i32>, 0<br>

> 2. Extract a[1] and put it in vec1 <2 x i32>, 1<br>

> 3. Extract a[2] and put it in vec2 <2 x i32>, 0<br>

> 4. Extract a[3] and put it in vec2 <2 x i32>, 1<br>

> 5. Add vec1 and vec2, sum in vec3 <2 x i32><br>

> 6. Extract vec3[0] in sum1<br>

> 7. Extract vec3[1] in sum2<br>

> 8. Add sum1 and sum2 in sum3<br>

> 9. Return sum3<br>

><br>

><br>

> So overall instructions - 6 'extractelement', 4 'insertelement', 1 vector add, 1 scalar add and 1 return statement. We have vectorized add operation.<br>

<br>

Hi Suyog,<br>

<br>

Have a look at the code in HorizontalReduction::getReductionCost and HorizontalReduction::emitReduction.<br>

<br>

You don't need 4 extracts. This can be modeled at the IR level as a combination of shufflevector and vector add instructions on a <4 x i32> vector. TargetTransformInfo::getReductionCost can return the appropriate cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if codegen can implement this sequence of instructions more efficiently.<br>

<br>

For a <4 x i32> reduction you need only two vector shuffles, two vector adds, and one vector extract to get the scalar result.<br>

<br>

vadd <0, 1, 2, 3><br>

         <2, 3, x, x> // shuffled<br>

=><br>

<br>

<0+2, 1+3, x, x><br>

<br>

<br>

vadd <0+2, 1+3, x, x><br>

         <1+3, x, x, x> // shuffled<br>

=> <br>

<br>

<0+2+1+3, x, x, x><br>
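
The two shuffle/add steps above can be sketched as a small scalar simulation in C++. This is only an illustration, not LLVM code: the names Vec4, vadd, shuffle and reduceAdd are hypothetical, and the undefined "x" lanes are arbitrarily modelled as 0.<br>

```cpp
#include <array>
#include <cstdint>

// Stand-in for a <4 x i32> vector (hypothetical type for this sketch).
using Vec4 = std::array<int32_t, 4>;

// Lane-wise add, modelling a <4 x i32> vector add.
Vec4 vadd(const Vec4 &a, const Vec4 &b) {
  Vec4 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = a[i] + b[i];
  return r;
}

// Single-source shuffle: mask[i] selects a lane of v.
// A negative mask entry is an undefined ("x") lane; it is modelled as 0 here.
Vec4 shuffle(const Vec4 &v, const std::array<int, 4> &mask) {
  Vec4 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = mask[i] < 0 ? 0 : v[mask[i]];
  return r;
}

// Two shuffles, two vector adds, one extract.
int32_t reduceAdd(const Vec4 &v) {
  Vec4 s1 = vadd(v, shuffle(v, {2, 3, -1, -1}));     // <0+2, 1+3, x, x>
  Vec4 s2 = vadd(s1, shuffle(s1, {1, -1, -1, -1}));  // <0+2+1+3, x, x, x>
  return s2[0];                                      // extract lane 0
}
```

Only lane 0 of the final vector is meaningful; the other lanes hold whatever the undefined shuffle lanes produced.<br>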

<br>

What it takes to get your example working in the SLPVectorizer is:<br>

<br>

* Get the matching code up to snuff. I think we should replace the depth-first search matcher by explicitly matching the trees we expect in HorizontalReduction::matchReduction. The code should just look for:<br>

    <br>

   (+ (+ (+ v1 v2) v3) v4)<br>

    and maybe<br>

    (+ (+ v1 v2) (+ v3 v4))<br>

    <br>

    explicitly, where v1, ..., vn are identical operations.<br>

<br>

* Allow a tree of size one (the vector loads) if the tree feeds a reduction.<br>

<br>

* Adjust the cost model AArch64::getReductionCost<br>

<br>

* AArch64 CodeGen would have to recognize the shuffle reduction if it does not do so already<br>
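
As a rough illustration of the explicit matching suggested in the first bullet, here is a self-contained C++ sketch on a toy expression tree. It does not use any LLVM data structures: Expr, leaf, add, matchLinearChain and matchBalanced4 are hypothetical names, and an integer opcode stands in for "identical operations".<br>

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

// Toy expression node (hypothetical; not an LLVM type).
struct Expr {
  bool isAdd;        // true: binary add, false: leaf operation
  int opcode;        // leaf kind, used to check "identical operations"
  ExprPtr lhs, rhs;  // operands, only set for adds
};

ExprPtr leaf(int opcode) {
  return std::make_shared<Expr>(Expr{false, opcode, nullptr, nullptr});
}
ExprPtr add(ExprPtr l, ExprPtr r) {
  return std::make_shared<Expr>(Expr{true, 0, std::move(l), std::move(r)});
}

// True if all collected leaves are the same kind of operation.
bool sameOpcode(const std::vector<const Expr *> &leaves) {
  for (const Expr *l : leaves)
    if (l->opcode != leaves.front()->opcode)
      return false;
  return true;
}

// Matches the left-linear chain (+ (+ (+ v1 v2) v3) v4), for any length.
// 'leaves' is only meaningful when the function returns true.
bool matchLinearChain(const ExprPtr &root, std::vector<const Expr *> &leaves) {
  leaves.clear();
  const Expr *e = root.get();
  if (!e || !e->isAdd)
    return false;
  while (e->isAdd) {
    if (e->rhs->isAdd)
      return false;  // right operand must be a leaf
    leaves.insert(leaves.begin(), e->rhs.get());
    e = e->lhs.get();
  }
  leaves.insert(leaves.begin(), e);  // innermost left leaf, v1
  return sameOpcode(leaves);
}

// True if e is an add whose operands are both leaves.
bool isAddOfLeaves(const Expr *e) {
  return e->isAdd && !e->lhs->isAdd && !e->rhs->isAdd;
}

// Matches the balanced form (+ (+ v1 v2) (+ v3 v4)).
bool matchBalanced4(const ExprPtr &root, std::vector<const Expr *> &leaves) {
  leaves.clear();
  if (!root || !root->isAdd)
    return false;
  const Expr *l = root->lhs.get();
  const Expr *r = root->rhs.get();
  if (!isAddOfLeaves(l) || !isAddOfLeaves(r))
    return false;
  leaves = {l->lhs.get(), l->rhs.get(), r->lhs.get(), r->rhs.get()};
  return sameOpcode(leaves);
}
```

Matching two fixed shapes keeps the matcher easy to reason about compared with a general depth-first search, at the cost of rejecting any other association of the adds.<br>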

<br>

<br>

<br>

Best,<br>

Arnold<br>

</blockquote></div>