[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization

Arnold Schwaighofer aschwaighofer at apple.com
Fri Nov 28 12:43:07 PST 2014


On 11/28/14, suyog sarda   wrote:
> The IR will probably have something like :
> 
> 
> 1. Extract a[0] and put it in vec1 <2 x i32>, 0
> 2. Extract a[1] and put it in vec1 <2 x i32>, 1
> 2. Extract a[2] and put it in vec2 <2 x i32>, 0
> 3. Extract a[3] and put it in vec2 <2 x i32>, 1
> 4. Add vec1 and vec2, sum in vec3 <2 x i32>
> 5. Extract vec3[0] in sum1 
> 6. Extract vec3[1] in sum2
> 7 add sum1 and sum2 in sum3
> 8. return sum3
> 
> 
> So overall instructions - 6 'extractlement', 4 'insertelement', 1 vector add, 1 scalar add and 1 return statement. We have vectorized add operation.

Hi Suyog,

Have a look at the code in HorizontalReduction::getReductionCost and HorizontalReduction::emitReduction.

You don't need 4 extracts. This can be modeled at the IR level as a combination of shufflevector and vector add instruction on a <4 x i32> vector. TargetTransformInfo::getReductionCost can return the appropriate cost (for example, one for AArch64::getReductionCost(add, <4 x i32>)) if codegen can implement this sequence of instructions more efficiently.

For a <4 x i32> reduction you need only need two vector shuffles, two vector adds and one vector extract to get the scalar result.

vadd <0, 1, 2, 3>
         <2, 3, x, x> // shuffled
=>

<0+2, 1+3, x, x>


vadd <0+2, 1+3, x x>
         <1+3, x, x x> // shuffled
=> 

<0+2+1+3, x, x, x>

What it takes to get your example working in the SLPVectorizer is:

* Get the matching code up to snuff. I think, we should replace the depth first search matcher by explicitly matching the trees we expect in HorizontalReduction::matchReduction. The code should just look for:
    
   (+ (+ (+ v1 v2) v3) v4)
    and maybe
    (+ ( + v1 v2) (+ v3 v4))
    
    explicitly for v1, .., vn identical operations.

* Allow a tree of size of one (the vector loads) if the tree feeds a reduction.

* Adjust the cost model AArch64::getReductionCost

* AArch64 CodeGen would have to recognize the shuffle reduction if it does not do so already



Best,
Arnold




More information about the llvm-dev mailing list