RFC: Modeling horizontal vector reductions

Wed Sep 11 15:27:07 PDT 2013

----- Original Message -----
> Hi all,
> 
> 
> I want to model horizontal vector reductions in the cost model and
> vectorizers. This matters most in the SLP vectorizer, where the
> reduction is more likely to be part of a hot code region, so we want
> to model the cost more precisely and generate efficient code.
> 
> A horizontal reduction can be modeled by a sequence of shufflevector
> and add/fadd/reduction operations modeling the reduction tree.
> 
> In the loop vectorizer we have chosen to use a tree that splits the
> vector in half at every level and adds the halves:
> 
> (v0, v1, v2, v3)
>   \   \  /  /
>     \  \  /
>       +  +
> 
> (v0+v2, v1+v3, undef, undef)
> 
>    \      /
> ((v0+v2) + (v1+v3), undef, undef, undef)
> 
> define fastcc float @reduction_cost_float(<4 x float> %rdx) {
>   %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef,
>         <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
>   %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
>   %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef,
>         <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>   %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
>   %r = extractelement <4 x float> %bin.rdx8, i32 0
>   ret float %r
> }
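
For readers who prefer scalar code, the split tree above can be sketched in C++ as follows. This is only an illustration of the data flow, not code from the patch: `splitReduce` is a hypothetical name, and a `std::vector<float>` stands in for the IR vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: models the "split in half" reduction tree on a plain
// vector. splitReduce is a hypothetical name, not an LLVM function.
float splitReduce(std::vector<float> v) {
  // The tree halves the vector at every level, so the lane count must be a
  // nonzero power of two.
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0);
  for (std::size_t width = v.size() / 2; width >= 1; width /= 2) {
    // Plays the role of shufflevector + fadd: lane i receives lane i + width.
    for (std::size_t i = 0; i < width; ++i)
      v[i] += v[i + width];
  }
  return v[0]; // corresponds to extractelement ..., i32 0
}
```

For four lanes this computes (v0+v2) + (v1+v3), the same association as the IR function above.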
> 
> However, this form cannot be matched to the shortest sequence of
> instructions on a platform that has pairwise vector fadds
> (haddps on Intel, VPADD on ARM, faddp on AArch64) because we don’t
> have fast-math instruction flags at the SelectionDAG level and
> therefore cannot reassociate the tree:
> 
> xmm = haddps(xmm, xmm)
> xmm = haddps(xmm, xmm)
> xmm[0] now contains the reduction result (also [1], [2], [3] but for
> this purpose we don’t care)
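
A scalar model of that two-instruction sequence may make the lane movement easier to follow. The sketch below assumes the usual four-lane pairwise-add semantics (adjacent pairs of the first operand fill the low two lanes, adjacent pairs of the second fill the high two); `hadd` and `reduceWithHadd` are hypothetical helper names, not intrinsics.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Scalar model of a 4-lane pairwise add in the style of haddps: the low two
// result lanes come from adjacent pairs of a, the high two from b.
Vec4 hadd(const Vec4 &a, const Vec4 &b) {
  return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Two pairwise adds of a register with itself leave the full sum in every lane.
Vec4 reduceWithHadd(Vec4 x) {
  x = hadd(x, x); // (v0+v1, v2+v3, v0+v1, v2+v3)
  x = hadd(x, x); // (v0+v1)+(v2+v3) in all four lanes
  return x;
}
```

Lane 0 of the result holds the reduction, and the other lanes carry the same value as a side effect.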
> 
> A form that matches to pairwise vector adds is a tree like this:
> 
> (v0, v1, v2, v3)
>  \   /    \  /
> (v0+v1, v2+v3, undef, undef)
>    \     /
> ((v0+v1)+(v2+v3), undef, undef, undef)
> 
> define fastcc float @pairwise_hadd(<4 x float> %rdx) {
>   %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
>         <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
>   %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
>         <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
>   %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
>   %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
>         <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
>   %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
>         <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>   %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
>   %r = extractelement <4 x float> %bin.rdx.1, i32 0
>   ret float %r
> }
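
The two trees add the same four values but associate them differently, which is why one cannot be rewritten into the other without permission to reassociate. A minimal sketch of the two associations, with hypothetical function names and assuming IEEE single-precision arithmetic compiled without fast-math options:

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Association used by the split tree: (v0+v2) + (v1+v3).
float splitOrder(const Vec4 &v) { return (v[0] + v[2]) + (v[1] + v[3]); }

// Association used by the pairwise tree: (v0+v1) + (v2+v3).
float pairwiseOrder(const Vec4 &v) { return (v[0] + v[1]) + (v[2] + v[3]); }
```

With the input (1e20, 1, -1e20, 1), the split order cancels the large terms first and yields 2, while the pairwise order absorbs each 1 into a large term and yields 0.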
> 
> 
> Therefore, I would like to model horizontal reductions as either
> version, depending on which is deemed cheaper by the cost model.
> 
> It is a bit unfortunate to not have one canonical form but I don’t
> think this justifies adding fast-math flags to isel (which will
> eventually go away).
> 
> Attached is a patch that adds code for the cost model to illustrate
> how I plan to implement this. The SLP vectorizer would use the same
> infrastructure to estimate the cost of reductions (of more than two
> elements) and generate code based on it.
> 
> At a high level I want to add the following API to
> TargetTransformInfo:
> 
>  unsigned getReductionCost(unsigned Opcode, Type *Ty,
>                            bool IsPairwise) const;
> 
> Clients can then decide, based on the returned cost, which reduction
> form they want to use.
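
How a client might consume such a query can be sketched with a toy stand-in. Everything here is an assumption made for illustration: `ToyTarget` is not the real TargetTransformInfo, the hook takes a lane count instead of `Type *Ty` and omits the opcode so the sketch stays standalone, and the unit costs are invented.

```cpp
#include <cassert>

// Hypothetical stand-in for a target's cost tables; not the real
// TargetTransformInfo. All costs below are made-up unit costs.
struct ToyTarget {
  bool HasPairwiseAdd; // e.g. a haddps-like instruction is available

  // Mirrors the shape of the proposed query, simplified as described above.
  unsigned getReductionCost(unsigned NumLanes, bool IsPairwise) const {
    // Both trees assume a nonzero power-of-two lane count.
    assert(NumLanes != 0 && (NumLanes & (NumLanes - 1)) == 0);
    unsigned Levels = 0; // log2(NumLanes) levels in the reduction tree
    for (unsigned N = NumLanes; N > 1; N /= 2)
      ++Levels;
    unsigned PerLevel;
    if (!IsPairwise)
      PerLevel = 2; // one shuffle + one add
    else if (HasPairwiseAdd)
      PerLevel = 1; // a single pairwise add
    else
      PerLevel = 3; // two shuffles + one add
    return Levels * PerLevel + 1; // + 1 for the final extractelement
  }
};

// Returns true if the client should emit the pairwise form.
bool preferPairwise(const ToyTarget &TTI, unsigned NumLanes) {
  return TTI.getReductionCost(NumLanes, /*IsPairwise=*/true) <
         TTI.getReductionCost(NumLanes, /*IsPairwise=*/false);
}
```

Under these invented costs a target with a pairwise add picks the pairwise form, and a target without one falls back to the split form.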
> 
> An alternative would be to add horizontal reduction intrinsics, but I
> wanted to avoid adding them because there is no need: we already have
> a way to express the reduction semantically (just not a single
> canonical one).
> 
> What do you think?

I think that this works for me, thanks!

+  /// Pairwise:
+  ///  (v0, v1, v2, v3)
+  ///  ((v0+v1), (v2, v3), undef, undef)
+  /// Split:

You're missing the '+' in between v2 and v3?

 -Hal

> 
> Best,
> Arnold
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory