RFC: Modeling horizontal vector reductions
Hal Finkel
hfinkel at anl.gov
Wed Sep 11 15:27:07 PDT 2013
----- Original Message -----
> Hi all,
>
>
> I want to model horizontal vector reductions in the cost model and
> vectorizers. This matters most in the SLP vectorizer, where the
> reduction is more likely to be part of a hot code region, so we want
> to model the cost more precisely and generate efficient code.
>
> A horizontal reduction can be modeled by a sequence of shufflevector
> and add/fadd/reduction operations modeling the reduction tree.
>
> In the loop vectorizer we have chosen to use a tree that splits the
> vector in half at every level and adds the halves:
>
>   (v0, v1, v2, v3)
>     \   \  /   /
>      \   \/   /
>        +    +
>
>   (v0+v2, v1+v3, undef, undef)
>       \      /
>   ((v0+v2) + (v1+v3), undef, undef, undef)
>
> define fastcc float @reduction_cost_float(<4 x float> %rdx) {
>   %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef,
>               <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
>   %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
>   %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef,
>                <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>   %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
>   %r = extractelement <4 x float> %bin.rdx8, i32 0
>   ret float %r
> }
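
As a rough illustration of the split tree, here is a scalar C++ model of what the IR above computes. The name splitReduce is hypothetical and not part of LLVM; the sketch assumes a power-of-two lane count and only mirrors the order of the additions, not the vector instructions themselves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an LLVM API): scalar model of the "split" tree.
// At each level the upper half of the live lanes is added onto the lower
// half, mirroring one shufflevector + fadd pair in the IR.
template <std::size_t N>
float splitReduce(std::array<float, N> v) {
  static_assert(N != 0 && (N & (N - 1)) == 0,
                "lane count must be a power of two");
  for (std::size_t width = N / 2; width >= 1; width /= 2) {
    for (std::size_t i = 0; i < width; ++i)
      v[i] += v[i + width];  // lane i + lane (i + width)
  }
  return v[0];  // corresponds to extractelement ..., i32 0
}
```

For the four-lane case, splitReduce(std::array<float, 4>{1, 2, 3, 4}) evaluates (1+3) + (2+4), the same association as (v0+v2) + (v1+v3) in the diagram.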
>
> However, this form cannot be matched to the shortest sequence of
> instructions on a platform that has pairwise vector fadds (haddps on
> Intel, VPADD on ARM, faddp on AArch64), because we don’t have
> fast-math instruction flags at the SelectionDAG level and therefore
> cannot reassociate the tree. With haddps, the shortest sequence is:
>
> xmm = haddps(xmm, xmm)
> xmm = haddps(xmm, xmm)
> xmm[0] now contains the reduction result (also [1], [2], [3] but for
> this purpose we don’t care)
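
To make the two-instruction sequence concrete, the following sketch models HADDPS in scalar C++. The Vec4 alias and the function names are assumptions made for illustration; only the lane arithmetic follows the SSE3 instruction, which adds adjacent pairs of the first operand into the low two lanes and adjacent pairs of the second operand into the high two lanes.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;  // illustrative stand-in for an xmm register

// Scalar model of HADDPS: adjacent pairs of a fill lanes 0 and 1,
// adjacent pairs of b fill lanes 2 and 3.
Vec4 haddps(const Vec4 &a, const Vec4 &b) {
  return Vec4{a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Two applications with the same register as both operands leave
// (v0+v1)+(v2+v3) in every lane; lane 0 is the reduction result.
float reduceWithHadd(Vec4 x) {
  x = haddps(x, x);
  x = haddps(x, x);
  return x[0];
}
```

Note that the association is (v0+v1) + (v2+v3), which differs from the split tree's (v0+v2) + (v1+v3); without fast-math flags the two cannot be treated as interchangeable for floating point.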
>
> A form that matches to pairwise vector adds is a tree like this:
>
>   (v0, v1, v2, v3)
>     \  /    \  /
>   (v0+v1, v2+v3, undef, undef)
>       \      /
>   ((v0+v1)+(v2+v3), undef, undef, undef)
>
> define fastcc float @pairwise_hadd(<4 x float> %rdx) {
>   %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
>                   <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
>   %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
>                   <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
>   %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
>   %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
>                   <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
>   %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
>                   <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
>   %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
>   %r = extractelement <4 x float> %bin.rdx.1, i32 0
>   ret float %r
> }
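
The pairwise tree can be sketched the same way in scalar C++. The name pairwiseReduce is hypothetical and the sketch assumes a power-of-two lane count; each level adds even lanes to their odd neighbours, which is what the two shuffles (masks <0, 2, ...> and <1, 3, ...>) followed by one fadd express in the IR above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an LLVM API): scalar model of the pairwise tree.
// Each level combines lane 2*i with lane 2*i+1 and packs the result into
// lane i, halving the number of live lanes.
template <std::size_t N>
float pairwiseReduce(std::array<float, N> v) {
  static_assert(N != 0 && (N & (N - 1)) == 0,
                "lane count must be a power of two");
  for (std::size_t live = N; live > 1; live /= 2) {
    // Writing lane i never clobbers a lane that is still to be read,
    // because later iterations only read lanes 2*i and above.
    for (std::size_t i = 0; i < live / 2; ++i)
      v[i] = v[2 * i] + v[2 * i + 1];
  }
  return v[0];  // corresponds to extractelement ..., i32 0
}
```

For four lanes, pairwiseReduce(std::array<float, 4>{1, 2, 3, 4}) evaluates (1+2) + (3+4), matching (v0+v1)+(v2+v3) in the diagram.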
>
>
> Therefore, I would like to model horizontal reductions as either
> version, depending on which is deemed cheaper by the cost model.
>
> It is a bit unfortunate to not have one canonical form but I don’t
> think this justifies adding fast-math flags to isel (which will
> eventually go away).
>
> Attached is a patch that adds code for the cost model to illustrate
> how I plan to implement this. The SLP vectorizer would use the same
> infrastructure to estimate the cost of reductions (of more than two
> elements) and generate code based on it.
>
> At a high level I want to add the following API to
> TargetTransformInfo:
>
>   unsigned getReductionCost(unsigned Opcode, Type *Ty,
>                             bool IsPairwise) const;
>
> Clients can then decide, based on the returned cost, which reduction
> form they want to use.
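
A client-side decision along these lines might look like the sketch below. Everything in it is an assumption for illustration: ToyTargetCosts is a made-up stand-in rather than TargetTransformInfo, its method takes a lane count instead of an opcode and a Type pointer, and the per-level costs are invented, not taken from any real target or from the attached patch. The final extractelement is ignored because both forms need it.

```cpp
#include <cassert>

// Hypothetical toy cost model; field values are supplied by the caller.
struct ToyTargetCosts {
  unsigned ShuffleCost;  // assumed cost of one shufflevector
  unsigned AddCost;      // assumed cost of one vector add
  bool HasPairwiseAdd;   // true if a haddps-like instruction exists

  // Same shape as the proposed query, simplified to a lane count.
  unsigned getReductionCost(unsigned NumElts, bool IsPairwise) const {
    unsigned Levels = 0;  // log2 of the lane count
    for (unsigned N = NumElts; N > 1; N /= 2)
      ++Levels;
    if (IsPairwise) {
      if (HasPairwiseAdd)
        return Levels * AddCost;  // assume one pairwise add per level
      return Levels * (2 * ShuffleCost + AddCost);  // two shuffles + add
    }
    return Levels * (ShuffleCost + AddCost);  // one shuffle + add per level
  }
};

enum class ReductionForm { Split, Pairwise };

// Ask for both costs and pick the cheaper form; ties go to the split form.
ReductionForm chooseReductionForm(const ToyTargetCosts &TTI, unsigned NumElts) {
  unsigned SplitCost = TTI.getReductionCost(NumElts, /*IsPairwise=*/false);
  unsigned PairwiseCost = TTI.getReductionCost(NumElts, /*IsPairwise=*/true);
  return PairwiseCost < SplitCost ? ReductionForm::Pairwise
                                  : ReductionForm::Split;
}
```

With unit costs and a pairwise add available, the toy model prefers the pairwise form for four lanes; without one, the extra shuffle per level makes the split form cheaper.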
>
> An alternative would be to add horizontal reduction intrinsics, but I
> wanted to avoid adding them because there is no need: we already have
> a way to semantically express the reduction (just not a single
> canonical one).
>
> What do you think?

I think that this works for me, thanks!

+ /// Pairwise:
+ /// (v0, v1, v2, v3)
+ /// ((v0+v1), (v2, v3), undef, undef)
+ /// Split:

You're missing the '+' in between v2 and v3?

 -Hal

>
> Best,
> Arnold
>
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory