RFC: Modeling horizontal vector reductions

Wed Sep 11 15:17:55 PDT 2013

Hi all,

I want to model horizontal vector reductions in the cost model and in the vectorizers. This matters most in the SLP vectorizer, where the reduction is more likely to be part of a hot code region, so we want to model its cost more precisely and generate efficient code.

A horizontal reduction can be expressed as a sequence of shufflevector and add/fadd (or other reduction) operations that together form the reduction tree.

In the loop vectorizer we have chosen to use a tree that splits the vector in half at every level and adds the halves:

(v0, v1, v2, v3)
  \   \  /  /
    \  \  /
      +  +
(v0+v2, v1+v3, undef, undef)
   \      /
((v0+v2)+(v1+v3), undef, undef, undef)

define fastcc float @reduction_cost_float(<4 x float> %rdx) {
  %rdx.shuf = shufflevector <4 x float> %rdx, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = fadd <4 x float> %rdx, %rdx.shuf
  %rdx.shuf7 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx8 = fadd <4 x float> %bin.rdx, %rdx.shuf7
  %r = extractelement <4 x float> %bin.rdx8, i32 0
  ret float %r
}
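
To make the evaluation order of this tree explicit, here is a minimal scalar sketch in C++. It is not vectorizer code; the function name reduceSplitHalves is made up for illustration, and it assumes a power-of-two vector width. Each loop level corresponds to one shufflevector + fadd pair in the IR above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of the "split in half" reduction tree (hypothetical helper).
// At every level the upper half of the live lanes is added onto the lower half.
float reduceSplitHalves(std::vector<float> v) {
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0 &&
         "width must be a power of two");
  for (std::size_t half = v.size() / 2; half >= 1; half /= 2) {
    // Corresponds to: shuffle lanes [half, 2*half) down to [0, half), then add.
    for (std::size_t i = 0; i < half; ++i)
      v[i] += v[i + half];
  }
  // Corresponds to: extractelement ..., i32 0
  return v[0];
}
```

For a four-element input this computes (v0+v2) + (v1+v3), the same association as the IR.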

However, this form cannot be matched to the shortest instruction sequence on platforms that have pairwise vector fadds (haddps on Intel, VPADD on ARM, faddp on AArch64). We don’t have fast-math instruction flags at the SelectionDAG level, so we cannot reassociate the tree into the pairwise sequence:

xmm = haddps(xmm, xmm)
xmm = haddps(xmm, xmm)
xmm[0] now contains the reduction result (also [1], [2], [3] but for this purpose we don’t care)

A form that does match pairwise vector adds is a tree like this:

(v0, v1, v2, v3)
 \   /    \  /
(v0+v1, v2+v3, undef, undef)
   \     /
((v0+v1)+(v2+v3), undef, undef, undef)

define fastcc float @pairwise_hadd(<4 x float> %rdx) {
  %rdx.shuf.0.0 = shufflevector <4 x float> %rdx, <4 x float> undef,
        <4 x i32> <i32 0, i32 2 , i32 undef, i32 undef>
  %rdx.shuf.0.1 = shufflevector <4 x float> %rdx, <4 x float> undef,
        <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
  %bin.rdx.0 = fadd <4 x float> %rdx.shuf.0.0, %rdx.shuf.0.1
  %rdx.shuf.1.0 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
        <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  %rdx.shuf.1.1 = shufflevector <4 x float> %bin.rdx.0, <4 x float> undef,
        <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx.1 = fadd <4 x float> %rdx.shuf.1.0, %rdx.shuf.1.1
  %r = extractelement <4 x float> %bin.rdx.1, i32 0
  ret float %r
}
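
The same kind of scalar sketch for the pairwise tree shows why the two forms are not interchangeable without fast-math: they associate the additions differently. The name reducePairwise is hypothetical, and a power-of-two width is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of the pairwise reduction tree (hypothetical helper).
// At every level adjacent lanes (2i, 2i+1) are added and packed into lane i.
float reducePairwise(std::vector<float> v) {
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0 &&
         "width must be a power of two");
  for (std::size_t width = v.size(); width > 1; width /= 2) {
    // Corresponds to: shuffle out even lanes, shuffle out odd lanes, then add.
    // Writing lane i is safe in place because lanes 2i and 2i+1 are >= i.
    for (std::size_t i = 0; i < width / 2; ++i)
      v[i] = v[2 * i] + v[2 * i + 1];
  }
  return v[0];
}
```

For the input (1e8, 1, -1e8, 1) in single precision, this tree computes (1e8+1) + (-1e8+1), which rounds to 0, whereas the split-halves association (v0+v2) + (v1+v3) gives 2. That difference is the reason the backend cannot turn one tree into the other unless reassociation is known to be allowed.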

Therefore, I would like to model horizontal reductions as either version, depending on which one the cost model deems cheaper.

It is a bit unfortunate to not have one canonical form but I don’t think this justifies adding fast-math flags to isel (which will eventually go away).

Attached is a patch that adds code for the cost model to illustrate how I plan to implement this. The SLP vectorizer would use the same infrastructure to estimate the cost of reductions (of more than two elements) and generate code based on it.

At a high level, I want to add the following API to TargetTransformInfo:

 unsigned getReductionCost(unsigned Opcode, Type *Ty, bool IsPairwise) const;

Clients can then decide, based on the returned cost, which reduction form they want to use.
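
As a rough sketch of how a client might use such a query, the following standalone C++ mocks the idea without any LLVM headers. Everything here is an assumption made for illustration: MockTTI, chooseReductionForm, the simplified getReductionCost(NumElts, IsPairwise) signature, and the cost numbers are all invented, and they are not the interface or the costs from the attached patch.

```cpp
#include <cassert>

enum class ReductionForm { SplitHalves, Pairwise };

// Hypothetical stand-in for a target's cost model; the costs are made up.
struct MockTTI {
  bool HasPairwiseAdd; // e.g. a target with a haddps-like instruction

  unsigned getReductionCost(unsigned NumElts, bool IsPairwise) const {
    assert(NumElts > 1 && (NumElts & (NumElts - 1)) == 0 &&
           "power-of-two width expected");
    unsigned Levels = 0; // log2(NumElts) tree levels
    for (unsigned N = NumElts; N > 1; N /= 2)
      ++Levels;
    unsigned PerLevel;
    if (!IsPairwise)
      PerLevel = 2; // one shuffle + one add
    else
      PerLevel = HasPairwiseAdd ? 1 : 3; // one hadd, or two shuffles + one add
    return Levels * PerLevel + 1; // + 1 for the final extractelement
  }
};

// Query both forms and pick the cheaper one; ties keep the split-halves form.
template <typename TTI>
ReductionForm chooseReductionForm(const TTI &T, unsigned NumElts) {
  unsigned SplitCost = T.getReductionCost(NumElts, /*IsPairwise=*/false);
  unsigned PairCost = T.getReductionCost(NumElts, /*IsPairwise=*/true);
  return PairCost < SplitCost ? ReductionForm::Pairwise
                              : ReductionForm::SplitHalves;
}
```

With these invented numbers, a target that has a pairwise add prefers the pairwise form for four elements, while a target without one keeps the split-halves form.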

An alternative would be to add horizontal reduction intrinsics, but I wanted to avoid that because there is no need: we already have a way to express the reduction semantically (just not a single canonical one).

What do you think?

Best,
Arnold

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Costmodel-Add-support-for-horizontal-vector-reductio.patch
Type: application/octet-stream
Size: 21784 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130911/d1fc3559/attachment.obj>