RFC: Modeling horizontal vector reductions

Thu Sep 12 08:28:59 PDT 2013

On Sep 12, 2013, at 10:18 AM, Stephen Canon <scanon at apple.com> wrote:

> On Sep 11, 2013, at 3:17 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> 
>> However, this form cannot be matched to the shortest sequence of instructions on a platform where we have pairwise vector fadds (haddps - intel, VPADD - arm, faddp - aarch64) because we don’t have fast-math instruction flags at the selection dag level and therefore cannot reassociate the tree:
> 
> The general thrust of this work is great.  I do want to point out that using HADDPS is a codesize win only for horizontal reductions; the fastest idiom is actually two shuffles + two adds (reciprocal throughput of 2 cycles vs. 4 for HADDPS).  So we really do want to have a means to generate both of these, not only the pairwise ops.
> 
> – Steve

Yup, depending on the CPU and ISA, either form might be preferable. Notice my careful wording “shortest sequence of instructions” ;). Which form we ultimately choose will be up to the cost model.
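
To make the two forms concrete, here is a minimal sketch that models a 4-lane float vector as a plain Python list. It is not LLVM code and does not use real intrinsics: the helpers hadd, vadd, shuffle, reduce_pairwise, and reduce_shuffle are hypothetical names made up for illustration, and the shuffle masks are just one plausible choice. It only shows how the pairwise form and the shuffle + add form associate the additions differently.

```python
# Illustrative model only: a 4-lane float vector is a Python list of 4 floats.
# All function names here are hypothetical, not LLVM or intrinsic names.

def hadd(a, b):
    """Model of a pairwise add: sums adjacent lanes of a, then of b."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]

def vadd(a, b):
    """Model of an ordinary lane-wise (vertical) vector add."""
    return [x + y for x, y in zip(a, b)]

def shuffle(v, idx):
    """Model of a single-input shuffle: lane i of the result is v[idx[i]]."""
    return [v[i] for i in idx]

def reduce_pairwise(v):
    """Two pairwise adds; lane 0 ends up as (v0 + v1) + (v2 + v3)."""
    t = hadd(v, v)
    return hadd(t, t)[0]

def reduce_shuffle(v):
    """Two shuffles + two vertical adds; lane 0 ends up as (v0 + v2) + (v1 + v3)."""
    t = vadd(v, shuffle(v, [2, 3, 2, 3]))  # lane 0: v0 + v2, lane 1: v1 + v3
    t = vadd(t, shuffle(t, [1, 1, 1, 1]))  # lane 0: (v0 + v2) + (v1 + v3)
    return t[0]

if __name__ == "__main__":
    exact = [1.0, 2.0, 3.0, 4.0]
    print(reduce_pairwise(exact), reduce_shuffle(exact))

    # Inputs chosen so that rounding makes the association order visible.
    tricky = [1e17, 1.0, -1e17, 1.0]
    print(reduce_pairwise(tricky), reduce_shuffle(tricky))
```

Both functions agree on inputs where every intermediate sum is exact, but for [1e17, 1.0, -1e17, 1.0] the pairwise form yields 0.0 and the shuffle form yields 2.0. That difference in association is why moving between the two forms needs permission to reassociate, i.e. the fast-math information mentioned in the quoted text.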