RFC: Modeling horizontal vector reductions
Arnold Schwaighofer
aschwaighofer at apple.com
Wed Sep 11 17:17:41 PDT 2013
On Sep 11, 2013, at 6:24 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
> On Sep 11, 2013, at 5:54 PM, Chandler Carruth <chandlerc at google.com> wrote:
>
>> On Wed, Sep 11, 2013 at 3:49 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>>
>> On Sep 11, 2013, at 5:30 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>
>>>
>>> On Wed, Sep 11, 2013 at 3:17 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>>> Therefore, I would like to model horizontal reductions as either version, depending on which is deemed cheaper by the cost model.
>>>
>>> What would make the first pattern cheaper? I'd like to better understand why we don't just always do the second form…
>>
>> Fewer shuffles (because shuffling a vector with the mask <0, 1, undef, undef> is free), so when you don't have pairwise vector operations the first pattern is preferable.
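For illustration, a rough sketch of what the first (split-in-half) form could look like for a 4-wide fadd reduction of a vector %v = <v0, v1, v2, v3> (the value names are made up for this sketch). The matching shuffle of the other operand would be <0, 1, undef, undef>, an identity on the lanes that matter, so %v is used directly and only two real shuffles remain:

  %high   = shufflevector <4 x float> %v, <4 x float> undef,
                          <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %sum.1  = fadd <4 x float> %v, %high         ; <v0+v2, v1+v3, undef, undef>
  %high.2 = shufflevector <4 x float> %sum.1, <4 x float> undef,
                          <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum.2  = fadd <4 x float> %sum.1, %high.2   ; lane 0 is (v0 + v2) + (v1 + v3)
  %red    = extractelement <4 x float> %sum.2, i32 0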
>>
>> I thought so, but thanks for confirming. I don't trust myself entirely on the cost models here.
>>
>>> It is a bit unfortunate not to have one canonical form, but I don't think this justifies adding fast-math flags to ISel (machinery which will eventually go away).
>>>
>>> I don't really understand this part.
>>>
>>> We have some reason at the IR level to know that we can choose either association and get equivalent results. Why isn't the correct answer to pick a canonical form, but preserve that information long enough to reassociate when it is needed?
>>
>> We want to pick the right form at ISel time - which is too late to reassociate.
>>
>> At some point - before ISel - we have to reassociate, because at ISel time we don't have the fast-math flags that would tell us that it is legal to reassociate.
>>
>> So we could, for example, reassociate in CodeGenPrepare. We would still need an interface to tell us when to do so.
>>
>> Why don't you want to propagate the flags through isel? That really seems like the correct long-term solution: that ISel looks at the pattern, knows that it would be cheaper to use the horizontal instructions and emits the code that way. It doesn't even have to actually *do* the reassociation, it can match the reduction pattern and implicitly re-associate by forming the horizontal instruction pattern. It just needs to know that this is allowed.
>
> Yes sure. You only actually have to reassociate if you don’t have the flags.
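For reference, "the flags" here are the per-instruction fast-math flags in the IR, for example (a made-up one-liner):

  %sum = fadd fast <4 x float> %a, %b

which, as noted above, are not available anymore by the time we are in ISel.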
>
>
>> You mentioned not wanting to thread these flags through because some of the machinery is slowly going away, but I think that "slowly" is going to be a *lot* of time. I think threading the flags through is a much better interim cost (fixed, no design overhead) than having two patterns in the IR for the same vector operation.
I don't think there is an extra cost that I would incur. Even if we don't have unsafe-math flags we want to be able to generate either of those two patterns anyway.
Say somebody really wrote:

  v0 = 7 * A[i];
  v1 = 7 * A[i+1];
  v2 = 7 * A[i+2];
  v3 = 7 * A[i+3];
  r += (v0 + v1) + (v2 + v3);

vs.

  v0 = 7 * A[i];
  v1 = 7 * A[i+1];
  v2 = 7 * A[i+2];
  v3 = 7 * A[i+3];
  r += (v0 + v2) + (v1 + v3);
In this case the order dictates which pattern to use. It is just in the fast-math case that the order does not matter.
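To make the mapping concrete: the first order, (v0 + v1) + (v2 + v3), corresponds to the pairwise pattern, while the second, (v0 + v2) + (v1 + v3), corresponds to the split-in-half pattern sketched further up. A rough sketch of the pairwise form (again with made-up value names, %v = <v0, v1, v2, v3>):

  %evens  = shufflevector <4 x float> %v, <4 x float> undef,
                          <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>   ; <v0, v2, ...>
  %odds   = shufflevector <4 x float> %v, <4 x float> undef,
                          <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>   ; <v1, v3, ...>
  %sum.1  = fadd <4 x float> %evens, %odds     ; <v0+v1, v2+v3, undef, undef>
  %lane0  = shufflevector <4 x float> %sum.1, <4 x float> undef,
                          <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
  %lane1  = shufflevector <4 x float> %sum.1, <4 x float> undef,
                          <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum.2  = fadd <4 x float> %lane0, %lane1    ; lane 0 is (v0 + v1) + (v2 + v3)
  %red    = extractelement <4 x float> %sum.2, i32 0

Note that neither operand shuffle here is an identity, which is why this form costs more real shuffles than the split-in-half form when the target has no pairwise (horizontal) vector operations.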