[PATCH] Make GVN more iterative

Daniel Berlin dberlin at dberlin.org
Tue Aug 12 11:04:51 PDT 2014


On Tue, Aug 12, 2014 at 9:26 AM, Daniel Berlin <dberlin at dberlin.org> wrote:
> On Tue, Aug 12, 2014 at 2:25 AM, James Molloy <james.molloy at arm.com> wrote:
>> Hi Daniel,
>>
>> The 5% speedup in compile time is almost certainly entirely noise. That figure was obtained from running the LNT suite on a Core i7.
>>
>> You're right that in this testcase only one load is missed currently, but that is 50% of the loads in the testcase! The problem is chains of partially redundant loads. The reduced testcase that inspired this (taken from 450.soplex, where we lose 15% of our runtime due to it!) is:
>>
>> double f(int stat, int i, double * restrict * restrict p)
>> {
>>     double x;
>>     switch (stat)
>>     {
>>         case 0:
>>         case 1:
>>             x = p[0][i] - 1;
>>             if (x < 0)
>>                 return x;
>>         case 2:
>>             return 3 - p[0][i];
>>         default:
>>             return 0;
>>     }
>> }
>>
>> You sound like an expert on the GVN code, which I certainly am not. I've worked with PRE heavily before, but that was in a different compiler that did not use SSA, so the algorithm was totally different (and GVN didn't exist). Having looked at the LLVM algorithm, I see that the first (GVN) stage performs PRE of loads, while the second stage performs PRE of non-loads.
>
> Yes.  This is because GVN does not really value-number memory; it asks
> memdep whether the loads look the same (which is not really the same
> thing as value numbering them :P).  As such, PRE of memory is performed
> at the point where GVN is asking memdep questions (and in fact it is
> mostly independent of the rest of GVN, except for one small part).
>
> As such, it will miss cases like the one you describe, where you end up
> with partially redundant loads that interact with partially redundant
> scalars (and other cases too, since the load PRE does not handle
> partial availability).
>
>>
>> This is obviously going to result in missed PRE opportunities. The example above ends up with a chain where you need to spot that the first load is partially redundant (which GVN duly does), then spot that the "sext i32 -> i64" afterwards is partially redundant (which the second-stage PRE duly does), then notice that the next load is now redundant (whoops, we never do load PRE again at this point!)
>>
>> I don't see it as "we're missing just one load"; I see it as "LLVM's implementation of a truly classical compiler optimization is really weak".
>
> I 100% agree with you, but this is a known issue.  Nobody has had the
> wherewithal to actually solve it, and people keep piling on hacks and
> band-aids, which is how GCC got so slow in the end.  There has to be a
> stop-loss point or GVN will just get ever slower.
>
>> What do you think? Should we implement a new algorithm, or make it more iterative?
>>
>
> Realistically, we should change the algorithm. But this is a lot of
> work. (If you are interested, I can help, and there is even a GVN+PRE
> implementation in LLVM's source tree, if you look at the version
> control history of Transforms/Scalar.)
>
> However, you could go halfway for now.
> There is nothing, IIRC, that should stop you from updating the
> availability tables after the scalar PRE, and then just iterating that
> + load PRE (without the rest of GVN).  The load PRE does not really
> depend on anything interesting, last I looked.
>

To be 100% clear:

You will not be able to remove load PRE from its current place.
Because GVN does not value-number memory, it depends heavily on being
able to eliminate loads directly during GVN in order to catch a lot of
things.
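
For instance (a hand-written trivial case, just to show what "directly
during GVN" means here): when memdep reports that a load reads the same
memory as an earlier one with nothing in between, GVN folds the second
load away as part of processing it, without any PRE at all.

/* Hand-written illustration: a fully redundant load that GVN
   eliminates inline while value numbering, by asking memdep whether
   the second load reads the same memory as the first. */
double g(double *restrict q, int i)
{
    double a = q[i];
    double b = q[i];   /* fully redundant; folded into 'a' by GVN */
    return a + b;
}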

You can, however, make it run load PRE after scalar PRE (or integrate
PRE directly into GVN).
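
Roughly, the shape would be something like the sketch below. The helper
names are made up stand-ins, not the actual GVN.cpp entry points (they
are stubbed out so the sketch compiles); the point is just the control
structure: after scalar PRE creates new available values, refresh the
availability information and give load PRE another shot, and repeat
until nothing changes.

#include <stdbool.h>

/* Hypothetical placeholders for the corresponding pieces of GVN. */
static bool runLoadPRE(void)         { return false; } /* placeholder */
static bool runScalarPRE(void)       { return false; } /* placeholder */
static void updateAvailability(void) { }               /* placeholder */

/* Iterate load PRE and scalar PRE to a fixed point, keeping the
   availability tables in sync between rounds, without re-running the
   rest of GVN. */
static void iteratePRE(void)
{
    bool changed = true;
    while (changed) {
        changed = false;
        if (runLoadPRE())    /* may expose new scalar redundancies */
            changed = true;
        if (runScalarPRE())  /* may make further loads redundant   */
            changed = true;
        updateAvailability();
    }
}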

This may not actually be faster (GVN spends most of its time in
memdep in a lot of cases, processing non-local loads), but it will be
cleaner to rewrite in the future, and should get what you want.
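
For reference, on your testcase the end result we would want (sketched
by hand below, not actual compiler output) is that the fall-through
path from case 0/1 reuses the value of p[0][i] instead of reloading it:

/* Hand-written sketch of what iterated load + scalar PRE should
   produce for the testcase above: the fall-through from case 0/1 into
   case 2 reuses the already-loaded p[0][i], while direct entry at
   case 2 still does its own load. */
double f(int stat, int i, double * restrict * restrict p)
{
    double v, x;
    switch (stat)
    {
        case 0:
        case 1:
            v = p[0][i];        /* single load on this path */
            x = v - 1;
            if (x < 0)
                return x;
            return 3 - v;       /* no second load on the fall-through */
        case 2:
            return 3 - p[0][i]; /* load only when entering here directly */
        default:
            return 0;
    }
}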



