[PATCH] Make GVN more iterative

Daniel Berlin dberlin at dberlin.org
Wed Aug 13 09:01:22 PDT 2014


On Tue, Aug 12, 2014 at 10:48 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Daniel Berlin" <dberlin at dberlin.org>
>> To: "James Molloy" <james.molloy at arm.com>
>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
>> Sent: Tuesday, August 12, 2014 11:26:10 AM
>> Subject: Re: [PATCH] Make GVN more iterative
>>
>> On Tue, Aug 12, 2014 at 2:25 AM, James Molloy <james.molloy at arm.com>
>> wrote:
>> > Hi Daniel,
>> >
>> > The 5% speedup in compile time is almost certainly entirely noise.
>> > That figure was obtained by running the LNT suite on a Core i7.
>> >
>> > You're right that in this testcase only one load is missed
>> > currently, but that is 50% of the loads in the testcase! The
>> > problem is chains of partially redundant loads. The reduced
>> > testcase that inspired this (taken from 450.soplex, where we lose
>> > 15% of our runtime due to it!) is:
>> >
>> > double f(int stat, int i, double * restrict * restrict p)
>> > {
>> >     double x;
>> >     switch (stat)
>> >     {
>> >         case 0:
>> >         case 1:
>> >             x = p[0][i] - 1;
>> >             if (x < 0)
>> >                 return x;
>> >             /* fall through */
>> >         case 2:
>> >             return 3 - p[0][i];
>> >         default:
>> >             return 0;
>> >     }
>> > }
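To make the chain concrete, here is a hand-applied sketch of what the iterated load PRE should ultimately achieve on this testcase (`f_pre` is hypothetical, hand-written code, not compiler output): after the transformation, the fall-through path no longer reloads `p[0][i]`.

```c
/* The testcase as posted: on the fall-through path from case 0/1 into
 * case 2, p[0][i] is loaded twice. */
double f(int stat, int i, double *restrict *restrict p)
{
    double x;
    switch (stat)
    {
        case 0:
        case 1:
            x = p[0][i] - 1;
            if (x < 0)
                return x;
            /* falls through: reloads p[0][i] below */
        case 2:
            return 3 - p[0][i];  /* second, partially redundant load */
        default:
            return 0;
    }
}

/* Hand-applied PRE sketch (hypothetical, for illustration): the load
 * is done once per path and the fall-through reuses the value, which
 * is what an iterated GVN/PRE pipeline would be expected to produce. */
double f_pre(int stat, int i, double *restrict *restrict p)
{
    double v;
    switch (stat)
    {
        case 0:
        case 1:
            v = p[0][i];       /* single load on this path */
            if (v - 1 < 0)
                return v - 1;
            return 3 - v;      /* reuses v: no reload */
        case 2:
            return 3 - p[0][i];
        default:
            return 0;
    }
}
```

The two functions compute identical results; only the number of loads on the fall-through path differs.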
>> >
>> > You sound like an expert on the GVN code, which I certainly am not.
>> > I've worked with PRE heavily before, but that was in a different
>> > compiler that did not use SSA so the algorithm was totally
>> > different (and GVN didn't exist). Having looked at the LLVM
>> > algorithm, the first (GVN) stage performs PRE of loads, but the
>> > second stage performs PRE of non-loads.
>>
>> Yes.  This is because GVN does not really value number memory; it
>> uses memdep to try to tell whether the loads look the same (which
>> isn't really the same :P).  As such, PRE of memory is performed at
>> the point where it is asking memdep questions (and, in fact, is
>> mostly independent of the rest of GVN, except for one small part).
>>
>> As such, it will miss things like you describe, where you end up
>> with partially redundant loads that interact with partially
>> redundant scalars (and other cases, since the load PRE does not
>> handle partial availability).
>>
>> >
>> > This is obviously going to result in missing PRE opportunities.
>> > The example above ends up with a chain where you need to spot that
>> > the first load is partially redundant (which GVN duly does), then
>> > spot that the "sext i32 -> i64" afterwards is partially redundant
>> > (which the second-stage PRE duly does), then notice that the next
>> > load is now redundant (whoops, we never do load PRE again at this
>> > point!)
>> >
>> > I don't see it as "we're missing just one load", I see it as
>> > "LLVM's implementation of a truly classical compiler optimization
>> > is really weak".
>>
>> I 100% agree with you, but this is a known issue.  Nobody has had
>> the wherewithal to actually solve it, and people keep piling on
>> hacks and band-aids, which is how GCC got so slow in the end.  There
>> has to be a stop-loss point or GVN will just get ever slower.
>>
>> > What do you think? Should we implement a new algorithm, or make it
>> > more iterative?
>> >
>>
>> Realistically, we should change the algorithm. But this is a lot of
>> work. (If you are interested, I can help; there is even a GVN+PRE
>> implementation in LLVM's source tree, if you look at the version
>> control history of Transforms/Scalar.)
>>
>> However, you could go halfway for now.  There is nothing, IIRC, that
>> should stop you from updating the availability tables after the
>> scalar PRE, and then just iterating that + load PRE (without the
>> rest of GVN).  The load PRE does not really depend on anything
>> interesting, last I looked.
>>
>> > It only re-iterates if PRE actually does something, so I don't
>> > think this should slow the compiler down massively.
>>
>>
>> 3% on an average testcase is a lot to pay to catch essentially 0.1%
>> more loads ;)
>
> This seems like a good example of something to enable in an aggressive optimization mode above what we currently have (a -O4, for example).

I'd agree with this :)

>  I realize you're joking, but I think that "0.1% more loads" is likely unfair; we'd actually need to look at what happened in this test case.

The test case is contrived, of course, so it does 50% there.

But I did actually run it on some benchmarks I had, and, on average,
it removed single-digit numbers of additional loads per file.
There was no measurable speedup, at least on those benchmarks.  But I
know that when we added similar support to GCC, we did find some real
benchmarks that improved, so I have no doubt it'll make some
difference to people :)


> Nevertheless, if hypothetically I could spend 3% more time to get a
> 0.1% speedup, and I can extrapolate that to spending 3x the time to
> get a 10% speedup, I have many users who would gladly pay that price
> (at least for production builds).

> Generally speaking, we guard our compile time very carefully, and that's a good thing, but we should not exclude useful (and easily maintainable)
> optimization improvements from our codebase, even if we don't include them in the default -O3 optimization pipeline.

I'd agree with this.
But I'm not sure it's quite to the maintainable stage yet, IMHO.
Maybe we differ in opinion on that. I would not, for example,
consider iterating any particular pass repeatedly to be a good
solution when we know of better ways to accomplish the same goal (due
to known deficiencies or whatever).
In this case, I haven't asked him to actually rewrite GVN; I've asked
him to iterate only the part actually causing the optimization, and
maybe at some higher opt level.  I expect this to achieve his goal, be
smaller, easier to maintain, and get us a little farther down the road
to sanity.
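In toy form, that narrowed loop is just a fixpoint iteration over the two PREs. Nothing below is LLVM's actual API: `load_pre`, `scalar_pre`, and the `chain` array are stand-ins modeling the soplex testcase, where eliminating each redundancy exposes the next one.

```c
#include <stdbool.h>

/* Toy model of the soplex chain: redundancies alternate load / scalar,
 * and eliminating one exposes the next. */
enum Kind { LOAD, SCALAR };
static enum Kind chain[] = { LOAD, SCALAR, LOAD };
static const int chain_len = 3;
static int pos = 0;  /* index of the first redundancy not yet eliminated */

/* Stand-ins for the real passes; each returns true iff it changed anything. */
static bool load_pre(void) {
    if (pos < chain_len && chain[pos] == LOAD) { ++pos; return true; }
    return false;
}
static bool scalar_pre(void) {
    if (pos < chain_len && chain[pos] == SCALAR) { ++pos; return true; }
    return false;
}

/* The narrowed iteration: re-run only load PRE + scalar PRE (not all of
 * GVN) until neither makes progress.  Returns the number of rounds. */
static int iterate_pre(void) {
    int rounds = 0;
    bool changed = true;
    while (changed) {
        changed = load_pre();
        changed |= scalar_pre();
        ++rounds;
    }
    return rounds;
}
```

The point is only the control structure: each round either makes progress or is the last, so the loop terminates, and its cost is proportional to the number of changes actually made rather than to re-running the whole of GVN each time.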


>
>  -Hal
>
>> It's only because this load is important that anyone cares.
>>
>>
>> > I ran GVN over a clang.bc, and got this:
>> >
>> > Before my patch (release+asserts):
>> >   10.1046 ( 47.6%)   0.6320 ( 51.3%)  10.7367 ( 47.8%)  10.7387 (
>> >   47.7%)  Global Value Numbering
>> >
>> > After my patch (release+asserts):
>> >   10.2886 ( 48.2%)   0.7441 ( 55.9%)  11.0327 ( 48.6%)  11.0000 (
>> >   48.4%)  Global Value Numbering
>> >
>> > That's 0.3s (2.8%) worse, which doesn't sound like it should be a
>> > show-stopper to me!  Perhaps you could run it on your own
>> > testcases?
>> >
>> > Cheers,
>> >
>> > James
>> >
>> > -----Original Message-----
>> > From: Daniel Berlin [mailto:dberlin at dberlin.org]
>> > Sent: 11 August 2014 23:49
>> > To: James Molloy
>> > Cc: Owen Anderson (resistor at mac.com); llvm-commits
>> > Subject: Re: [PATCH] Make GVN more iterative
>> >
>> > On Mon, Aug 11, 2014 at 3:25 PM, Daniel Berlin
>> > <dberlin at dberlin.org> wrote:
>> >> We have cases where GVN is *really slow* (i.e., 29 seconds and
>> >> 80% of compile time).  Iterating it again is likely to make that
>> >> worse.  Happy to give them to you (or test it with this patch).
>> >>
>> >> You say you saw a 5% speedup in compile time, which seems really
>> >> odd.
>> >>
>> >> What exactly did you run it on?
>> >>
>> >> Additionally, it kind of sounds like you are saying all this
>> >> does is remove one additional load for this testcase.  Do you
>> >> have more general performance numbers?
>> >> Iterating all of GVN to eliminate a single load seems like a
>> >> pretty heavy hammer.
>> >
>> > To be clear, LLVM used to have a GVN-PRE implementation, but it
>> > was decided that it wasn't worth the cost for what it got.
>> > What you are doing is effectively re-adding that, but without the
>> > integrated algorithm that had better time bounds.
>> > Thus, this patch, at a glance, seems like the wrong approach.
>> > If we really have a bunch of cases with significant performance
>> > benefits from GVN + PRE, then that would point towards moving back
>> > towards GVN-PRE, which is why the comment says "// Actually, when
>> > this happens, we should just fully integrate PRE into GVN."
>> >
>> >
>> >
>> >
>> >
>> >>
>> >>
>> >>
>> >> On Mon, Aug 11, 2014 at 11:29 AM, James Molloy
>> >> <James.Molloy at arm.com> wrote:
>> >>> Hi all,
>> >>>
>> >>>
>> >>>
>> >>> GVN currently iterates until it can do no more, then performs
>> >>> PRE.  There’s a FIXME saying we should try GVN again if PRE
>> >>> made changes.
>> >>>
>> >>>
>> >>>
>> >>> This patch changes GVN to do this, motivated by the attached
>> >>> testcase
>> >>> (reduced from SPEC) where GVN currently leaves a redundant load.
>> >>>
>> >>>
>> >>>
>> >>> The FIXME mentions memory dependence checking, but it looks to me
>> >>> like the memory dependence updating got implemented after the
>> >>> FIXME
>> >>> was written, so it’s out of date.
>> >>>
>> >>>
>> >>>
>> >>> I’ve tested this for compile time and there are no non-noise
>> >>> regressions (in fact, the geomean was a 5% speedup,
>> >>> interestingly).
>> >>>
>> >>>
>> >>>
>> >>> What do you think?
>> >>>
>> >>>
>> >>>
>> >>> Cheers,
>> >>>
>> >>>
>> >>>
>> >>> James
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> llvm-commits mailing list
>> >>> llvm-commits at cs.uiuc.edu
>> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> >>>
>> >
>> >
>> >
>> >
>>
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory



