[PATCH] Make GVN more iterative

James Molloy James.Molloy at arm.com
Fri Sep 5 07:08:03 PDT 2014


Hi Daniel,

I’ve been bashing my head on this for a while now without any progress,
and it’s time to ask for help :(

My approach is to handle LoadInsts in PerformPRE, essentially just handing
off to performNonLocalLoad to deal with fully and partially redundant
loads.

This function calls markInstructionForDeletion(), which is fairly
self-explanatory, so PerformPRE needs extra code to actually delete the
marked instructions afterwards. Fine. However, I'm running into assertion
failures during verifyRemoved().

From my reading of the code, there are two data structures: ValueNumbering
(mapping Value -> Number) and LeaderTable (mapping Number -> Value).
markInstructionForDeletion() removes the instruction from ValueNumbering,
but does NOT remove it (if it exists) from LeaderTable.

So my problem is that after calling out to performNonLocalLoad(), I have
one or more instructions marked for deletion that may still be in
LeaderTable, but their entries in ValueNumbering have already been erased,
so I no longer have a key with which to look them up and remove them from
LeaderTable!

So really I think I'm missing some contract or invariant here: when should
a Value be in each of the two tables? When is a value expected to have
been removed from both? Also, something I haven't yet spotted (probably
due to the tunnel vision of staring at the same piece of code for a week):
where does the normal GVN pass remove loads and other instructions that
are markedForDeletion() from the leader table?

Cheers,

James

On 13/08/2014 17:35, "Hal Finkel" <hfinkel at anl.gov> wrote:

>----- Original Message -----
>> From: "Daniel Berlin" <dberlin at dberlin.org>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "James Molloy"
>><james.molloy at arm.com>
>> Sent: Wednesday, August 13, 2014 11:01:22 AM
>> Subject: Re: [PATCH] Make GVN more iterative
>>
>> On Tue, Aug 12, 2014 at 10:48 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> > ----- Original Message -----
>> >> From: "Daniel Berlin" <dberlin at dberlin.org>
>> >> To: "James Molloy" <james.molloy at arm.com>
>> >> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
>> >> Sent: Tuesday, August 12, 2014 11:26:10 AM
>> >> Subject: Re: [PATCH] Make GVN more iterative
>> >>
>> >> On Tue, Aug 12, 2014 at 2:25 AM, James Molloy
>> >> <james.molloy at arm.com>
>> >> wrote:
>> >> > Hi Daniel,
>> >> >
>> >> > The 5% speedup in compile time is almost certainly entirely noise.
>> >> > That figure was obtained by running the LNT suite on a Core i7.
>> >> >
>> >> > You're right that in this testcase only one load is missed
>> >> > currently, but that is 50% of the loads in the testcase! The
>> >> > problem is chains of partially redundant loads. The reduced
>> >> > testcase that inspired this (taken from 450.soplex, where we
>> >> > lose
>> >> > 15% of our runtime due to it!) is:
>> >> >
>> >> > double f(int stat, int i, double * restrict * restrict p)
>> >> > {
>> >> >     double x;
>> >> >     switch (stat)
>> >> >     {
>> >> >         case 0:
>> >> >         case 1:
>> >> >             x = p[0][i] - 1;
>> >> >             if (x < 0)
>> >> >                 return x;
>> >> >         case 2:
>> >> >             return 3 - p[0][i];
>> >> >         default:
>> >> >             return 0;
>> >> >     }
>> >> > }
>> >> >
>> >> > You sound like an expert on the GVN code, which I certainly am
>> >> > not.
>> >> > I've worked with PRE heavily before, but that was in a different
>> >> > compiler that did not use SSA so the algorithm was totally
>> >> > different (and GVN didn't exist). Having looked at the LLVM
>> >> > algorithm, the first (GVN) stage performs PRE of loads, but the
>> >> > second stage performs PRE of non-loads.
>> >>
>> >> Yes.  This is because GVN does not really value-number memory; it
>> >> uses memdep to try to tell whether the loads look the same (which
>> >> isn't really the same :P).  As such, PRE of memory is performed at
>> >> the point where it is asking memdep questions (and in fact, it is
>> >> mostly independent of the rest of GVN, except for one small part).
>> >>
>> >> As such, it will miss things, like you describe, where you end up
>> >> with
>> >> partially redundant loads that interact with partially redundant
>> >> scalars (and other cases, since the Load PRE does not handle
>> >> partial
>> >> availability)
>> >>
>> >> >
>> >> > This is obviously going to result in missing PRE opportunities.
>> >> > The
>> >> > example above ends up with a chain where you need to spot the
>> >> > first load is partially redundant (which GVN duly does), then
>> >> > spot
>> >> > that the "sext i32 -> i64" afterwards is partially redundant
>> >> > (which the second stage PRE duly does), then notice the next
>> >> > load
>> >> > is now redundant (woops, we never do load PRE again at this
>> >> > point!)
>> >> >
>> >> > I don't see it as "we're missing just one load", I see it as
>> >> > "LLVM's implementation of a truly classical compiler
>> >> > optimization
>> >> > is really weak".
>> >>
>> >> I 100% agree with you, but this is a known issue.  Nobody has had
>> >> the wherewithal to actually solve it, and people keep piling on
>> >> hacks and bandaids, which is how GCC got so slow in the end.  There
>> >> has to be a stop-loss point or GVN will just get ever slower.
>> >>
>> >> > What do you think? Should we implement a new algorithm, or make
>> >> > it
>> >> > more iterative?
>> >> >
>> >>
>> >> Realistically, we should change the algorithm. But this is a lot of
>> >> work. (If you are interested, i can help, and there is even a
>> >> GVN+PRE implementation in LLVM's source tree, if you look at the
>> >> version control history of Transforms/Scalar.)
>> >>
>> >> However,  you could go halfway for now.
>> >> There is nothing, IIRC, that should stop you from updating the
>> >> availability tables after the scalar PRE, and then just iterating
>> >> that
>> >> + load PRE (without the rest of GVN).  The load PRE does not
>> >> really
>> >> depend on anything interesting, last i looked.
>> >>
>> >> > It only re-iterates if PRE actually does something, so I don't
>> >> > think this should slow the compiler down massively.
>> >>
>> >>
>> >> 3% on an average testcase is a lot to catch essentially 0.1% more
>> >> loads ;)
>> >
>> > This seems like a good example of something to enable in an
>> > aggressive optimization mode above what we currently have (a -O4,
>> > for example).
>>
>> I'd agree with this :)
>>
>> >  I realize you're joking, but, I think that "0.1% more loads" is
>> >  likely unfair, we'd actually need to look at what happened in
>> >  this test case.
>>
>> The test case is contrived, of course, so it does 50% there.
>>
>> But I did actually run it on some benchmarks i had, and, on average,
>> it removed single digit more loads per file.
>> There was no measurable speedup, at least on those benchmarks.  But I
>> know, when we added similar support to GCC, we did find some real
>> benchmarks that improved, so i have no doubt it'll make some
>> difference to people :)
>>
>>
>> > Nevertheless, if hypothetically I could spend 3% more time to get a
>> > 0.1% speedup,  and I can extrapolate that to spending 3x the time
>> > to get a 10% speedup, I have many users who would gladly pay that
>> > price (at least for production builds).
>>
>> > Generally speaking, we guard our compile time very carefully, and
>> > that's a good thing, but we should not exclude useful (and easily
>> > maintainable)
>> > optimization improvements from our codebase, even if we don't
>> > include them in the default -O3 optimization pipeline.
>>
>> I'd agree with this.
>> But, i'm not sure it's quite to the maintainable stage yet, IMHO.
>> But maybe we differ in opinion on that. I would not, for example,
>> consider iterating any particular pass repeatedly to be a good
>> solution when we know of better ways to accomplish the same goal (due
>> to known deficiencies or whatever).
>> In this case, i haven't asked him to actually rewrite GVN, i've asked
>> him to iterate only the part actually causing the optimization, and
>> maybe at some higher opt level.  I expect this to achieve his goal,
>> be
>> smaller, better to maintain, and get us a little farther down the
>> road
>> to sanity.
>
>Sounds good to me :-)
>
>Thanks again,
>Hal
>
>>
>>
>> >
>> >  -Hal
>> >
>> >> It's only because this load is important that anyone cares.
>> >>
>> >>
>> >> > I ran GVN over a clang.bc, and got this:
>> >> >
>> >> > Before my patch (release+asserts):
>> >> >   10.1046 ( 47.6%)   0.6320 ( 51.3%)  10.7367 ( 47.8%)  10.7387 ( 47.7%)  Global Value Numbering
>> >> >
>> >> > After my patch (release+asserts):
>> >> >   10.2886 ( 48.2%)   0.7441 ( 55.9%)  11.0327 ( 48.6%)  11.0000 ( 48.4%)  Global Value Numbering
>> >> >
>> >> > That's 0.3s (2.8%) worse, which doesn't sound like a show-stopper
>> >> > to me! Perhaps you could run it on your own testcases?
>> >> >
>> >> > Cheers,
>> >> >
>> >> > James
>> >> >
>> >> > -----Original Message-----
>> >> > From: Daniel Berlin [mailto:dberlin at dberlin.org]
>> >> > Sent: 11 August 2014 23:49
>> >> > To: James Molloy
>> >> > Cc: Owen Anderson (resistor at mac.com); llvm-commits
>> >> > Subject: Re: [PATCH] Make GVN more iterative
>> >> >
>> >> > On Mon, Aug 11, 2014 at 3:25 PM, Daniel Berlin
>> >> > <dberlin at dberlin.org> wrote:
>> >> >> We have cases where GVN is *really slow* (IE 29 seconds and 80%
>> >> >> of
>> >> >> compile time). Iterating it again is likely to make that worse.
>> >> >>  Happy
>> >> >> to give them to you (or test it with this patch)
>> >> >>
>> >> >> You say you saw a 5% speedup in compile time, which seems
>> >> >> really
>> >> >> odd.
>> >> >>
>> >> >> What exactly did you run it on?
>> >> >>
>> >> >> Additionally, it kind of sounds like you are saying all this
>> >> >> does
>> >> >> is
>> >> >> remove 1 additional load for this testcase. Do you have more
>> >> >> general
>> >> >> performance numbers?
>> >> >> Iterating all of GVN to eliminate a single load seems like a
>> >> >> pretty
>> >> >> heavy hammer.
>> >> >
>> >> > To be clear, LLVM used to have a GVNPRE implementation, but it
>> >> > was
>> >> > decided this wasn't worth the cost of what it got.
>> >> > What you are doing is effectively re-adding that, but without an
>> >> > integrated algorithm that was O(better time bounds).
>> >> > Thus, this patch, at a glance, seems like the wrong approach.
>> >> > If we really have a bunch of cases with significant performance
>> >> > benefits from GVN + PRE, then that would point towards moving
>> >> > back
>> >> > towards GVN-PRE, which is why the comment says "   // Actually,
>> >> > when
>> >> > this happens, we should just fully integrate PRE into GVN."
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Mon, Aug 11, 2014 at 11:29 AM, James Molloy
>> >> >> <James.Molloy at arm.com> wrote:
>> >> >>> Hi all,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> GVN currently iterates until it can do no more, then performs
>> >> >>> PRE.
>> >> >>> There’s a FIXME, saying we should try GVN again if PRE made
>> >> >>> changes.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> This patch changes GVN to do this, motivated by the attached
>> >> >>> testcase
>> >> >>> (reduced from SPEC) where GVN currently leaves a redundant
>> >> >>> load.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> The FIXME mentions memory dependence checking, but it looks to
>> >> >>> me
>> >> >>> like the memory dependence updating got implemented after the
>> >> >>> FIXME
>> >> >>> was written, so it’s out of date.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> I’ve tested this for compile time and there are no non-noise
>> >> >>> regressions (in fact, the geomean was a 5% speedup,
>> >> >>> interestingly).
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> What do you think?
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Cheers,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> James
>> >> >>>
>> >> >>>
>> >> >>> -- IMPORTANT NOTICE: The contents of this email and any
>> >> >>> attachments
>> >> >>> are confidential and may also be privileged. If you are not
>> >> >>> the
>> >> >>> intended recipient, please notify the sender immediately and
>> >> >>> do
>> >> >>> not
>> >> >>> disclose the contents to any other person, use it for any
>> >> >>> purpose, or
>> >> >>> store or copy the information in any medium. Thank you.
>> >> >>>
>> >> >>> ARM Limited, Registered office 110 Fulbourn Road, Cambridge
>> >> >>> CB1
>> >> >>> 9NJ,
>> >> >>> Registered in England & Wales, Company No: 2557590 ARM
>> >> >>> Holdings
>> >> >>> plc,
>> >> >>> Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
>> >> >>> Registered in
>> >> >>> England & Wales, Company No: 2548782
>> >> >>>
>> >> >>> _______________________________________________
>> >> >>> llvm-commits mailing list
>> >> >>> llvm-commits at cs.uiuc.edu
>> >> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> >> >>>
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >
>> > --
>> > Hal Finkel
>> > Assistant Computational Scientist
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>>
>
>--
>Hal Finkel
>Assistant Computational Scientist
>Leadership Computing Facility
>Argonne National Laboratory
>

