[PATCH] Make GVN more iterative

James Molloy james at jamesmolloy.co.uk
Fri Sep 12 05:33:12 PDT 2014


Hi Daniel,

I'm a bit stumped here. Could I please ping you for a response to my
previous email?

Cheers,

James

On 5 September 2014 15:08, James Molloy <James.Molloy at arm.com> wrote:

> Hi Daniel,
>
> I’ve been bashing my head on this for a while now without any progress,
> and it’s time to ask for help :(
>
> My approach is to handle LoadInsts in PerformPRE, essentially just handing
> off to performNonLocalLoad to deal with fully and partially redundant
> loads.
>
> This function calls markInstructionForDeletion(), which is fairly obvious
> and needs extra code to delete the marked instructions. Fine. However, I’m
> running into assertion failures during verifyRemoved().
>
> From my reading of the code, there are two data structures: ValueNumbering
> (mapping Value -> Number) and LeaderTable (mapping Number -> Value).
> markInstructionForDeletion() removes the instruction from ValueNumbering,
> but does NOT remove it (if it exists) from LeaderTable.
>
> So my problem is that after calling out to performNonLocalLoad(), I have
> one or more instructions marked for deletion that may still be in
> LeaderTable, and because their ValueNumbering entries have been erased I
> no longer have a key with which to look them up and remove them from
> LeaderTable!
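A minimal sketch of the asymmetry being described (hypothetical, simplified stand-ins: the real GVN tables map Value* to numbers and keep per-block leader lists, not strings):

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-ins for GVN's two tables (names are illustrative).
std::map<std::string, int> valueNumbering;   // Value -> Number
std::multimap<int, std::string> leaderTable; // Number -> Value

void addInstruction(const std::string &v, int n) {
  valueNumbering[v] = n;
  leaderTable.insert({n, v});
}

// Mimics the described markInstructionForDeletion behavior:
// only the Value -> Number map is touched, so the leader table
// keeps a dangling entry, and the number that would let us find
// it is no longer recoverable from the value.
void markForDeletion(const std::string &v) {
  valueNumbering.erase(v);
}
```

Running `addInstruction("load1", 7)` followed by `markForDeletion("load1")` leaves `leaderTable` holding a stale entry for number 7 with no key left in `valueNumbering` to find it by, which is exactly the lookup problem described above.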
>
> So really I think I'm missing some contract or invariant here: when
> should a Value be in each of the two tables? When is a value expected to
> have been removed from both tables? Also (something I probably haven't
> spotted, due to the tunnel vision of staring at the same piece of code
> for a week): where are loads and other instructions that are marked for
> deletion removed from the leader table in the normal GVN pass?
>
> Cheers,
>
> James
>
> On 13/08/2014 17:35, "Hal Finkel" <hfinkel at anl.gov> wrote:
>
> >----- Original Message -----
> >> From: "Daniel Berlin" <dberlin at dberlin.org>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "James Molloy"
> >><james.molloy at arm.com>
> >> Sent: Wednesday, August 13, 2014 11:01:22 AM
> >> Subject: Re: [PATCH] Make GVN more iterative
> >>
> >> On Tue, Aug 12, 2014 at 10:48 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> > ----- Original Message -----
> >> >> From: "Daniel Berlin" <dberlin at dberlin.org>
> >> >> To: "James Molloy" <james.molloy at arm.com>
> >> >> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
> >> >> Sent: Tuesday, August 12, 2014 11:26:10 AM
> >> >> Subject: Re: [PATCH] Make GVN more iterative
> >> >>
> >> >> On Tue, Aug 12, 2014 at 2:25 AM, James Molloy
> >> >> <james.molloy at arm.com>
> >> >> wrote:
> >> >> > Hi Daniel,
> >> >> >
> >> >> > The 5% speedup in compile time is almost certainly entirely
> >> >> > noise. That figure was obtained by running the LNT suite on a
> >> >> > Core i7.
> >> >> >
> >> >> > You're right that in this testcase only one load is missed
> >> >> > currently, but that is 50% of the loads in the testcase! The
> >> >> > problem is chains of partially redundant loads. The reduced
> >> >> > testcase that inspired this (taken from 450.soplex, where we
> >> >> > lose
> >> >> > 15% of our runtime due to it!) is:
> >> >> >
> >> >> > double f(int stat, int i, double * restrict * restrict p)
> >> >> > {
> >> >> >     double x;
> >> >> >     switch (stat)
> >> >> >     {
> >> >> >         case 0:
> >> >> >         case 1:
> >> >> >             x = p[0][i] - 1;
> >> >> >             if (x < 0)
> >> >> >                 return x;
> >> >> >         case 2:
> >> >> >             return 3 - p[0][i];
> >> >> >         default:
> >> >> >             return 0;
> >> >> >     }
> >> >> > }
> >> >> >
> >> >> > You sound like an expert on the GVN code, which I certainly am
> >> >> > not.
> >> >> > I've worked with PRE heavily before, but that was in a different
> >> >> > compiler that did not use SSA so the algorithm was totally
> >> >> > different (and GVN didn't exist). Having looked at the LLVM
> >> >> > algorithm, the first (GVN) stage performs PRE of loads, but the
> >> >> > second stage performs PRE of non-loads.
> >> >>
> >> >> Yes.  This is because GVN does not really value number memory, it
> >> >> uses
> >> >> memdep to try to get it to tell whether the loads look the same
> >> >> (which
> >> >> isn't really the same :P).  As such, PRE of memory is performed at
> >> >> the
> >> >> point where it is asking memdep questions (and in fact, is mostly
> >> >> independent of the rest of GVN, except for one small part).
> >> >>
> >> >> As such, it will miss things, like you describe, where you end up
> >> >> with
> >> >> partially redundant loads that interact with partially redundant
> >> >> scalars (and other cases, since the Load PRE does not handle
> >> >> partial
> >> >> availability)
> >> >>
> >> >> >
> >> >> > This is obviously going to result in missed PRE opportunities.
> >> >> > The example above ends up with a chain where you need to spot
> >> >> > that the first load is partially redundant (which GVN duly
> >> >> > does), then spot that the "sext i32 -> i64" afterwards is
> >> >> > partially redundant (which the second-stage PRE duly does),
> >> >> > then notice that the next load is now redundant (whoops, we
> >> >> > never do load PRE again at this point!)
> >> >> >
> >> >> > I don't see it as "we're missing just one load", I see it as
> >> >> > "LLVM's implementation of a truly classical compiler
> >> >> > optimization
> >> >> > is really weak".
> >> >>
> >> >> I 100% agree with you, but this is a known issue.  Nobody has had
> >> >> the wherewithal to actually solve it, and people keep piling on
> >> >> hacks and band-aids, which is how GCC got so slow in the end.
> >> >> There has to be a stop-loss point or GVN will just get ever
> >> >> slower.
> >> >>
> >> >> > What do you think? Should we implement a new algorithm, or make
> >> >> > it
> >> >> > more iterative?
> >> >> >
> >> >>
> >> >> Realistically, we should change the algorithm, but this is a lot
> >> >> of work. (If you are interested, I can help; there is even a
> >> >> GVN+PRE implementation in LLVM's source tree, if you look at the
> >> >> version control history of Transforms/Scalar.)
> >> >>
> >> >> However, you could go halfway for now. There is nothing, IIRC,
> >> >> that should stop you from updating the availability tables after
> >> >> the scalar PRE, and then just iterating that + load PRE (without
> >> >> the rest of GVN).  The load PRE does not really depend on
> >> >> anything interesting, last I looked.
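This halfway suggestion could be sketched as a driver loop along these lines (a purely hypothetical structure; the function names are illustrative, not the actual GVN API):

```cpp
#include <functional>

// Hypothetical sketch: iterate load PRE plus an availability-table
// refresh to a fixed point, without re-running the rest of GVN each
// round. A round cap guards against pathological non-convergence.
bool iterateLoadPRE(const std::function<bool()> &performLoadPRE,
                    const std::function<void()> &updateAvailability,
                    int maxRounds = 10) {
  bool changedAny = false;
  for (int round = 0; round < maxRounds; ++round) {
    // performLoadPRE returns true if it eliminated or hoisted a load.
    if (!performLoadPRE())
      break;
    changedAny = true;
    updateAvailability(); // refresh leader/availability tables only
  }
  return changedAny;
}
```

The point of this shape is that only the cheap load-PRE step repeats; the full GVN fixed-point iteration runs once, which is why it should cost far less than re-running the whole pass.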
> >> >>
> >> >> > It only re-iterates if PRE actually does something, so I don't
> >> >> > think this should slow the compiler down massively.
> >> >>
> >> >>
> >> >> 3% on an average testcase is a lot to catch essentially 0.1% more
> >> >> loads ;)
> >> >
> >> > This seems like a good example of something to enable in an
> >> > aggressive optimization mode above what we currently have (a -O4,
> >> > for example).
> >>
> >> I'd agree with this :)
> >>
> >> >  I realize you're joking, but, I think that "0.1% more loads" is
> >> >  likely unfair, we'd actually need to look at what happened in
> >> >  this test case.
> >>
> >> The test case is contrived, of course, so it does 50% there.
> >>
> >> But I did actually run it on some benchmarks I had, and, on average,
> >> it removed a single-digit number of additional loads per file. There
> >> was no measurable speedup, at least on those benchmarks. But I know
> >> that when we added similar support to GCC we did find some real
> >> benchmarks that improved, so I have no doubt it'll make some
> >> difference to people :)
> >>
> >>
> >> > Nevertheless, if hypothetically I could spend 3% more time to get a
> >> > 0.1% speedup,  and I can extrapolate that to spending 3x the time
> >> > to get a 10% speedup, I have many users who would gladly pay that
> >> > price (at least for production builds).
> >>
> >> > Generally speaking, we guard our compile time very carefully, and
> >> > that's a good thing, but we should not exclude useful (and easily
> >> > maintainable)
> >> > optimization improvements from our codebase, even if we don't
> >> > include them in the default -O3 optimization pipeline.
> >>
> >> I'd agree with this.
> >> But I'm not sure it's quite to the maintainable stage yet, IMHO.
> >> Maybe we differ in opinion on that. I would not, for example,
> >> consider iterating any particular pass repeatedly to be a good
> >> solution when we know of better ways to accomplish the same goal
> >> (due to known deficiencies or whatever). In this case, I haven't
> >> asked him to actually rewrite GVN; I've asked him to iterate only
> >> the part actually causing the optimization, and maybe at some higher
> >> opt level. I expect this to achieve his goal, be smaller, be easier
> >> to maintain, and get us a little farther down the road to sanity.
> >
> >Sounds good to me :-)
> >
> >Thanks again,
> >Hal
> >
> >>
> >>
> >> >
> >> >  -Hal
> >> >
> >> >> It's only because this load is important that anyone cares.
> >> >>
> >> >>
> >> >> > I ran GVN over a clang.bc, and got this:
> >> >> >
> >> >> > Before my patch (release+asserts):
> >> >> >   10.1046 ( 47.6%)   0.6320 ( 51.3%)  10.7367 ( 47.8%)  10.7387
> >> >> >   (
> >> >> >   47.7%)  Global Value Numbering
> >> >> >
> >> >> > After my patch (release+asserts):
> >> >> >   10.2886 ( 48.2%)   0.7441 ( 55.9%)  11.0327 ( 48.6%)  11.0000
> >> >> >   (
> >> >> >   48.4%)  Global Value Numbering
> >> >> >
> >> >> > That is 0.3s (2.8%) worse, which doesn't sound like a
> >> >> > show-stopper to me! Perhaps you could run it on your own
> >> >> > testcases?
> >> >> >
> >> >> > Cheers,
> >> >> >
> >> >> > James
> >> >> >
> >> >> > -----Original Message-----
> >> >> > From: Daniel Berlin [mailto:dberlin at dberlin.org]
> >> >> > Sent: 11 August 2014 23:49
> >> >> > To: James Molloy
> >> >> > Cc: Owen Anderson (resistor at mac.com); llvm-commits
> >> >> > Subject: Re: [PATCH] Make GVN more iterative
> >> >> >
> >> >> > On Mon, Aug 11, 2014 at 3:25 PM, Daniel Berlin
> >> >> > <dberlin at dberlin.org> wrote:
> >> >> >> We have cases where GVN is *really slow* (i.e. 29 seconds and
> >> >> >> 80% of compile time). Iterating it again is likely to make
> >> >> >> that worse. Happy to give them to you (or to test it with this
> >> >> >> patch).
> >> >> >>
> >> >> >> You say you saw a 5% speedup in compile time, which seems
> >> >> >> really
> >> >> >> odd.
> >> >> >>
> >> >> >> What exactly did you run it on?
> >> >> >>
> >> >> >> Additionally, it kind of sounds like you are saying all this
> >> >> >> does
> >> >> >> is
> >> >> >> remove 1 additional load for this testcase. Do you have more
> >> >> >> general
> >> >> >> performance numbers?
> >> >> >> Iterating all of GVN to eliminate a single load seems like a
> >> >> >> pretty
> >> >> >> heavy hammer.
> >> >> >
> >> >> > To be clear, LLVM used to have a GVNPRE implementation, but it
> >> >> > was
> >> >> > decided this wasn't worth the cost of what it got.
> >> >> > What you are doing is effectively re-adding that, but without an
> >> >> > integrated algorithm that was O(better time bounds).
> >> >> > Thus, this patch, at a glance, seems like the wrong approach.
> >> >> > If we really have a bunch of cases with significant performance
> >> >> > benefits from GVN + PRE, then that would point towards moving
> >> >> > back
> >> >> > towards GVN-PRE, which is why the comment says "   // Actually,
> >> >> > when
> >> >> > this happens, we should just fully integrate PRE into GVN."
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Aug 11, 2014 at 11:29 AM, James Molloy
> >> >> >> <James.Molloy at arm.com> wrote:
> >> >> >>> Hi all,
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> GVN currently iterates until it can do no more, then performs
> >> >> >>> PRE. There's a FIXME saying we should try GVN again if PRE
> >> >> >>> made changes.
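As described, the patch amounts to wrapping the existing two phases in an outer loop. A hypothetical sketch of that control flow (not the actual GVN::runOnFunction code):

```cpp
#include <functional>

// Hypothetical sketch of the proposed driver: re-run the main GVN
// iteration whenever the subsequent PRE phase made a change.
bool runGVNThenPRE(const std::function<bool()> &runGVNToFixedPoint,
                   const std::function<bool()> &runPRE) {
  bool changedAny = false;
  bool preChanged = true;
  while (preChanged) {
    changedAny |= runGVNToFixedPoint(); // existing behavior: GVN to fixpoint
    preChanged = runPRE();              // then PRE; loop again if it fired
    changedAny |= preChanged;
  }
  return changedAny;
}
```

Since the loop only repeats when PRE actually changed something, compile time on code with no PRE opportunities should be unaffected; the cost shows up only where PRE keeps exposing new redundancies.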
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> This patch changes GVN to do this, motivated by the attached
> >> >> >>> testcase
> >> >> >>> (reduced from SPEC) where GVN currently leaves a redundant
> >> >> >>> load.
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> The FIXME mentions memory dependence checking, but it looks to
> >> >> >>> me
> >> >> >>> like the memory dependence updating got implemented after the
> >> >> >>> FIXME
> >> >> >>> was written, so it’s out of date.
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> I’ve tested this for compile time and there are no non-noise
> >> >> >>> regressions (in fact, the geomean was a 5% speedup,
> >> >> >>> interestingly).
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> What do you think?
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> Cheers,
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> James
> >> >> >>>
> >> >> >>>
> >> >> >>> -- IMPORTANT NOTICE: The contents of this email and any
> >> >> >>> attachments
> >> >> >>> are confidential and may also be privileged. If you are not
> >> >> >>> the
> >> >> >>> intended recipient, please notify the sender immediately and
> >> >> >>> do
> >> >> >>> not
> >> >> >>> disclose the contents to any other person, use it for any
> >> >> >>> purpose, or
> >> >> >>> store or copy the information in any medium. Thank you.
> >> >> >>>
> >> >> >>> ARM Limited, Registered office 110 Fulbourn Road, Cambridge
> >> >> >>> CB1
> >> >> >>> 9NJ,
> >> >> >>> Registered in England & Wales, Company No: 2557590 ARM
> >> >> >>> Holdings
> >> >> >>> plc,
> >> >> >>> Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> >> >> >>> Registered in
> >> >> >>> England & Wales, Company No: 2548782
> >> >> >>>
> >> >> >>> _______________________________________________
> >> >> >>> llvm-commits mailing list
> >> >> >>> llvm-commits at cs.uiuc.edu
> >> >> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >> >> >>>
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >>
> >> >
> >> > --
> >> > Hal Finkel
> >> > Assistant Computational Scientist
> >> > Leadership Computing Facility
> >> > Argonne National Laboratory
> >>
> >
> >--
> >Hal Finkel
> >Assistant Computational Scientist
> >Leadership Computing Facility
> >Argonne National Laboratory
> >
>
>
>
