[LLVMdev] Separating loop nests based on profile information?

Philip Reames listmail at philipreames.com
Tue Jan 20 11:32:53 PST 2015


On 01/12/2015 01:11 PM, Daniel Berlin wrote:
>
> On Mon, Jan 12, 2015 at 1:08 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
>
>     On 01/11/2015 10:35 PM, Daniel Berlin wrote:
>>
>>         Looking at what LLVM does, the failing on the PRE side is
>>         that our PRE/GVN models are not strong enough to handle this.
>>         I'm not sure at all why we think anything else is necessary.
>>
>>
>>     By this i mean we don't actually perform real PRE for loads. It's
>>     a known deficiency and one that does not require any duplicate
>>     heuristics to solve.
>>     Right now we perform some availability-based removal that happens
>>     to be willing to do an insertion here or there to try to move a
>>     load in some cases (it will only do a single insertion).  It's
>>     also not based on real availability, but on finding the nearest
>>     dependencies of loads and squinting at them (which is not even
>>     really the same as finding where a load stops being available).
>>
>>     It's also not real PRE, just some code that is willing to insert
>>     one load to move a load.
>     Er, I think we have a terminology problem here. From what I can
>     gather, our current implementation *is* an implementation of
>     classic PRE.
>
>
> Scalar yes, load PRE, no.
Can you define specifically what you mean by "load PRE"?  To me, load 
PRE is simply PRE on loads.  It sounds like you're saying something 
slightly different.
>
> What load PRE does is not the same as availability.
> It finds the nearest dependencies (be they load/load, store/load, or 
> clobber) for a load walking backwards.
> That is not availability.
>
> Availability is a forward problem, not a backwards one :)
>
> It tries to transform the data it gets from memdep into something like 
> availability info, but you *will not get the same results* in some cases.
I would argue that the current implementation is calculating 
availability.  It's simply doing so in a conservative manner, and thus 
gives up some quality in the results achieved.  The notion of 
availability (i.e. "there's a load available along all paths that reach 
this basic block") is the same.

Just to make sure we're on the same page, the reason the backwards 
calculation is potentially less useful than the forward one is that we 
unconditionally stop when encountering a def.  There might have been an 
earlier def along that path, and thus possibly a cheaper location to put 
a load.  Is that the only case?  Or is there something else I haven't 
seen yet?
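For what it's worth, here's a rough sketch of what I mean by a forward
availability computation over a CFG.  The representation (block names,
`gens`/`kills` sets) is hypothetical and glosses over within-block
ordering of a clobber before a load, but it shows the direction the
dataflow runs:

```python
def forward_availability(blocks, preds, gens, kills):
    """Forward dataflow for a single memory location.
    blocks: list of block names; preds: name -> list of predecessors;
    gens: blocks that load the location; kills: blocks that clobber it.
    A load is available at a block entry iff every path reaching it
    performs the load after the last clobber."""
    avail_in = {b: True for b in blocks}    # optimistic initialization
    avail_out = {b: True for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # available on entry only if available out of *all* preds
            inn = bool(preds[b]) and all(avail_out[p] for p in preds[b])
            # a gen makes it available; a kill takes it away
            out = (b in gens) or (inn and b not in kills)
            if (inn, out) != (avail_in[b], avail_out[b]):
                avail_in[b], avail_out[b] = inn, out
                changed = True
    return avail_in
```

On a diamond CFG, a load in only one arm leaves the merge block
unavailable; a load in both arms makes it available, which is the
all-paths property the backwards memdep walk only approximates.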
>
>     * It took me a few hours of squinting at the code to recognize
>     that weird backwards walk as being an implementation of
>     "anticipated", but it is.  The "available" part comes from
>     MemoryDependenceAnalysis and the forward dataflow algorithm based
>     on those inputs.
>
>
> MemoryDependenceAnalysis is a backwards walker; it does not compute 
> availability of a load.
> It computes the nearest dependencies.  We then try to take this and 
> turn it into the result of a forward availability problem.
(See above comment)
>
>
>     If you want to start inserting loads on paths which didn't
>     previously have them - for example the loop preheader in your
>     example in the previous email - this is not classic PRE as I'm
>     familiar with it.  It can be very profitable and worthwhile, but
>     is not classic PRE.   I'm not sure what this variant is known as
>     in the literature.
>
>
> This is definitely a standard SSA based PRE algorithm.  Every single 
> one I am familiar with will move code out of loops like this, be it 
> scalar or loads, regardless of the fact that it does, in fact, 
> increase the number of possible computations by one if the loop does 
> not execute.
I'd written a response on the technicalities here, then realized it 
really didn't matter.  From a practical perspective, we definitely want 
our PRE implementation to pull loads out of loops when legal, even if 
that does introduce a load along a path where it didn't previously 
exist.  (Note, there is that pesky legality issue.)

I've got a patch up for discussion right now in exactly this vein: 
http://reviews.llvm.org/D7061
>
>>     If you want this case to work, what you need is a real Load PRE
>>     implementation, not one based on simple memdep based load moving.
>>
>>     Trying to hack this one up with profile guided code duplication,
>>     etc, to get it to kinda sorta work seems to me to be a bad idea.
>     Not sure I really agree here, but I'm open to pushing as far as we
>     can with static heuristics first. I suspect that's likely to be
>     more stable w.r.t. optimization effect and applicable to more cases.
>
>
> So, LLVM has been through three PRE implementations.
> It had an implementation of e-path PRE, an implementation of GVNPRE 
> (that only worked on scalar code, I believe) a long long time ago, and 
> then the current one.
>
> The current one is used because it's fast and catches a lot of cases.
> The others were eliminated because they were essentially experimental 
> research-quality implementations, and were slow :)
After looking more at both the current implementation and the 
literature, I've largely accepted we need a more general PRE algorithm.  
Unfortunately, I really don't have the background to implement such a 
thing today.  I can gain it, but that's an investment of time I'm not 
sure I can justify right now.  I'm slowly ramping up, but it's taking a 
while.
>
> Because the current PRE is not value based, and it really has no idea 
> about memory, you will run into limitations on loads pretty quickly 
> (particularly loads indexed by things like induction variables).
> It already tries to work around some of them by doing some actual 
> value analysis of the load (and doing forwarding), but this only works 
> on basic cases.
Ok, again we've run into a terminology problem.  I don't understand what 
you mean by "is not value based" and "has no idea about memory".

Given that we're both in the Mountain View area, would you be interested 
in just meeting for lunch or a beer?  Trying to go back and forth in 
writing is going to be far slower than talking in person.
>
> GVNPRE can be implemented and made a lot faster than it was (In GCC, 
> it's actually one of the fastest high level optimization passes) with 
> very careful engineering, and would take care of all of these cases.
>
> I'm also fairly positive you could take any good anticipation based 
> PRE algorithm you can find, and I could make it value based and still 
> fast.
>
> But in the end, pushing the current one seems like a bad idea to me 
> because it wasn't designed for what you are trying to do - get good 
> PRE results, and do good load PRE, and we know of better ways to do that.
>
> All pushing it will do is push it out of the sweet spot it was 
> designed for.
(As I said above, I've largely convinced myself you're right.  I do want 
to extend the current algorithm in a few, hopefully narrow, ways, but in 
the mid to long run, we'll need something better.)
>
> If you want to get into profile-guided placement or speculative PRE, 
> it involves solving min-cut problems on very large graphs in most cases.
Yeah, I'd come to this conclusion myself.

Now, purely for the sake of argument, what's wrong with that?  Given you 
can use max flow to solve min-cut in O(V*E^2), that doesn't seem too 
bad.  I know we generally try to avoid extra linear growth, but it seems 
like you could a) reduce the graph substantially without much work, and 
b) cap the region you look at.
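To make the complexity claim concrete, here's a minimal Edmonds-Karp
sketch (BFS-based augmenting paths, which is where the O(V*E^2) bound
comes from).  The graph representation is hypothetical; after max-flow
terminates, the source side of the min cut is just whatever remains
reachable in the residual graph:

```python
from collections import deque, defaultdict

def min_cut(capacity, source, sink):
    """capacity: dict (u, v) -> edge capacity.
    Returns (max_flow_value, set of nodes on the source side of a min cut)."""
    flow = defaultdict(int)
    adj = defaultdict(set)
    for u, v in capacity:
        adj[u].add(v)
        adj[v].add(u)          # residual edges go both ways

    def bfs():
        # shortest augmenting path in the residual graph
        parent = {source: None}
        q = deque([source])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and capacity.get((u, v), 0) - flow[(u, v)] > 0:
                    parent[v] = u
                    if v == sink:
                        return parent
                    q.append(v)
        return None

    total = 0
    while (parent := bfs()) is not None:
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity.get((u, w), 0) - flow[(u, w)] for u, w in path)
        for u, w in path:
            flow[(u, w)] += bottleneck
            flow[(w, u)] -= bottleneck   # leaves residual capacity backwards
        total += bottleneck
    # min cut = residual reachability from the source
    reachable, q = {source}, deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reachable and capacity.get((u, v), 0) - flow[(u, v)] > 0:
                reachable.add(v)
                q.append(v)
    return total, reachable
```

Point (a) above matters because E here is CFG edges within the
availability region, which is usually far smaller than the whole function.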

Here's one possible approach.  Don't take this too seriously, I'm just 
thinking through the implications.

Start by computing forward availability for the location in question.  
(In particular, I'm assuming availability starts at the earliest point 
on a path a load would be legal, not the first place it occurs.  This 
isn't(?) quite the standard definition.)

Pose a weighted min-cut problem on this subset of the CFG identified as 
being fully available.  The source vertices are the nodes with no 
available predecessors, the sink vertices are the *uses* of the current 
loads, and the weights are profile weights + some factor to represent 
code size.

Before running mincut, summarize any loop in the original CFG which is 
fully available (and thus completely in the refined graph) with a single 
node.  We statically know that placing the load in the loop's preheader 
is at least as good as any other placement in the loop.
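The summarization step itself is just edge contraction.  A rough sketch
(hypothetical edge-set representation, one loop at a time):

```python
def summarize_loop(edges, loop_blocks, summary):
    """Collapse a fully-available loop into one node so the cut solver
    only ever considers the preheader placement.  Any edge endpoint
    inside loop_blocks is rewritten to the summary node; edges internal
    to the loop (including back edges) disappear as self-loops."""
    out = set()
    for u, v in edges:
        u2 = summary if u in loop_blocks else u
        v2 = summary if v in loop_blocks else v
        if u2 != v2:
            out.add((u2, v2))
    return out
```

Applied innermost-first using existing LoopInfo, this also shrinks the
graph before the (comparatively expensive) max-flow runs.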

Before running mincut, reduce weights to the minimum of all weights in 
dominating and "obviously" post dominating nodes.  This is just to 
reduce the amount of work path finding might do in a subgraph. It's not 
clear this would actually be worthwhile.  (If we had a post dom tree, we 
could remove the "obviously" part of that.)

To cap the region, we could do something like only include the subgraph 
contained by the parent of the innermost loop which contains the loads 
from the location we're interested in.  We'd then have to run this 
iteratively (which is potentially faster due to the loop summarization 
above), but we should reach the same final result...

(The starting assumption in the above is that we already have both dom 
tree information and loop information.  We can exploit that in the max 
flow problem.)
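Just to pin down how the cut problem would be posed: hook synthetic
terminals up to the boundary blocks and use profile weights as
capacities.  Everything here is hypothetical (the SIZE_FACTOR constant
in particular is a made-up knob for the code-size term):

```python
SIZE_FACTOR = 1  # hypothetical constant trading code size against hotness

def build_cut_problem(region_edges, profile_weight, sources, sinks):
    """region_edges: (u, v) edges of the fully-available subgraph.
    sources: blocks with no available predecessor; sinks: blocks that
    use the load.  Returns a capacity map with synthetic 'SRC'/'SNK'
    terminals, suitable for any max-flow/min-cut solver."""
    INF = float("inf")
    capacity = {}
    for u, v in region_edges:
        # cutting this edge = materializing a load here
        capacity[(u, v)] = profile_weight[(u, v)] + SIZE_FACTOR
    for s in sources:
        capacity[("SRC", s)] = INF   # terminal hookups must never be cut
    for t in sinks:
        capacity[(t, "SNK")] = INF
    return capacity
```

The min cut of this graph is then exactly the cheapest set of edges on
which to place loads so that every use is covered.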

There might also be room for a "cheap PRE" and an "expensive PRE" here.  
Particularly if the "cheap" one got things out of loops and the 
"expensive" one exploited the loop summarization trick...  Or possibly 
vice versa: getting loads out of loops is where it's worth spending 
time, so maybe we do an "expensive PRE" only on loops and the "cheap" 
version outside of inner loops?

Again, this feels like a topic where a) I need to read a lot more 
papers, and b) it would be good to talk in person.
>
> The only implementation i'm aware of that is relatively sane is 
> http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-horspool.pdf 
> <http://webdocs.cs.ualberta.ca/%7Eamaral/cascon/CDP05/slides/CDP05-horspool.pdf>
> and
> https://dspace.library.uvic.ca/bitstream/handle/1828/292/Pereira_0336403_Thesis-Final_21Dec2007_.pdf?sequence=1
>
> (This does not require solving min-cut problems)
I haven't gotten through this yet, but I'll add it to my reading list.
> In the end though, you are doing the work, so you get to decide what 
> you want to do :)
Yes, but I need to get it accepted by reviewers too.  :)

Philip
