[llvm-commits] [llvm] r142743 - in /llvm/trunk: lib/CodeGen/MachineBlockPlacement.cpp test/CodeGen/X86/block-placement.ll

Wed Oct 26 01:13:19 PDT 2011

Just a brief note at the moment, I'll dig into this more deeply hopefully
soon.

On Tue, Oct 25, 2011 at 11:18 PM, Andrew Trick <atrick at apple.com> wrote:

> Sorry, I've been neglecting this a bit. My initial impression of your
> current design is that the framework is good and the intention of the
> algorithm is good, but placeChainsTopogically could do a better job in
> few areas (1) biasing "warm" edges, (2) placing "cold" chains, (3)
> loop handling.
>

For reference, I couldn't agree more with the assessment. It's exactly the
conclusion I came to. However, it got far enough that the problems with our
probability heuristics became quite painful. I'm working on making those
better while mulling over some of the very issues you highlight.

> (1) "Warm" edges are those that cross chains, and whose successor is
> the head of a chain. In other words, you've decided to place the
> chains "in-line" in topological order. Placing them in any arbitrary
> topological/RPO order is not bad at all since you've already chosen
> good chains. But consider a merge point with three or more
> predecessors. You actually want the most frequent predecessor to fall
> through to the merge, unless the head of that chain dominates the
> other predecessors. This is tricky, rife with corner cases, and
> probably not too important, especially without perfect profile
> information. So it would be fair to punt on this issue.
>
> (2) "Cold" edges branch into the middle of another chain. But it looks
> like placeChainsTopogically happily follows the RPO order of the
> chain's head. I'm guessing they'll end up in strange places unrelated
> to their adjacent chains. This doesn't seem like what you intended. If
> you avoid placing those in RPO order, then they should float down to
> the end of the function.
>

Yea, the way #1 (and to an extent #2) work currently is that we merge the
"best" successor with a particular block whenever it doesn't violate the
topological ordering, and when we see a particularly hot edge that does, we
"merge" it to force sequential layout. Then the cold chain ends up getting
inserted the next time we fail to merge. You're absolutely correct that this
isn't really great on either side.

Handling the cold chains is, I suspect, the easier of the two problems. I
just think we should have a completely separate layout mechanism for them
that runs after finishing the warm chain layout.

I'm still not 100% certain what the best approach is to ensure warm chain
layout is as good as it can be within the topological ordering... Going to
keep thinking here.

> (3) I've always thought we would regret that LoopInfo doesn't keep
> it's list of blocks in RPO order.

YES. I would greatly appreciate making an RPO order guarantee at the loop
info and/or function layer from an analysis that I can piggyback on. Doing
it at the loop layer would be wonderful for implementing this kind of
algorithm.

> You're currently sidestepping the
> issue by placing chains in a single iteration through the function's
> blocks. If that works well, I don't have a problem with it, but it
> seems to complicate the issue of loop layout. For example, our simple
> RPO iterator makes no attempt to visit blocks in loop order, so
> following this order will not result in contiguous loops. I think
> laying out one loop at a time is much easier to deal with. If you want
> to follow blocks within a loop in RPO order, you can use the
> LoopIterator that I recently added for this purpose. It does need to
> be adapted for MachineLoopInfo.

I'll look into LoopIterator. If it were to get adapted for MachineLoopInfo,
that would certainly help. =D

I wonder however, the Loop object already has a vector of BB*; it would seem
like it could just place that vector in RPO, and be done with it? Dunno, as
I've not looked at LoopIterator, etc.

I actually like the BranchProbability
> approach of implementing the generic code in an implementation header
> that's only included once per instantiation
> (BranchProbabilityInfo.cpp + MachineBranchProbabilityInfo.cpp).
>

I think you mean BlockFrequencyInfo, but yes I agree with the spirit here.
LoopInfo is already done this way, with generic templated base classes that
are specialized for MachineLoop* classes.

> Another alternative is to do the topological sort without relying on
> DFS. This can be done using a worklist of valid blocks. You need to
> attach a "remaining predecessor" counter to each block. When all preds
> have been laid out (or pushed out-of-line), the block can be pushed on
> the valid queue. This might provide the freedom you need to implement
> better profile-based heuristics, but could also result in more
> arbitrary shuffling of blocks in the absence of good heuristics for
> picking the next candidate. In other words, it could be the basis of a
> solution to both the "warm" layout problem, and loop handling, but you
> have an additional problem of choosing a "valid" block that is somehow
> related to the last block that was laid out. I supposed you could just
> sort the chains within the valid set similar to how you sorted the
> full chain list in the Pettis & Hansen prototype. I think it's an
> interesting idea, but haven't completely thought through it.
>

This is actually a really interesting approach. I'm going to think about it
more. Having a worklist would indeed simplify much, and as you mention
below, it shouldn't be hard to pull in the existing layout as a tie breaker
which won't introduce noise.

Implementation Details:
>
> Do you think it's important to keep a vector of blocks inside each
> chain, as opposed to incrementally updating the existing ilist and
> storing pointers to the chain head and tail? AFAIK, you can reorder
> the blocks all you want then update terminators later, but I could be
> wrong.

My first implementation did this, and I have to say I found it frought with
peril. Iterator invalidation is a nightmare here. Also, once you commit to
doing this, you loose the easy ability to check for the existing code layout
as a metric. As long as the merging occurs in a forward-moving algorithm,
and thus we append to block chains only, I suspect the vector approach will
be both simpler and not noticeably slower. There are already special cases
to avoid creating needless 1-block chains, and a small vector could be used
to further reduce malloc traffic. I just wanted to wait for a sad profile
before doing anything drastic.

Either way, chain merge can't be constant time because of the
> BlockChain map update. But you could avoid regrowing a lot of
> vectors. Along those lines, if you ever decide to use union-find, we
> have IntEqClasses.
>
> Should we have an inline version of getEdgeProbability() that simply
> returns BranchProbability::getOne() for single-successor blocks?

Probably. I'll look into to that as I clean up BranchProbability stuff.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20111026/cec9b57a/attachment.html>