[LLVMdev] Question regarding basic-block placement optimization
atrick at apple.com
Fri Oct 21 18:17:07 PDT 2011
On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:
> On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
> > A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.
> Thanks, Chandler. This is great stuff.
> > Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.
> I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.
> Awesome, same way I feel. Checked in as r142641. I've got some generic cleanup I'll go ahead and commit to it over the next few days.
> Done, and in progress. SHould have the loop alignment stuff cloned right away, and then i'll start looking at the loop structure issues. I'll probably have some questions for you and/or andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.
Your work looks good and there's talk of making this the default
layout pass. So let's figure out the goals and requirements. We agree
that preserving "structure" of the CFG is important in the absence of
profile information to the contrary. This is important for performance
stability in the presence of imperfect profile data and unbiased
branches. But it's also important for the sanity of fellow compiler
engineers and anyone debugging the disassembly. We can define
structure very simply by saying that we want contiguous loops and
topologically ordered regions within a loop. Those are not the
strongest constraints. We could still end up with strangely shuffled
code, but they are easy constraints to define.
Once you decide to break these constraints, you have effectively
designated the offending paths as "cold". It's up to you how to
decide which paths are cold. You'll need some threshold based on
confidence in the profile, and we can measure the effectiveness of
your heuristics and add more over time. The key point is that the
layout algorithm needs to make an explicit binary decision at some
point to violate the natural structure of the CFG.
I don't care much about how you layout the cold chains as long as they
are not interleaved with the rest of the code, thus breaking the
topological ordering and forcing extra branches. In practice, this
means they get their own little Siberia after the function return,
effectively splitting the function into two sections. However, the
notion of "coldness" is really relative to the current loop head. So
you may layout a nice contiguous, topologically ordered inner loop,
then determine the whole thing is cold relative to its outer loop, so
send the whole thing packing to Siberia. That's exactly what we
want. The loop itself is sane, but doesn't get in the way of hotter
Even within the constraints imposed by CFG structure, you can reduce
the likelihood of taken branches. Here you can employ any
profile-based or static heuristics with lower confidence, because the
resulting layout will be reasonably good either way. This will still
do a great job in the absence of accurate profile information, builtin
expect, or very strong static heuristics (e.g. landing pads are cold).
This is where the algorithm gets interesting and you can be creative,
but I'll give you a rough idea so we're on the same page:
1) Visit each loop independently inside-out.
2) Construct chains of blocks within the loop. Here the "entry" chain
is the loop header, or function entry. You can probably represent
each already laid out inner loop as a single chain. The graph of
chains within a loop is *acyclic* except for irreducible control
flow, which doesn't have any structure to preserve anyway. There's
no need to compute SCCs.
3) Stop constructing the chain at a CFG merge. Chaining past a merge
would immediately violate topological ordering, which is only ok if
other predecessors of the merge have been designated "cold".
4) Splice the chains together, preserving topological order. Assigning
each block to a DFS reverse-post-order number provides a nice
ordering that is easy to compare or sort.
Let me give you a little example to think about:
br B(80%), C(20%)
br D(90%), E(10%)
This is a fine topological sort but bad layout based on expected
branch probability (even with low confidence in the profile).
The only good layout that preserves topological ordering is:
A, B, C, E, D, F
.8 * 1 branch + .18 * 2 branches + .02 * 2 branches
This approach likely covers the existing functionality of
CodePlacement, which attempts to keep loops contiguous.
You will probably need to do the loop tail adjustment that
CodePlacement currently does as a special case.
(Shift the tail of the loop to precede its header, resulting in
a fall-through loop test.)
We can discuss the design more, but you may want to think about it and
come with your own design first.
Finally, we need some way to validate the design and verify the
implementation. Weighted statistics will be very useful here, similar
to those that Jakob used in the register allocator. For example:
For each nonadjacent edge:
TakenBranchFrequency += frequency(Edge)
This metric will prefer the most "extreme" layout because it assumes
perfect accuracy of block frequencies. It would be nice to look at the
same metric computed from skewed branch probabilities to see how
sensitive it is to small shifts in branch behavior.
Taken branch frequency is the most obvious, but doesn't completely
capture locality. You could also measure distance jumped fairly
easily, also weighted by edge frequency.
It might be useful to gather these stats as late as possible to take
into consideration the final impact of layout, including the effect on
any later passes.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-dev