[LLVMdev] Question regarding basic-block placement optimization

Sat Oct 22 04:25:33 PDT 2011

On Fri, Oct 21, 2011 at 6:17 PM, Andrew Trick <atrick at apple.com> wrote:

> On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:
>
> On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk>wrote:
>>
>> On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:
>>
>> > A new patch is attached that is *much* less of a rough draft. Sorry for
>> any confusion due to the early state of the patch.
>>
>> Thanks, Chandler. This is great stuff.
>>
>> > Still, I never intended this to be on-by-default at first. This is a
>> starting point, that I hope can be improved into something that is
>> on-by-default eventually, but I'll be the first to say it's almost certainly
>> not ready for that just yet.
>>
>> I would like to see this go in ASAP under a flag. I prefer to see
>> development as commits rather than a series of updated patches.
>
> ...
>
> Awesome, same way I feel. Checked in as r142641. I've got some generic
> cleanup I'll go ahead and commit to it over the next few days.
> Done, and in progress. SHould have the loop alignment stuff cloned right
> away, and then i'll start looking at the loop structure issues. I'll
> probably have some questions for you and/or andy about exactly how that
> should work, whether CodePlacementOpt is doing the right thing, etc.
>
>
> Your work looks good and there's talk of making this the default
> layout pass. So let's figure out the goals and requirements. We agree
> that preserving "structure" of the CFG is important in the absence of
> profile information to the contrary. This is important for performance
> stability in the presence of imperfect profile data and unbiased
> branches. But it's also important for the sanity of fellow compiler
> engineers and anyone debugging the disassembly. We can define
> structure very simply by saying that we want contiguous loops and
> topologically ordered regions within a loop. Those are not the
> strongest constraints. We could still end up with strangely shuffled
> code, but they are easy constraints to define.
>
> Once you decide to break these constraints, you have effectively
> designated the offending paths as "cold". It's up to you how to
> decide which paths are cold. You'll need some threshold based on
> confidence in the profile, and we can measure the effectiveness of
> your heuristics and add more over time. The key point is that the
> layout algorithm needs to make an explicit binary decision at some
> point to violate the natural structure of the CFG.
>
> I don't care much about how you layout the cold chains as long as they
> are not interleaved with the rest of the code, thus breaking the
> topological ordering and forcing extra branches. In practice, this
> means they get their own little Siberia after the function return,
> effectively splitting the function into two sections. However, the
> notion of "coldness" is really relative to the current loop head. So
> you may layout a nice contiguous, topologically ordered inner loop,
> then determine the whole thing is cold relative to its outer loop, so
> send the whole thing packing to Siberia. That's exactly what we
> want. The loop itself is sane, but doesn't get in the way of hotter
> enclosing code.
>
> Even within the constraints imposed by CFG structure, you can reduce
> the likelihood of taken branches. Here you can employ any
> profile-based or static heuristics with lower confidence, because the
> resulting layout will be reasonably good either way. This will still
> do a great job in the absence of accurate profile information, builtin
> expect, or very strong static heuristics (e.g. landing pads are cold).
>
> This is where the algorithm gets interesting and you can be creative,
> but I'll give you a rough idea so we're on the same page:
>
> 1) Visit each loop independently inside-out.
>
> 2) Construct chains of blocks within the loop. Here the "entry" chain
>    is the loop header, or function entry. You can probably represent
>    each already laid out inner loop as a single chain. The graph of
>    chains within a loop is *acyclic* except for irreducible control
>    flow, which doesn't have any structure to preserve anyway. There's
>    no need to compute SCCs.
>
> 3) Stop constructing the chain at a CFG merge. Chaining past a merge
>    would immediately violate topological ordering, which is only ok if
>    other predecessors of the merge have been designated "cold".
>
> 4) Splice the chains together, preserving topological order. Assigning
>    each block to a DFS reverse-post-order number provides a nice
>    ordering that is easy to compare or sort.
>

FYI, I'm still working on the details, and the code is a complete hack, but
I have a big chunk of this working now. =] The more precise strategy I
implemented follows your suggestion very closely. I've got lots more ideas
to layer on top of this, but this is my first stab at providing a better
underpinning.

First, it walks all loops in the function from the inside out just as you
describe. =] Really like this, it's a very nice model to use.

For each loop, after having walked all sub-loops, it walks forward across
all the blocks in the loop. For each block not part of a chain, it creates a
single-block chain. If the block has more than one predecessor (I think this
is what you were after w/ a "merge point"?), we're done. If the block is the
start of the chain it belongs to, and if it is the most likely of the
in-loop successors of the previous block, and that previous block is the
tail of its chain, we merge its chain to the previous block's chain. This
should only form chains which are valid in the existing topological ordering
within a loop, and don't cross loop boundaries or cross incoming branches.

Then we traverse the BB graph in RPO, and for each block look up its chain.
The first time we encounter a chain, we splice it into the function.

I think this satisfies your desire to preserve the structural components of
the CFG? If not, I'd love to understand where I've deviated / what I've
misunderstood.

It already makes some minimal use of the probabilities to select which
topology-preserving fall-through options to prefer. I'm also going to
tie-break with the existing layout. I've not started to really integrate it
with probabilities though, I wanted to get something that fit the constraint
model you're looking for first, and had clean traversal semantics.

I'll clean the patch up and mail it out tomorrow. The implementation gets
tremendously simpler by A) using the loop info, and B) exclusively walking
forward over the blocks. The first simplified the traversal greatly, and the
second made it easy to just store pointers to basic blocks for the chains
rather than moving blocks around in the function itself. That gets rid of a
bunch of other code, and the most bug-prone of the other code. So all in
all, I like where this is going.

> Let me give you a little example to think about:
>
> A:
>   br B(80%), C(20%)
> B:
>   br D
> C:
>   br D(90%), E(10%)
> D:
>   br F
> E:
>   br F
> F:
>   ret
>
> This is a fine topological sort but bad layout based on expected
> branch probability (even with low confidence in the profile).
>
> The only good layout that preserves topological ordering is:
> A, B, C, E, D, F
>
> .8 * 1 branch + .18 * 2 branches + .02 * 2 branches
>
> This approach likely covers the existing functionality of
> CodePlacement, which attempts to keep loops contiguous.
> You will probably need to do the loop tail adjustment that
> CodePlacement currently does as a special case.
> (Shift the tail of the loop to precede its header, resulting in
> a fall-through loop test.)
>
> We can discuss the design more, but you may want to think about it and
> come with your own design first.
>
> Finally, we need some way to validate the design and verify the
> implementation. Weighted statistics will be very useful here, similar
> to those that Jakob used in the register allocator. For example:
>
> For each nonadjacent edge:
>   TakenBranchFrequency += frequency(Edge)
>
> This metric will prefer the most "extreme" layout because it assumes
> perfect accuracy of block frequencies. It would be nice to look at the
> same metric computed from skewed branch probabilities to see how
> sensitive it is to small shifts in branch behavior.
>
> Taken branch frequency is the most obvious, but doesn't completely
> capture locality. You could also measure distance jumped fairly
> easily, also weighted by edge frequency.
>
> It might be useful to gather these stats as late as possible to take
> into consideration the final impact of layout, including the effect on
> any later passes.
>
> -Andy
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111022/6a6a1fb5/attachment.html>