<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><div>On Oct 20, 2011, at 11:53 PM, Chandler Carruth wrote:</div><blockquote type="cite"><div class="gmail_quote">On Thu, Oct 20, 2011 at 10:53 AM, Jakob Stoklund Olesen <span dir="ltr"><<a href="mailto:stoklund@2pi.dk">stoklund@2pi.dk</a>></span> wrote:<blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "><div class="im">

On Oct 20, 2011, at 9:56 AM, Chandler Carruth wrote:<br>

<br>

> A new patch is attached that is *much* less of a rough draft. Sorry for any confusion due to the early state of the patch.<br>

<br>

</div>Thanks, Chandler. This is great stuff.<br>

<div class="im"><br>

> Still, I never intended this to be on-by-default at first. This is a starting point, that I hope can be improved into something that is on-by-default eventually, but I'll be the first to say it's almost certainly not ready for that just yet.<br>


<br>

</div>I would like to see this go in ASAP under a flag. I prefer to see development as commits rather than a series of updated patches.</blockquote></div></blockquote>...<br><blockquote type="cite"><div class="gmail_quote"><div>Awesome, same way I feel. Checked in as r142641. I've got some generic cleanup I'll go ahead and commit to it over the next few days.</div><div>Done, and in progress. SHould have the loop alignment stuff cloned right away, and then i'll start looking at the loop structure issues. I'll probably have some questions for you and/or andy about exactly how that should work, whether CodePlacementOpt is doing the right thing, etc.</div>

</div>

</blockquote></div><br><div><div>Your work looks good and there's talk of making this the default</div><div>layout pass. So let's figure out the goals and requirements. We agree</div><div>that preserving "structure" of the CFG is important in the absence of</div><div>profile information to the contrary. This is important for performance</div><div>stability in the presence of imperfect profile data and unbiased</div><div>branches. But it's also important for the sanity of fellow compiler</div><div>engineers and anyone debugging the disassembly. We can define</div><div>structure very simply by saying that we want contiguous loops and</div><div>topologically ordered regions within a loop. Those are not the</div><div>strongest constraints. We could still end up with strangely shuffled</div><div>code, but they are easy constraints to define.</div><div><br></div><div>Once you decide to break these constraints, you have effectively</div><div>designated the offending paths as "cold". It's up to you how to</div><div>decide which paths are cold. You'll need some threshold based on</div><div>confidence in the profile, and we can measure the effectiveness of</div><div>your heuristics and add more over time. The key point is that the</div><div>layout algorithm needs to make an explicit binary decision at some</div><div>point to violate the natural structure of the CFG.</div><div><br></div><div>I don't care much about how you layout the cold chains as long as they</div><div>are not interleaved with the rest of the code, thus breaking the</div><div>topological ordering and forcing extra branches. In practice, this</div><div>means they get their own little Siberia after the function return,</div><div>effectively splitting the function into two sections. However, the</div><div>notion of "coldness" is really relative to the current loop head. So</div><div>you may layout a nice contiguous, topologically ordered inner loop,</div><div>then determine the whole thing is cold relative to its outer loop, so</div><div>send the whole thing packing to Siberia. That's exactly what we</div><div>want. The loop itself is sane, but doesn't get in the way of hotter</div><div>enclosing code.</div><div><br></div><div>Even within the constraints imposed by CFG structure, you can reduce</div><div>the likelihood of taken branches. Here you can employ any</div><div>profile-based or static heuristics with lower confidence, because the</div><div>resulting layout will be reasonably good either way. This will still</div><div>do a great job in the absence of accurate profile information, builtin</div><div>expect, or very strong static heuristics (e.g. landing pads are cold).</div><div><br></div><div>This is where the algorithm gets interesting and you can be creative,</div><div>but I'll give you a rough idea so we're on the same page:</div><div><br></div><div>1) Visit each loop independently inside-out.</div><div><br></div><div>2) Construct chains of blocks within the loop. Here the "entry" chain</div><div>   is the loop header, or function entry. You can probably represent</div><div>   each already laid out inner loop as a single chain. The graph of</div><div>   chains within a loop is *acyclic* except for irreducible control</div><div>   flow, which doesn't have any structure to preserve anyway. There's</div><div>   no need to compute SCCs.</div><div><br></div><div>3) Stop constructing the chain at a CFG merge. Chaining past a merge</div><div>   would immediately violate topological ordering, which is only ok if</div><div>   other predecessors of the merge have been designated "cold".</div><div><br></div><div>4) Splice the chains together, preserving topological order. Assigning</div><div>   each block to a DFS reverse-post-order number provides a nice</div><div>   ordering that is easy to compare or sort.</div><div><br></div><div>Let me give you a little example to think about:</div><div><br></div><div>A:</div><div>  br B(80%), C(20%)</div><div>B:</div><div>  br D</div><div>C:</div><div>  br D(90%), E(10%)</div><div>D:</div><div>  br F</div><div>E:</div><div>  br F</div><div>F:</div><div>  ret</div><div><br></div><div>This is a fine topological sort but bad layout based on expected</div><div>branch probability (even with low confidence in the profile).</div><div><br></div><div>The only good layout that preserves topological ordering is:</div><div>A, B, C, E, D, F</div><div><br></div><div>.8 * 1 branch + .18 * 2 branches + .02 * 2 branches</div><div><br></div><div>This approach likely covers the existing functionality of</div><div>CodePlacement, which attempts to keep loops contiguous.</div><div>You will probably need to do the loop tail adjustment that</div><div>CodePlacement currently does as a special case.</div><div>(Shift the tail of the loop to precede its header, resulting in</div><div>a fall-through loop test.)</div><div><br></div><div>We can discuss the design more, but you may want to think about it and</div><div>come with your own design first.</div><div><br></div><div>Finally, we need some way to validate the design and verify the</div><div>implementation. Weighted statistics will be very useful here, similar</div><div>to those that Jakob used in the register allocator. For example:</div><div><br></div><div>For each nonadjacent edge:</div><div>  TakenBranchFrequency += frequency(Edge)</div><div><br></div><div>This metric will prefer the most "extreme" layout because it assumes</div><div>perfect accuracy of block frequencies. It would be nice to look at the</div><div>same metric computed from skewed branch probabilities to see how</div><div>sensitive it is to small shifts in branch behavior.</div><div><br></div><div>Taken branch frequency is the most obvious, but doesn't completely</div><div>capture locality. You could also measure distance jumped fairly</div><div>easily, also weighted by edge frequency.</div><div><br></div><div>It might be useful to gather these stats as late as possible to take</div><div>into consideration the final impact of layout, including the effect on</div><div>any later passes.</div></div><div><br></div><div>-Andy</div></body></html>