<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 18, 2017 at 1:16 PM, Andrew Trick <span dir="ltr"><<a href="mailto:atrick@apple.com" target="_blank" class="cremed">atrick@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><blockquote type="cite"><div><div class="h5"><div>On Sep 14, 2017, at 6:53 PM, Kyle Butt via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank" class="cremed">llvm-dev@lists.llvm.org</a>> wrote:</div><br class="m_-4149091655691126002Apple-interchange-newline"></div></div><div><div><div class="h5"><div dir="ltr"><div style="font-size:12.8px">I plan on rewriting the block placement algorithm to proceed by traces.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">A trace is a chain of blocks where each block in the chain may fall through to</div><div style="font-size:12.8px">the successor in the chain.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">The overall algorithm would be to first produce traces for a function, and then</div><div style="font-size:12.8px">order those traces to try and get cache locality.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Currently block placement uses a greedy single step approach to layout. It</div><div style="font-size:12.8px">produces chains working from inner to outer loops. Unlike a trace, a chain may</div><div style="font-size:12.8px">contain non-fallthrough edges. This causes problems with loop layout. The main</div><div style="font-size:12.8px">problems with loop layout are: loop rotation and cold blocks in a loop.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Overview of proposed solution:</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Phase 1:</div><div style="font-size:12.8px">Greedily produce a set of traces through the function. A trace is a list of</div><div style="font-size:12.8px">blocks with each block in the list falling through (possibly conditionally) to</div><div style="font-size:12.8px">the next block in the list. Loop rotation will occur naturally in this phase via</div><div style="font-size:12.8px">the triangle replacement algorithm below. Handling single trace loops requires a</div><div style="font-size:12.8px">tweak, see the detailed design.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Phase 2:</div><div style="font-size:12.8px">After producing what we believe are the best traces, they need to be ordered.</div><div style="font-size:12.8px">They will be ordered topologically, except that traces that are cold enough (As</div><div style="font-size:12.8px">measured by their warmest block) will be floated later, This may push them out</div><div style="font-size:12.8px">of a loop or to the end of the function.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Detailed Design</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Note whenever an edge is used as a number, I am referring to the edge frequency.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Phase 1: Producing traces</div><div style="font-size:12.8px">Traces are produced according to the following algorithm:</div><div style="font-size:12.8px"> * Sort the edges according to weight, stable-sorting them according the incoming</div><div style="font-size:12.8px">block and edge ordering.</div><div style="font-size:12.8px"> * Place each block in a trace of length 1.</div><div style="font-size:12.8px"> * For each edge in order:</div><div style="font-size:12.8px"> * If the source is at the end of a trace, and the target is at the beginning</div><div style="font-size:12.8px"> of a trace, glue those 2 traces into 1 longer trace.</div><div style="font-size:12.8px"> * If an edge has a target or source in the middle of another trace, consider</div><div style="font-size:12.8px"> tail duplication. The benefit calculation is the same as the existing</div><div style="font-size:12.8px"> code.</div><div style="font-size:12.8px"> * If an edge has a source or target in the middle, check them to see if they</div><div style="font-size:12.8px"> can be replaced as a triangle. (Triangle replacement described below)</div><div style="font-size:12.8px"> * Compare the benefit of choosing the edge, along with any triangles</div><div style="font-size:12.8px"> found, with the cost of breaking the existing edges.</div><div style="font-size:12.8px"> * If it is a net benefit, perform the switch.</div><div style="font-size:12.8px"> * Triangle checking:</div><div style="font-size:12.8px"> Consider a trace in 2 parts: A1->A2, and the current edge under consideration</div><div style="font-size:12.8px"> is A1->B (the case for C->A2 is mirror, and both may need to be done)</div><div style="font-size:12.8px"> * First find the best alternative C->B</div><div style="font-size:12.8px"> * Check for an alternative for A2: D->A2</div><div style="font-size:12.8px"> * Find D's best Alternative: D->E</div><div style="font-size:12.8px"> * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B + D->A2</div><div style="font-size:12.8px"> * If the 2nd sum is bigger, do the switch.</div><div style="font-size:12.8px"> * Loop Rotation Tweak:</div><div style="font-size:12.8px"> If A contains a backedge A2->A1, then when considering A1->B or C->A2, we</div><div style="font-size:12.8px"> can include that backedge in the gain:</div><div style="font-size:12.8px"> A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Phase 2: Order traces.</div><div style="font-size:12.8px">First we compute the frequency of a trace by finding the max frequency of any of</div><div style="font-size:12.8px">its blocks.</div><div style="font-size:12.8px">Then we attempt to place the traces topologically. When a trace cannot be placed</div><div style="font-size:12.8px">topologically, we prefer warmer traces first.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Questions and comments welcome.</div></div></div></div>
______________________________<wbr>_________________<br></div></blockquote><br></div><div>This algorithm should be easy enough to implement and experiment with out of tree. A reasonable goal would be to identify cases that the current, more sophisticated algorithm are handling poorly. At that point, it would be much more useful to fix the current algorithm’s heuristics rather than introducing another less sophisticated alternative.</div></div></blockquote><div><br></div><div>I wrote many of the current algorithms heuristics, and they feel like hacks to work around specific problems.</div><div><br></div><div>What I'm proposing is more, not less sophisticated. Handling loop rotation as part of a general lookahead problem is more sophisticated than the existing method of laying out the loop and praying that one of the n rotations will be a good one.</div><div>Even the tail-duplication support I added is kind of a hack. It would be better to do block cloning to handle cases where tail-duplication currently fails</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><br></div><div>The current algorithm deals well with partial profile information and maintains a number of layout properties relative to the CFG structure. I’m not the best person to explain all of this, but you will certainly need to understand the original goals, show concrete examples of how it “fails” and can’t be easily fixed, before you propose replacing it.</div></div></blockquote><div><br></div><div>I'll make sure to include test cases with any diffs I mail out.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><br></div><div>-Andy</div></div></blockquote></div><br></div></div>