<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><div>On Nov 28, 2010, at 11:39 PM, Xinliang David Li wrote:</div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; ">1. Run the inliner on bar (noop, since it has no call sites)<br>
2. Run the per-function passes on bar. This generally shrinks it, and prevents "abstraction penalty" from making bar look too big to inline.<br>
3. Run the inliner on foo. Since foo calls bar, we consider inlining bar into foo and do so if profitable.<br>
4. Run the per-function passes on foo. If bar got inlined, this means that we're running the per-function passes over the inlined contents of bar again.<br></blockquote><div><br></div><div>On-the-fly clean up (optimization) while doing bottom up inlining is nice as you described. Many other compilers chose not to do this way due to scalability concerns (with IPO) -- this can make the IPO the biggest bottom neck in terms of compile time (as it is serialized). Memory many not be a big issue for LLVM as I can see the good locality in pass manager. (Just curious, what is biggest application LLVM can build with IPO?)</div></div></blockquote><div><br></div><div>I don't really know, and I agree with you that LLVM's LTO isn't very scalable (it currently loads all the IR into memory). I haven't thought a lot about this, but I'd tackle that problem in three stages:</div><div><br></div><div>1. Our LTO model runs optimizations at both compile and link time, the compile-time optimizations should work as they do now IMO. This is controversial though, because doing so could cause (e.g.) an inlining to happen "early" that would be seen as a bad idea with full LTO information. The advantage of doing compile-time optimizations is that it both shrinks the IR, and speeds up an incremental rebuild by avoiding having to do simple optimizations again.</div><div><br></div><div>2. At LTO time, the bottom-up processing of the callgraph is still goodness and presents good locality (unless you have very very large SCC's). The tweak that we'd have to implement is lazy deserialization (already implemented) and reserialization to disk (which is missing). With this, you get much better memory footprint than "hold everything in memory at once".</div><div><br></div><div>3. To support multiple cores/machines, you break the callgraph SCC DAG into parallel chunks that can be farmed out. There is a lot of parallelism in a DAG.</div><div><br></div><div>I don't know of anyone planning on working on LTO at the moment though.</div><div><br></div><blockquote type="cite"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; ">In a traditional optimizer like GCC's, you end up with problems where you have to set a high inline threshold due to inlining-before-optimizing causing "abstraction penalty problems". An early inliner is a hack that tries to address this. </blockquote>
<div><br></div><div>It is a hack in some sense (but a common practice) -- but enables other flexibilities. </div></div></blockquote><div><br></div><div>The hack I'm referring to is the "raise the inline threshold". If the inliner has any language specificity to its inline threshold, I consider it a hack. There is no reason the inliner should have to know if it's building C or C++ code. It should be guided based on the structure of the code.</div><div><br></div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "> Another problem with this approach from the compile time perspective is that you end up repeating work multiple times. For example, if there is a common subexpression in a small function, you end up inlining it into many places, then having to eliminate the common subexpression in each copy.<br>
</blockquote><div><br></div><div>Early inlining + scalar opt can do the same, right?</div></div></blockquote><div><br></div><div>In some cases, but not in general, because you run into phase ordering problems.</div><div><br></div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "><div class="im">This is true, but I don't think it's a really large problem in practice. We don't have a "global inline threshold limit" (which I've never understood, except as a hack to prevent run-away inlining) so not visiting in priority order shouldn't prevent high-priority-but-processed-late candidates from being inlined.</div>
</blockquote><div><br></div><div>global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit. </div></div></blockquote><div><br></div><div>I understand that, but that implies that you have some model for code locality. Setting a global code growth limit is (in my opinion) a hack unless you are aiming for the whole program to fit in the icache (which I don't think anyone tries to do :).</div><div><br></div><div>With any other limit that is higher than your icache size, you are basically picking an *arbitrary* limit that is not based on the machine model or the instruction locality of the program. </div><div><br></div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; ">
The "punish functions containing lots of calls" is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren't themselves doing much work.<br>
</blockquote><div><br></div><div>My point is that using static count of callsites as a indicator for this can be misleading. All the calls may be calls to cold external functions for instance.</div></div></blockquote><br></div><div>Absolutely true. It may also be completely wrong for some functions. It's a heuristic :)</div><div><br></div><div>-Chris</div><br></body></html>