<br><br><div class="gmail_quote">On Mon, Nov 29, 2010 at 10:56 AM, Chris Lattner <span dir="ltr"><<a href="mailto:clattner@apple.com">clattner@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div style="word-wrap:break-word"><div><div class="im"><div>On Nov 28, 2010, at 11:39 PM, Xinliang David Li wrote:</div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204, 204, 204);border-left-style:solid;padding-left:1ex">

1. Run the inliner on bar (noop, since it has no call sites)<br>

2. Run the per-function passes on bar.  This generally shrinks it, and prevents "abstraction penalty" from making bar look too big to inline.<br>

3. Run the inliner on foo.  Since foo calls bar, we consider inlining bar into foo and do so if profitable.<br>

4. Run the per-function passes on foo.  If bar got inlined, this means that we're running the per-function passes over the inlined contents of bar again.<br></blockquote><div><br></div><div>On-the-fly clean up (optimization) while doing bottom up inlining is nice as you described. Many other compilers chose not to do this way due to scalability concerns (with IPO) -- this can make the IPO the biggest bottom neck in terms of compile time (as it is serialized).  Memory many not be a big issue for LLVM as I can see the good locality in pass manager. (Just curious, what is biggest application LLVM can build with IPO?)</div>

</div></blockquote><div><br></div></div><div>I don't really know, and I agree with you that LLVM's LTO isn't very scalable (it currently loads all the IR into memory).  I haven't thought a lot about this, but I'd tackle that problem in three stages:</div>

<div><br></div><div>1. Our LTO model runs optimizations at both compile and link time, the compile-time optimizations should work as they do now IMO.  This is controversial though, because doing so could cause (e.g.) an inlining to happen "early" that would be seen as a bad idea with full LTO information.  The advantage of doing compile-time optimizations is that it both shrinks the IR, and speeds up an incremental rebuild by avoiding having to do simple optimizations again.</div>

<div><br></div><div>2. At LTO time, the bottom-up processing of the callgraph is still goodness and presents good locality (unless you have very very large SCC's).  The tweak that we'd have to implement is lazy deserialization (already implemented) and reserialization to disk (which is missing).  With this, you get much better memory footprint than "hold everything in memory at once".</div>

</div></div></blockquote><div><br></div><div><br></div><div>IR is just one memory consumer.  In LTO, there are also global data structures : global symtab, global type table, call graph, points-to graph, mod-ref info etc. In a compiler I worked with before, serialization is done on points-to graph and mod-ref info after the info is mapped from a global view to a per TU local view, and IR for each TU is mapped/unmapped on demand. The type/symbol info per TU is in different segment from the the code segment.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word"><div><div><br></div><div>3. To support multiple cores/machines, you break the callgraph SCC DAG into parallel chunks that can be farmed out.  There is a lot of parallelism in a DAG.</div>

<div><br></div></div></div></blockquote><div><br></div><div>Parallelism distributed across machines can be tricky -- involving lots of overhead such as rpc and data passing.</div><div><br></div><div>You may also be surprised with side effects due to the parallelism using multi-core -- thrashing due to memory contention -- some per function level pass may use lots of memory for temporary data structure. It won't scale for the compiler workload -- i.e. get 2x speedup using 8 core.</div>

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word"><div><div></div><div><br></div><div>I understand that, but that implies that you have some model for code locality.  Setting a global code growth limit is (in my opinion) a hack unless you are aiming for the whole program to fit in the icache (which I don't think anyone tries to do :).</div>

<div><br></div></div></div></blockquote><div><br></div><div><br></div><div>Yes, global growth limit may be good for size control, but is a hack for control icache footprint. However, as I mentioned, the bottom up inline scheme make it impossible to use any heuristics involving 'global limit' which can be more complicated and fancier than the simple growth limit.  For instance, there is no restriction that only one global limit can be used ---  the compiler can partition the call graph into multiple locality regions, and set icache limit for each region. The inlining order can be done on a region by region basis. For each region, the region limit is applied and the priority queue must be used.</div>

<div><br></div><div>Thanks,</div><div><br></div><div>David</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word">

<div><div></div><div>With any other limit that is higher than your icache size, you are basically picking an *arbitrary* limit that is not based on the machine model or the instruction locality of the program. </div><div class="im">

<div><br></div><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204, 204, 204);border-left-style:solid;padding-left:1ex">


The "punish functions containing lots of calls" is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren't themselves doing much work.<br>


</blockquote><div><br></div><div>My point is that using static count of callsites as a indicator for this can be misleading. All the calls may be calls to cold external functions for instance.</div></div></blockquote><br>

</div></div><div>Absolutely true.  It may also be completely wrong for some functions.  It's a heuristic :)</div><div><br></div><font color="#888888"><div>-Chris</div><br></font></div></blockquote></div><br>