<br><br><div class="gmail_quote">On Sun, Nov 28, 2010 at 2:37 PM, Chris Lattner <span dir="ltr"><<a href="mailto:clattner@apple.com">clattner@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On Nov 23, 2010, at 5:07 PM, Xinliang David Li wrote:<br>
> Hi, I browsed the LLVM inliner implementation, and it seems there is room for improvement. (I have not read it too carefully, so correct me if what I observed is wrong).<br>
><br>
> First the good side of the inliner -- the function level summary and inline cost estimation is more elaborate and complete than gcc. For instance, it considers callsite arguments and the effects of optimization enabled by inlining.<br>
<br>
</div>Yep, as others pointed out, this is intended to interact closely with the per-function optimizations that get mixed in due to the inliner being a callgraphscc pass. This is actually a really important property of the inliner. If you have a function foo that calls a leaf function bar, the sequence of optimization is:<br>
<br>
1. Run the inliner on bar (noop, since it has no call sites)<br>
2. Run the per-function passes on bar. This generally shrinks it, and prevents "abstraction penalty" from making bar look too big to inline.<br>
3. Run the inliner on foo. Since foo calls bar, we consider inlining bar into foo and do so if profitable.<br>
4. Run the per-function passes on foo. If bar got inlined, this means that we're running the per-function passes over the inlined contents of bar again.<br></blockquote><div><br></div><div>On-the-fly clean up (optimization) while doing bottom up inlining is nice as you described. Many other compilers chose not to do this way due to scalability concerns (with IPO) -- this can make the IPO the biggest bottom neck in terms of compile time (as it is serialized). Memory many not be a big issue for LLVM as I can see the good locality in pass manager. (Just curious, what is biggest application LLVM can build with IPO?)</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
In a traditional optimizer like GCC's, you end up with problems where you have to set a high inline threshold due to inlining-before-optimizing causing "abstraction penalty problems". An early inliner is a hack that tries to address this. </blockquote>
<div><br></div><div>It is a hack in some sense (but a common practice) -- but enables other flexibilities. </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Another problem with this approach from the compile time perspective is that you end up repeating work multiple times. For example, if there is a common subexpression in a small function, you end up inlining it into many places, then having to eliminate the common subexpression in each copy.<br>
</blockquote><div><br></div><div>Early inlining + scalar opt can do the same, right?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
The LLVM inliner avoids these problems, but (as you point out) this really does force it to being a bottom-up inliner. This means that the bottom-up inliner needs to make decisions in strange ways in some cases: for example if qux called foo, and foo were static, then (when processing foo) we may decide not to inline bar into foo because it would be more profitable to inline foo into qux.<br>
<div class="im"><br>
> Now more to the weakness of the inliner:<br>
><br>
> 1) It is bottom up. The inlining is not done in the order based on the priority of the callsites. It may leave important callsites (at top of the cg) unlined due to higher cost after inline cost update. It also eliminates the possibility of inline specialization. To change this, the inliner pass may not use the pass manager infrastructure . (I noticed a hack in the inliner to workaround the problem -- for static functions avoid inlining its callees if it causes it to become too big ..)<br>
<br>
</div>This is true, but I don't think it's a really large problem in practice. We don't have a "global inline threshold limit" (which I've never understood, except as a hack to prevent run-away inlining) so not visiting in priority order shouldn't prevent high-priority-but-processed-late candidates from being inlined.<br>
</blockquote><div><br></div><div>global threshold can be used to control the unnecessary size growth. In some cases, the size increase may also cause increase in icache footprint leading to poor performance. In fact, with IPO/CMO, icache footprint can be modeled in some way and be used as one kind of global limit. </div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
The only potential issue I'm aware of is if we have A->B->C and we decide to inline C into B when it would be more profitable to inline B into A and leave C out of line. This can be handled with a heuristic like the one above.<br>
<div class="im"><br>
> 2) There seems to be only one inliner pass. For calls to small functions, it is better to perform early inlining as one of the local (per function) optimizations followed by scalar opt clean up. This will sharpen the summary information. (Note the inline summary update does not consider the possible cleanup)<br>
<br>
</div>Yep. This is a feature :)<br>
<div class="im"><br>
> 3) recursive inlining is not supported<br>
<br>
</div>This is a policy decision. It's not clear whether it is really a good idea, though I have seen some bugzilla or something about it. I agree that it should be revisited.<br>
<div class="im"><br>
> 4) function with indirect branch is not inlined. What source construct does indirect branch instr correspond to ? variable jump?<br>
<br>
</div>See:<br>
<a href="http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html" target="_blank">http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html</a><br>
<br>
for more details.<br>
<div class="im"><br>
> 6) There is one heuristc used in inline-cost computation seems wrong:<br>
><br>
> // Calls usually take a long time, so they make the inlining gain smaller.<br>
> InlineCost += CalleeFI->Metrics.NumCalls * InlineConstants::CallPenalty;<br>
><br>
> Does it try to block inlining of callees with lots of calls? Note inlining such a function only increase static call counts.<br>
<br>
</div>I think that this is a heuristic that Jakob came up with, but I think it's a good one, also discussed elsewhere on the thread.<br>
<br>
When talking about inlining and tuning thresholds and heuristics, it is a good idea to quantify what the expected or possible wins of inlining a function are. Some of the ones I'm aware of:<br>
<br>
1. In some cases, inlining shrinks code.<br>
<br>
2. Inlining a function exposes optimization opportunities on the inlined code, because constant propagation and other simplifications can take place.<br>
<br>
3. Inlining a function exposes optimizations in the caller because address-taken values can be promoted to registers.<br>
<br>
4. Inlining a function can improve optimization in a caller because interprocedural side-effect analysis isn't needed. For example, load/call dependence may not be precise. This is something we should continue to improve in the optimizer though.<br>
<br>
5. Inlining code with indirect call sites and switches can improve branch prediction if some callers of the function are biased differently than other callers. This is pretty hard to predict without profile info though.<br>
<br></blockquote><div><br></div><div>Besides -- 1) reducing call overhead; 2) scheduling freedom; 3) enabling optimizations across inline instances of callee(s); 4) sharpening local analysis (mainly aliasing) results -- such as points to, malloc etc.</div>
<div><br></div><div>It may also lose aliasing assertion (such as restrict aliasing) if not done properly.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
The "punish functions containing lots of calls" is based on the assumption that functions which are mostly calls (again, this decision happens after the callee has been inlined and simplified) aren't themselves doing much work.<br>
</blockquote><div><br></div><div>My point is that using static count of callsites as a indicator for this can be misleading. All the calls may be calls to cold external functions for instance.</div><div><br></div><div>Thanks,</div>
<div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<font color="#888888"><br>
-Chris<br>
<br>
</font></blockquote></div><br>