[PATCH] D18560: [TTI] Add getInliningThresholdMultiplier.

Chandler Carruth via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 11 09:37:33 PDT 2016


chandlerc added a comment.

In http://reviews.llvm.org/D18560#389927, @jlebar wrote:

> > If you'd like a target-specific cost adjustment, which seems perfectly reasonable, it should be additive,
>
> One of the other reasons I think a multiplicative setting makes sense is that there's no way to change the fudge factor.  All you can change is the base inlining threshold.  So the question is, as a user changing the inlining threshold, what do I expect to happen on a platform that has an inlining threshold adjustment?
>
> Suppose I set the (global, un-fudged) inlining threshold (again, the only thing I can change) to something very small.  It seems to me that should imply that relatively little inlining occurs.  But of course an additive fudge doesn't accomplish that.  And in fact there's no good way for me to set the post-fudge inlining threshold to a small value -- I'd have to set it to a negative number, and I'd have to calculate that specific value by looking at the TTI.
>
> In contrast, if the fudge is multiplicative, then we have the property that a small pre-fudge inlining threshold results in a small post-fudge threshold.
>
> If a multiplicative fudge is off the table, I think I'd feel more comfortable with the TTI giving an absolute inlining threshold that can be overridden on the command line.  Is that any more appealing to you, Hal?


I ended up chatting with Hal about this, and he made a really good point. I had been thinking that it is brittle to have the target-provided inlining threshold be an absolute number instead of a multiplier / ratio as you have it.

However, Hal pointed out that the multiplier creates a coupling that could also be problematic. Consider an out-of-tree target with a carefully tuned inlining threshold multiplier. If lots of targets do this, changing the base threshold could become extremely problematic because even small changes would disturb the target-specific tunings. His suggestion was to use an absolute threshold from the target in the absence of an explicitly specified command line flag.

Thinking about this more, I think it still presents the same problem. Consider a change to the inliner that significantly changes the rate at which we inline things. It might be useful to be able to adjust the threshold when making the change to keep most inlining decisions neutral across targets.

I'm not completely sure that either of these approaches is really resistant to undue coupling...

One possible alternative is to not make the size-based inlining target-configurable at all, and to handle this exclusively through the proposed runtime cost estimation based inlining when that becomes available. Perhaps this could be documented as a temporary hook that will be replaced with the runtime estimation? I'm curious what others think here.

> > Also, hopefully the rationale for these numbers can be clearly articulated in comments in the TTI implementation, so we know how to adjust them if we change how TTI.getUserCost works.


> Haha, just like the current cross-arch threshold is so clearly motivated?


I'm going to write up a document describing the current system since this keeps coming up and causing confusion. I'll also try to sketch how other models would fit into this.

> I don't have a good rationale other than that this number worked well for many internal benchmarks at Google.  It's absolutely not scientifically arrived at; it's just a starting place.


> Clearly we want more inlining on nvptx than on other archs, and, anecdotally, I tried a few benchmarks that showed 10% gains from this change.  But ptxas does its own inlining, and it's a black box, so this is all very mushy.  And more importantly, speed is only half the story -- inlining big functions is a size/speed tradeoff, and that's harder to quantify.  Like, GPUs tend to have relatively little code to begin with, so maybe we have more wiggle room on the size end?  But how much -- 1.5x, 5x, 100x?  Who knows.


GPUs tend not to have significant icache-style bottlenecks because their kernels are typically small and easily predicted. Still, there remains a fundamental limit on the instruction working set size beyond which inlining stops being profitable.

I think you'll want to do some "science" on this value eventually and document it. Typically we walk the value up and down and try to get an idea of the shape of the aggregate performance curve across a wide range of benchmarks. Then we look for what is usually a large flat mesa in the curve, and go with the small end of that.

-Chandler


http://reviews.llvm.org/D18560




