[PATCH] D18560: [TTI] Add getInliningThresholdMultiplier.

Justin Lebar via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 1 14:43:42 PDT 2016


jlebar added a comment.

> If you'd like a target-specific cost adjustment, which seems perfectly reasonable, it should be additive,


One of the other reasons I think a multiplicative setting makes sense is that there's no way to change the fudge factor; all you can change is the base inlining threshold.  So the question is: as a user changing the inlining threshold, what do I expect to happen on a platform that has an inlining threshold adjustment?

Suppose I set the (global, un-fudged) inlining threshold (again, the only thing I can change) to something very small.  It seems to me that should imply that relatively little inlining occurs.  But of course an additive fudge doesn't accomplish that.  And in fact there's no good way for me to set the post-fudge inlining threshold to a small value -- I'd have to set it to a negative number, and I'd have to calculate that specific value by looking at the TTI.

In contrast, if the fudge is multiplicative, then we have the property that a small pre-fudge inlining threshold results in a small post-fudge threshold.
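To make the arithmetic concrete, here's a toy sketch in C++.  All the numbers (base 225, bonus 900, multiplier 5) are made up for illustration; they're not the values in this patch.

  // Toy comparison of an additive vs. a multiplicative fudge; the
  // numbers are illustrative only, not taken from the patch.
  #include <cstdio>

  static unsigned additive(unsigned Base) { return Base + 900; }
  static unsigned multiplicative(unsigned Base) { return Base * 5; }

  int main() {
    for (unsigned Base : {225u, 5u})
      std::printf("base=%u additive=%u multiplicative=%u\n", Base,
                  additive(Base), multiplicative(Base));
    // Prints:
    //   base=225 additive=1125 multiplicative=1125
    //   base=5 additive=905 multiplicative=25
  }

Dialing the base down from 225 to 5 barely moves the additive result, while the multiplicative result tracks the user's intent.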

If a multiplicative fudge is off the table, I think I'd feel more comfortable with the TTI giving an absolute inlining threshold that can be overridden on the command line.  Is that any more appealing to you, Hal?
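Roughly what I'm picturing for that alternative -- the hook name here is invented for the sketch, and this isn't part of the current patch:

  #include <optional>

  struct ExampleTTI {
    // Hypothetical hook: the target supplies an absolute threshold.
    int getInliningThreshold() const { return 1125; } // made-up value
  };

  int effectiveThreshold(const ExampleTTI &TTI,
                         std::optional<int> CmdLineThreshold) {
    // An explicit -inline-threshold on the command line overrides the
    // target's default outright.
    return CmdLineThreshold ? *CmdLineThreshold : TTI.getInliningThreshold();
  }

That keeps the command line authoritative while still letting each target pick a sensible default.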

> should be a function of the call site (i.e. both the caller and the callee, not just the caller).


Sure, I can make that change once we agree on the rest here.

> Also, hopefully the rationale for these numbers can be clearly articulated in comments in the TTI implementation, so we know how to adjust them if we change how TTI.getUserCost works.


Haha, just like the current cross-arch threshold is so clearly motivated?  :-p  I don't have a good rationale other than that this number worked well for many internal benchmarks at Google.  It's absolutely not scientifically arrived at; it's just a starting place.

Clearly we want more inlining on nvptx than on other archs, and, anecdotally, I tried a few benchmarks that showed 10% gains from this change.  But ptxas does its own inlining, and it's a black box, so this is all very mushy.  And more importantly, speed is only half the story -- inlining big functions is a size/speed tradeoff, and that's harder to quantify.  Like, GPUs tend to have relatively little code to begin with, so maybe we have more wiggle room on the size end?  But how much -- 1.5x, 5x, 100x?  Who knows.


http://reviews.llvm.org/D18560

