[PATCH] D18560: [TTI] Add getInliningThresholdMultiplier.

Hal Finkel via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 14 11:22:10 PDT 2016


hfinkel added a comment.

In http://reviews.llvm.org/D18560#397232, @jlebar wrote:

> > Consider an out-of-tree target with a carefully tuned inlining threshold multiplier. If lots of targets do this, changing the threshold could become extremely problematic because small changes would still disturb the target-specific tunings.
>
>
> My thought was, the inlining multiplier is such a coarse knob -- set atop rather coarse heuristics -- that tuning it particularly carefully for a target might not make much sense.
>
> I don't know if that's true or not, but it's at least an argument one could make.
>
> > His suggestion was to use an absolute threshold from the target in the absence of an explicitly specified command line flag.  Thinking about this more, I think it still presents the same problem. Consider a change to the inliner that significantly changes the rate at which we inline things. It might be useful to be able to adjust the threshold when making the change to keep most inlining decisions neutral across targets.
>
>
> I guess this (and the previous one, to an extent) is a question of API stability guarantees.  To the extent that we don't promise to have a stable API for out-of-tree targets, we could change the inliner and at the same time break compilation for any out-of-tree targets that set the threshold, right?  They'll have to fix the error and recompute their inlining threshold.
>
> That doesn't seem so unreasonable to me -- it's not like we'd be doing this once a month, or even every release.  And as an out-of-tree target, you'd only be broken if you explicitly opted in to this tuning mechanism, which we could make clear comes with this possibility of future intentional breakage.


This is a concern, and you're right that any of these tuning knobs may need to be retuned in response to upstream changes. Fundamentally, I'm most concerned with sending the right message here, and with not unnecessarily constraining our future actions (because while we're certainly free to change whatever we want in the future, regardless of backend tuning, in practice we try to be nice and accommodating, because these things take time).

The message we want to send is:

1. This is a temporary hook, and while some mechanism to achieve this effect will be provided in the future, it might look nothing like this.
2. This is a coarse heuristic mechanism, not really backed by any self-consistent model.

After chatting with Chandler about this, my understanding of the current mechanism is this: Our current heuristic is size-based (with some exceptions), and aims to answer the following question: Will inlining this function make the caller smaller?

Doing this involves two pieces. The first is constructing an estimate of the size of the inlined function (which is truncated at some point to keep the cost of the analysis from becoming quadratic), accounting for constant propagation via function arguments and resulting simplifications. Then we construct an estimate of the caller's size associated with the call site that can be eliminated by inlining (the call itself, allocas that can be eliminated, etc.).

The second piece is more empirical. How many instructions, on average, will be eliminated in the caller by the simplifications that result from inlining a function? This is where the current threshold comes in: it is a (rough) estimate of this. Unfortunately, as currently used, it is not purely an estimate of this, but also accounts for performance effects (which is why, for instance, we have a bonus for callees with vector instructions, etc.).

Longer term, we really need to separate the performance-based metric from the size-based metric. They're estimating different things, although we're unfortunately conflating them currently (by hiding our performance-based metric inside our estimate of size decrease from resulting caller simplification, and to a lesser extent, things like the call penalty).

For your use case, GPUs, you want to represent the fact that calls are really expensive, and you don't care nearly as much about code size, although you do care about compile time. As a result, your threshold exists more to bound the compile time of later stages of the compiler (and perhaps the inliner itself) than anything else. That having been said, NVIDIA GPUs do have an icache that holds some kilobytes of compiled code, and spilling out of it will cost you (although often not a huge amount).

In the long term, I imagine we want a more top-down heuristic for this kind of inlining, because we want to keep a function, or group of functions, below some absolute size threshold (unless there is a huge compensating performance speedup). The size-based heuristic is also good for your use case, but only if it is applied in a purer form.

So my conclusion is that all of this is wrong ;) -- but in the name of making progress, I'm fine with a way to bump the threshold for some targets. I think it should be just a number, because that's the simplest possible thing we can do. An absolute number is difficult, however, because it cannot directly account for -Os and the other factors that go into the current threshold. Providing a fraction to be multiplied by the current threshold seems problematic because it sends the message that this is something that should be finely tuned, and it shouldn't be. Could this just be an integer multiplier for now? That will prevent fine tuning, send the message that this is a temporary coarse knob, and then we can work on a more principled solution. If that would work for you, I think that is the best solution for now.

> > I think you'll want to do some "science" on this value eventually and document it.
>
> Yes, absolutely.  That change doesn't even appear in this patch, though; let's worry about it separately, if that's OK?



http://reviews.llvm.org/D18560




