[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Mon Jul 24 16:36:52 PDT 2017

Hi Quentin,
  I understand your points and I believe that some meaning is being lost
via email. For performance it's true that that cost isn't necessarily
modeled, there is currently only support for using profile data to avoid
mitigate that. I was working under the assumption, possibly incorrectly,
that at Oz we favor small code over anything else including runtime
performance. This is why we only run at Os if we have profile data, and
even then we are very conservative about what we outline from.
  I also understand that some target hooks may be required for certain
things, it happens for other IR level passes as well. I just want to
minimize that behavior as much as I can, though thats a personal choice.
As for a motivating reason to push for this, it actually solves some of the
problems that you brought up in the post for the original Machine Outliner
RFC. If I can quote you:

"So basically, better cost model. That's one part of the story.

The other part is at the LLVM IR level or before register allocation
identifying similar code sequence is much harder, at least with a suffix
tree like algorithm. Basically the problem is how do we name our
instructions such that we can match them.
Let me take an example.
foo() {
/* bunch of code */
a = add b, c;
d = add e, f;
}

bar() {
d = add e, g;
f = add c, w;
}

With proper renaming, we can outline both adds in one function. The
difficulty is to recognize that they are semantically equivalent to give
them the same identifier in the suffix tree. I won’t get into the details
but it gets tricky quickly. We were thinking of reusing GVN to have such
identifier if we wanted to do the outlining at IR level but solving this
problem is hard."

This outliner will catch your case, it is actually one of the easiest cases
for it to catch. The outliner can recognize semantically equivalent
instructions and can be expanded to catch even more.

As for the cost model it is quite simple:
 - We identify all of the external inputs into the sequence. For estimating
the benefit we constant fold and condense the inputs so that we can get the
set of *unique* inputs into the sequence.
 - We also identify outputs from the sequence, instructions that have
external references. We add the cost of storing/loading/passing output
parameter to the outlined function.
 - We identify one output to promote to a return value for the function.
This can end up reducing an output parameter that is passed to our outlined
function.
 - We estimate the cost of the sequence of instructions by mostly using TTI.
 - With the above we can estimate the cost per occurrence as well as the
cost for the new function. Of course these are mostly estimates, we lean
towards the conservative side to be safe.
 - Finally we can compute an estimated benefit for the sequence taking into
account benefit per occurrence and the estimated cost of the new function.

There is definitely room for improvement in the cost model. I do not
believe its the best but its shown to be somewhat reliable for most cases.
It has benefits and it has regressions, as does the machine outliner. The
reasoning for adding the new framework, is because I believe that this
transformation should exist at the IR level. Not only because it is the
simplest place to put it but also because it provides the greatest
opportunities for improvement with the largest coverage. We can expand the
framework to catch Danny's cases from the machine outliner RFC, we can add
region outlining, etc. We can do all of this and make it available to every
target automatically. The reason why I am for this being at the IR level is
not because I want to split up the technical effort, but combine it.

Thanks,
  River Riddle

On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com>
wrote:

> Hi River,
>
> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com> wrote:
>
> Hi Quentin,
>  I appreciate the feedback. When I reference the cost of Target Hooks it's
> mainly for maintainability and cost on a target author. We want to keep the
> intrusion into target information minimized. The heuristics used for the
> outliner are the same used by any other IR level pass seeking target
> information, i.e TTI for the most part. I can see where you are coming from
> with "having heuristics solely focused on code size do not seem
> realistic", but I don't agree with that statement.
>
>
> If you only want code size I agree it makes sense, but I believe, even in
> Oz, we probably don’t want to slow the code by a big factor for a couple
> bytes. That’s what I wanted to say and what I wanted to point out is that
> you need to have some kind of model for the performance to avoid those
> worst cases. Unless we don’t care :).
>
> I think there is a disconnect on heuristics. The only user tunable
> parameters are the lower bound parameters(to the cost model), the actual
> analysis(heuristic calculation) is based upon TTI information.
>
>
> I don’t see how you can get around adding more hooks to know how a
> specific function prototype is going to be lowered (e.g., i64 needs to be
> split into two registers, fourth and onward parameters need to be pushed on
> the stack and so on). Those change the code size benefit.
>
>  When you say "Would still be interesting to see how well this could
> perform on some exact model (i.e., at the Machine level), IMO." I am
> slightly confused as to what you mean. I do not intend to try and implement
> this algorithm at the MIR level given that it exists in Machine Outliner.
>
>
> Of course, I don’t expect you to do that :). What I meant is that the
> claim that IR offers the better trade off is not based on hard evidences. I
> actually don’t buy it.
>
> My point was to make sure I understand what you are trying to solve and
> given you have mentioned the MachineOutliner, why you are not working on
> improving it instead of suggesting a new framework.
> Don’t take me wrong, maybe creating a new framework at the IR level is the
> right thing to do, but I still didn’t get that from your comments.
>
> There are several comparison benchmarks given in the "More detailed
> performance data" of the original RFC. It includes comparisons to the
> Machine Outliner when possible(I can't build clang on Linux with Machine
> Outliner). I welcome any and all discussion on the placement of the
> outliner in LLVM.
>
>
> My fear with a new framework is that we are going to split the effort for
> pushing the outliner technology forward and I’d like to avoid that if at
> all possible.
>
> Now, to be more concrete on your proposal, could you describe the cost
> model for deciding what to outline? (Really the cost model, not the suffix
> algo.)
> Are outlined functions pushed into the list candidates for further
> outlining?
>
> Cheers,
> -Quentin
>
>  Thanks,
> River Riddle
>
> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>> Hi River,
>>
>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi Jessica,
>>  The comparison to the inliner is an interesting one but we think it's
>> important to note the difference in the use of heuristics. The inliner is
>> juggling many different tasks at the same time, execution speed, code size,
>> etc. which can cause the parameters to be very sensitive depending on the
>> benchmark/platform/etc. The outliners heuristics are focused solely on the
>> potential code size savings from outlining, and is thus only sensitive to
>> the current platform. This only creates a problem when we are over
>> estimating the potential cost of a set of instructions for a particular
>> target. The cost model parameters are only minimums: instruction sequence
>> length, estimated benefit, occurrence amount. The heuristics themselves are
>> conservative and based upon all of the target information available at the
>> IR level, the parameters are just setting a lower bound to weed out any
>> outliers. You are correct in that being at the machine level, before or
>> after RA, will give the most accurate heuristics but we feel there's an
>> advantage to being at the IR level. At the IR level we can do so many more
>> things that are either too difficult/complex for the machine level(e.g
>> parameterization/outputs/etc). Not only can we do these things but they are
>> available on all targets immediately, without the need for target hooks.
>> The caution on the use of heuristics is understandable, but there comes a
>> point when trade offs need to be made. We made the trade off for a loss in
>> exact cost modeling to gain flexibility, coverage, and potential for
>> further features. This trade off is the same made for quite a few IR level
>> optimizations, including inlining. As for the worry about code size
>> regressions, so far the results seem to support our hypothesis.
>>
>>
>> Would still be interesting to see how well this could perform on some
>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
>> choosing an implementation because it is simpler might not be the right
>> long term solution.
>> At the very least, to know what trade-off we are making, having
>> prototypes with the different approaches sounds sensible.
>> In particular, all the heuristics about cost for parameter passing
>> (haven’t checked how you did it) sounds already complex enough and would
>> require target hooks. Therefore, I am not seeing a clear win with an IR
>> approach here.
>>
>> Finally, having heuristics solely focused on code size do not seem
>> realistic. Indeed, I am guessing you have some thresholds to avoid
>> outlining some piece of code too small that would end up adding a whole lot
>> of indirections and I don’t like magic numbers in general :).
>>
>> To summarize, I wanted to point out that an IR approach is not as a clear
>> win as you describe and would thus deserve more discussion.
>>
>> Cheers,
>> -Quentin
>>
>>  Thanks,
>> River Riddle
>>
>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com>
>> wrote:
>>
>>>
>>> Hi River,
>>>
>>> I’m working on the MachineOutliner pass at the MIR level. Working at the
>>> IR level sounds interesting! It also seems like our algorithms are similar.
>>> I was thinking of taking the suffix array route with the MachineOutliner in
>>> the future.
>>>
>>> Anyway, I’d like to ask about this:
>>>
>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> The downside to having this type of transformation be at the IR level is
>>> it means there will be less accuracy in the cost model -  we can somewhat
>>> accurately model the cost per instruction but we can’t get information on
>>> how a window of instructions may lower. This can cause regressions
>>> depending on the platform/codebase, therefore to help alleviate this there
>>> are several tunable parameters for the cost model.
>>>
>>>
>>> The inliner is threshold-based and it can be rather unpredictable how it
>>> will impact the code size of a program. Do you have any idea as to how
>>> heuristics/thresholds/parameters could be tuned to prevent this? In my
>>> experience, making good code size decisions with these sorts of passes
>>> requires a lot of knowledge about what instructions you’re dealing with
>>> exactly. I’ve seen the inliner cause some pretty serious code size
>>> regressions in projects due to small changes to the cost model/parameters
>>> which cause improvements in other projects. I’m a little worried that an
>>> IR-level outliner for code size would be doomed to a similar fate.
>>>
>>> Perhaps it would be interesting to run this sort of pass pre-register
>>> allocation? This would help pull you away from having to use heuristics,
>>> but give you some more opportunities for finding repeated instruction
>>> sequences. I’ve thought of doing something like this in the future with the
>>> MachineOutliner and seeing how it goes.
>>>
>>> - Jessica
>>>
>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/1675bb23/attachment.html>