[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 24 21:05:33 PDT 2017

On Mon, Jul 24, 2017 at 8:31 PM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hey Quentin,
>  To clarify some of your concerns/questions:
> - Currently parameter/output passing are based around the simple cost=1,
> though this is something that will need to change a bit, as you have
> pointed out. As I said, the cost model isn't perfect.
> - The output that is turned into a return is either one that will remove
> an output parameter from the function call, or the last output in the
> instruction sequence.
> - As for being conservative with the cost, we prefer to overestimate the
> cost of outlining (the new function cost) and underestimate the cost of the
> instruction sequence when possible. This means that we tend to
> underestimate the benefit, given that we are working with approximate costs.
> - When I speak of regressions I am referring to the size of the text
> section.
> - The gist of identifying semantic equivalence is in the very first sentence
> of Candidate Selection in the initial RFC. We identify equivalence
> generally via type, opcode, operand types, and any special state. This is
> just the current definition; we can relax it even further if we want to.
> - Working at the IR level has more target-independent advantages than just
> reducing the number of new hooks. This can be seen in the fact that we
> don't require -mno-red-zone or any other ABI flag/check, we can easily
> parameterize sequences, generate outputs from sequences, etc.
> - By combining work I mean that adding advancements becomes much
> easier. If someone wants to add functionality for outlining regions,
> they don't have to worry about any target-specific patchups. It just works,
> for every target.

I also think it would be substantially more difficult to implement
parameterized outlining at the machine level. Do you have any numbers on how
much of the benefit comes from parameterization?

-- Sean Silva

> Both outliners require hooks to get better performance, but with the IR
> outliner these hooks are limited to the cost model and not the outlining
> process itself.
> Even though we disagree, I do appreciate the discussion and feedback!
>  Thanks,
> River Riddle
> On Mon, Jul 24, 2017 at 6:24 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>> Hi River,
>> Thanks for the detailed explanation.
>> If people are okay for you to move forward, like I said to Andrey, I
>> won’t oppose. I feel sad we have to split our effort on outlining
>> technology, but I certainly don’t pretend to know what is best!
>> The bottom line is if people are happy with that going in, the
>> conversation on the details can continue in parallel.
>> On Jul 24, 2017, at 4:56 PM, River Riddle <riddleriver at gmail.com> wrote:
>> Hey Quentin,
>>  Sorry I missed the last question. Currently it doesn't do continual
>> outlining, but it's definitely a possibility.
>> Ok, that means we probably won’t have the very bad runtime problems I had
>> in mind with adding a lot of indirections.
>> Thanks,
>> River Riddle
>> On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com>
>> wrote:
>>> Hi Quentin,
>>>   I understand your points and I believe that some meaning is being lost
>>> via email. For performance, it's true that cost isn't necessarily
>>> modeled; there is currently only support for using profile data to help
>>> mitigate that. I was working under the assumption, possibly incorrectly,
>>> that at Oz we favor small code over anything else including runtime
>>> performance. This is why we only run at Os if we have profile data, and
>>> even then we are very conservative about what we outline from.
>>>   I also understand that some target hooks may be required for certain
>>> things, it happens for other IR level passes as well. I just want to
>>> minimize that behavior as much as I can, though that's a personal choice.
>>> As for a motivating reason to push for this, it actually solves some of
>>> the problems that you brought up in the post for the original Machine
>>> Outliner RFC. If I can quote you:
>>> "So basically, better cost model. That's one part of the story.
>>> The other part is at the LLVM IR level or before register allocation
>>> identifying similar code sequence is much harder, at least with a suffix
>>> tree like algorithm. Basically the problem is how do we name our
>>> instructions such that we can match them.
>>> Let me take an example.
>>> foo() {
>>> /* bunch of code */
>>> a = add b, c;
>>> d = add e, f;
>>> }
>>> bar() {
>>> d = add e, g;
>>> f = add c, w;
>>> }
>>> With proper renaming, we can outline both adds in one function. The
>>> difficulty is to recognize that they are semantically equivalent to give
>>> them the same identifier in the suffix tree. I won’t get into the details
>>> but it gets tricky quickly. We were thinking of reusing GVN to have such
>>> identifier if we wanted to do the outlining at IR level but solving this
>>> problem is hard."
>>> This outliner will catch your case; it is actually one of the easiest
>>> cases for it to catch. The outliner can recognize semantically equivalent
>>> instructions and can be expanded to catch even more.
>> Interesting, could you explain how you do that?
>> I didn’t see it in the original post.
>>> As for the cost model it is quite simple:
>>>  - We identify all of the external inputs into the sequence. For
>>> estimating the benefit we constant fold and condense the inputs so that we
>>> can get the set of *unique* inputs into the sequence.
>> Ok, those are your parameters. How do you account for the cost of setting
>> up those parameters?
>>  - We also identify outputs from the sequence, instructions that have
>>> external references. We add the cost of storing/loading/passing output
>>> parameter to the outlined function.
>> Ok, those are your return values (or your output parameters). I see the
>> cost computation you do on those, but it still misses the general
>> parameter setup cost.
>>  - We identify one output to promote to a return value for the function.
>>> This can end up removing an output parameter that is passed to our outlined
>>> function.
>> How do you choose that one?
>>  - We estimate the cost of the sequence of instructions by mostly using
>>> TTI.
>> Sounds sensible. (Although I am not a fan of the whole TTI thing :)).
>>  - With the above we can estimate the cost per occurrence as well as the
>>> cost for the new function. Of course these are mostly estimates, we lean
>>> towards the conservative side to be safe.
>> Conservative in what sense? Put differently how do you know your cost is
>> conservative?
>>  - Finally we can compute an estimated benefit for the sequence taking
>>> into account benefit per occurrence and the estimated cost of the new
>>> function.
>> Make sense.
>>> There is definitely room for improvement in the cost model. I do not
>>> believe it's the best, but it has shown to be somewhat reliable for most
>>> cases. It has benefits and it has regressions, as does the machine outliner.
>> Regressions in what sense?
>> Do you actually have functions that are bigger?
>> To clarify, AFAIR, the machine outliner does not have such regressions
>> per se. The outliner does exactly what the cost model predicted:
>> functions are individually smaller. Code size growth may happen because of
>> padding in the object file (and getting in the way of some linker
>> optimizations).
>> The reasoning for adding the new framework is that I believe
>>> this transformation should exist at the IR level, not only because it is
>>> the simplest place to put it but also because it provides the greatest
>>> opportunities for improvement with the largest coverage.
>> Again I don’t get those arguments. Machine passes can be target
>> independent. Sure, they may require the targets to adopt new hooks, but
>> that’s about it, and I don’t see how this is different from having to
>> adopt hooks from TTI. I am guessing you can mostly achieve your goals with
>> the existing TTI hooks, except for the argument-passing ones, which is the
>> part I missed in your explanation :).
>> We can expand the framework to catch Danny's cases from the machine
>>> outliner RFC, we can add region outlining, etc. We can do all of this and
>>> make it available to every target automatically. The reason why I am for
>>> this being at the IR level is not because I want to split up the technical
>>> effort, but combine it.
>> I fail to see the combining part here :).
>> To give you a bit of context, I am afraid that, indeed, the implementation
>> is going to be simpler (partly because a lot of people are more familiar
>> with LLVM IR than MachineInstr) and for the most part TTI is going to be
>> good enough. However, when we want to go the extra mile and do crazy
>> stuff, we will, like with the vectorizer IMHO, hit that wall of TTI
>> abstractions that we are going to work around in various ways. I’d like,
>> if at all possible, to avoid repeating the history here.
>> Cheers,
>> -Quentin
>>> Thanks,
>>>   River Riddle
>>> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com>
>>> wrote:
>>>> Hi River,
>>>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com>
>>>> wrote:
>>>> Hi Quentin,
>>>>  I appreciate the feedback. When I reference the cost of target hooks,
>>>> I mainly mean maintainability and the burden on a target author. We want
>>>> to keep the intrusion into target information minimized. The heuristics
>>>> used for the outliner are the same used by any other IR-level pass
>>>> seeking target information, i.e., TTI for the most part. I can see where
>>>> you are coming from with "having heuristics solely focused on code size
>>>> do not seem realistic", but I don't agree with that statement.
>>>> If you only want code size, I agree it makes sense, but I believe, even
>>>> in Oz, we probably don’t want to slow the code down by a big factor for
>>>> a couple of bytes. That’s what I wanted to say, and what I wanted to
>>>> point out is that you need some kind of performance model to avoid those
>>>> worst cases. Unless we don’t care :).
>>>> I think there is a disconnect on heuristics. The only user-tunable
>>>> parameters are the lower-bound parameters (to the cost model); the actual
>>>> analysis (heuristic calculation) is based upon TTI information.
>>>> I don’t see how you can get around adding more hooks to know how a
>>>> specific function prototype is going to be lowered (e.g., i64 needs to be
>>>> split into two registers, fourth and onward parameters need to be pushed on
>>>> the stack and so on). Those change the code size benefit.
>>>>  When you say "Would still be interesting to see how well this could
>>>> perform on some exact model (i.e., at the Machine level), IMO." I am
>>>> slightly confused as to what you mean. I do not intend to try and implement
>>>> this algorithm at the MIR level given that it exists in Machine Outliner.
>>>> Of course, I don’t expect you to do that :). What I meant is that the
>>>> claim that IR offers the better trade off is not based on hard evidences. I
>>>> actually don’t buy it.
>>>> My point was to make sure I understand what you are trying to solve and
>>>> given you have mentioned the MachineOutliner, why you are not working on
>>>> improving it instead of suggesting a new framework.
>>>> Don’t take me wrong, maybe creating a new framework at the IR level is
>>>> the right thing to do, but I still didn’t get that from your comments.
>>>> There are several comparison benchmarks given in the "More detailed
>>>> performance data" section of the original RFC. It includes comparisons to
>>>> the Machine Outliner when possible (I can't build clang on Linux with the
>>>> Machine Outliner). I welcome any and all discussion on the placement of
>>>> the outliner in LLVM.
>>>> My fear with a new framework is that we are going to split the effort
>>>> for pushing the outliner technology forward and I’d like to avoid that if
>>>> at all possible.
>>>> Now, to be more concrete on your proposal, could you describe the cost
>>>> model for deciding what to outline? (Really the cost model, not the suffix
>>>> algo.)
>>>> Are outlined functions pushed into the list of candidates for further
>>>> outlining?
>>>> Cheers,
>>>> -Quentin
>>>>  Thanks,
>>>> River Riddle
>>>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com>
>>>> wrote:
>>>>> Hi River,
>>>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>> Hi Jessica,
>>>>>  The comparison to the inliner is an interesting one but we think it's
>>>>> important to note the difference in the use of heuristics. The inliner is
>>>>> juggling many different tasks at the same time, execution speed, code size,
>>>>> etc. which can cause the parameters to be very sensitive depending on the
>>>>> benchmark/platform/etc. The outliner's heuristics are focused solely on
>>>>> the potential code size savings from outlining, and are thus only
>>>>> sensitive to the current platform. This only creates a problem when we
>>>>> are overestimating the potential cost of a set of instructions for a
>>>>> particular target. The cost model parameters are only minimums: instruction sequence
>>>>> length, estimated benefit, occurrence amount. The heuristics themselves are
>>>>> conservative and based upon all of the target information available at the
>>>>> IR level, the parameters are just setting a lower bound to weed out any
>>>>> outliers. You are correct in that being at the machine level, before or
>>>>> after RA, will give the most accurate heuristics but we feel there's an
>>>>> advantage to being at the IR level. At the IR level we can do so many more
>>>>> things that are either too difficult or too complex for the machine level
>>>>> (e.g., parameterization, outputs, etc.). Not only can we do these things, but they are
>>>>> available on all targets immediately, without the need for target hooks.
>>>>> The caution on the use of heuristics is understandable, but there comes a
>>>>> point when trade-offs need to be made. We traded a loss in
>>>>> exact cost modeling for flexibility, coverage, and potential for
>>>>> further features. This is the same trade-off made for quite a few IR-level
>>>>> optimizations, including inlining. As for the worry about code size
>>>>> regressions, so far the results seem to support our hypothesis.
>>>>> Would still be interesting to see how well this could perform on some
>>>>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
>>>>> choosing an implementation because it is simpler might not be the right
>>>>> long term solution.
>>>>> At the very least, to know what trade-off we are making, having
>>>>> prototypes with the different approaches sounds sensible.
>>>>> In particular, all the heuristics about the cost of parameter passing
>>>>> (I haven’t checked how you did it) already sound complex enough and would
>>>>> require target hooks. Therefore, I am not seeing a clear win with an IR
>>>>> approach here.
>>>>> Finally, having heuristics solely focused on code size do not seem
>>>>> realistic. Indeed, I am guessing you have some thresholds to avoid
>>>>> outlining pieces of code so small that outlining would end up adding a
>>>>> whole lot of indirection, and I don’t like magic numbers in general :).
>>>>> To summarize, I wanted to point out that an IR approach is not as clear
>>>>> a win as you describe and would thus deserve more discussion.
>>>>> Cheers,
>>>>> -Quentin
>>>>>  Thanks,
>>>>> River Riddle
>>>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <
>>>>> jpaquette at apple.com> wrote:
>>>>>> Hi River,
>>>>>> I’m working on the MachineOutliner pass at the MIR level. Working at
>>>>>> the IR level sounds interesting! It also seems like our algorithms are
>>>>>> similar. I was thinking of taking the suffix array route with the
>>>>>> MachineOutliner in the future.
>>>>>> Anyway, I’d like to ask about this:
>>>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>> The downside to having this type of transformation at the IR level
>>>>>> is that the cost model is less accurate: we can
>>>>>> somewhat accurately model the cost per instruction, but we can’t get
>>>>>> information on how a window of instructions may lower. This can cause
>>>>>> regressions depending on the platform/codebase; therefore, to help
>>>>>> alleviate this, there are several tunable parameters for the cost model.
>>>>>> The inliner is threshold-based and it can be rather unpredictable how
>>>>>> it will impact the code size of a program. Do you have any idea as to how
>>>>>> heuristics/thresholds/parameters could be tuned to prevent this? In
>>>>>> my experience, making good code size decisions with these sorts of passes
>>>>>> requires a lot of knowledge about what instructions you’re dealing with
>>>>>> exactly. I’ve seen the inliner cause some pretty serious code size
>>>>>> regressions in projects due to small changes to the cost model/parameters
>>>>>> which cause improvements in other projects. I’m a little worried that an
>>>>>> IR-level outliner for code size would be doomed to a similar fate.
>>>>>> Perhaps it would be interesting to run this sort of pass pre-register
>>>>>> allocation? This would help pull you away from having to use heuristics,
>>>>>> while still giving you more opportunities for finding repeated instruction
>>>>>> sequences. I’ve thought of doing something like this in the future with
>>>>>> the MachineOutliner and seeing how it goes.
>>>>>> - Jessica
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev