[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Mon Jul 24 16:56:57 PDT 2017

Hey Quentin,
 Sorry I missed the last question. Currently it doesn't do continual
outlining, but it's definitely a possibility.
Thanks,
River Riddle

On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com> wrote:

> Hi Quentin,
>   I understand your points and I believe that some meaning is being lost
> via email. For performance it's true that that cost isn't necessarily
> modeled, there is currently only support for using profile data to avoid
> mitigate that. I was working under the assumption, possibly incorrectly,
> that at Oz we favor small code over anything else including runtime
> performance. This is why we only run at Os if we have profile data, and
> even then we are very conservative about what we outline from.
>   I also understand that some target hooks may be required for certain
> things, it happens for other IR level passes as well. I just want to
> minimize that behavior as much as I can, though thats a personal choice.
> As for a motivating reason to push for this, it actually solves some of
> the problems that you brought up in the post for the original Machine
> Outliner RFC. If I can quote you:
>
> "So basically, better cost model. That's one part of the story.
>
> The other part is at the LLVM IR level or before register allocation
> identifying similar code sequence is much harder, at least with a suffix
> tree like algorithm. Basically the problem is how do we name our
> instructions such that we can match them.
> Let me take an example.
> foo() {
> /* bunch of code */
> a = add b, c;
> d = add e, f;
> }
>
> bar() {
> d = add e, g;
> f = add c, w;
> }
>
> With proper renaming, we can outline both adds in one function. The
> difficulty is to recognize that they are semantically equivalent to give
> them the same identifier in the suffix tree. I won’t get into the details
> but it gets tricky quickly. We were thinking of reusing GVN to have such
> identifier if we wanted to do the outlining at IR level but solving this
> problem is hard."
>
> This outliner will catch your case, it is actually one of the easiest
> cases for it to catch. The outliner can recognize semantically equivalent
> instructions and can be expanded to catch even more.
>
> As for the cost model it is quite simple:
>  - We identify all of the external inputs into the sequence. For
> estimating the benefit we constant fold and condense the inputs so that we
> can get the set of *unique* inputs into the sequence.
>  - We also identify outputs from the sequence, instructions that have
> external references. We add the cost of storing/loading/passing output
> parameter to the outlined function.
>  - We identify one output to promote to a return value for the function.
> This can end up reducing an output parameter that is passed to our outlined
> function.
>  - We estimate the cost of the sequence of instructions by mostly using
> TTI.
>  - With the above we can estimate the cost per occurrence as well as the
> cost for the new function. Of course these are mostly estimates, we lean
> towards the conservative side to be safe.
>  - Finally we can compute an estimated benefit for the sequence taking
> into account benefit per occurrence and the estimated cost of the new
> function.
>
> There is definitely room for improvement in the cost model. I do not
> believe its the best but its shown to be somewhat reliable for most cases.
> It has benefits and it has regressions, as does the machine outliner. The
> reasoning for adding the new framework, is because I believe that this
> transformation should exist at the IR level. Not only because it is the
> simplest place to put it but also because it provides the greatest
> opportunities for improvement with the largest coverage. We can expand the
> framework to catch Danny's cases from the machine outliner RFC, we can add
> region outlining, etc. We can do all of this and make it available to every
> target automatically. The reason why I am for this being at the IR level is
> not because I want to split up the technical effort, but combine it.
>
> Thanks,
>   River Riddle
>
>
> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>> Hi River,
>>
>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com> wrote:
>>
>> Hi Quentin,
>>  I appreciate the feedback. When I reference the cost of Target Hooks
>> it's mainly for maintainability and cost on a target author. We want to
>> keep the intrusion into target information minimized. The heuristics used
>> for the outliner are the same used by any other IR level pass seeking
>> target information, i.e TTI for the most part. I can see where you are
>> coming from with "having heuristics solely focused on code size do not
>> seem realistic", but I don't agree with that statement.
>>
>>
>> If you only want code size I agree it makes sense, but I believe, even in
>> Oz, we probably don’t want to slow the code by a big factor for a couple
>> bytes. That’s what I wanted to say and what I wanted to point out is that
>> you need to have some kind of model for the performance to avoid those
>> worst cases. Unless we don’t care :).
>>
>> I think there is a disconnect on heuristics. The only user tunable
>> parameters are the lower bound parameters(to the cost model), the actual
>> analysis(heuristic calculation) is based upon TTI information.
>>
>>
>> I don’t see how you can get around adding more hooks to know how a
>> specific function prototype is going to be lowered (e.g., i64 needs to be
>> split into two registers, fourth and onward parameters need to be pushed on
>> the stack and so on). Those change the code size benefit.
>>
>>  When you say "Would still be interesting to see how well this could
>> perform on some exact model (i.e., at the Machine level), IMO." I am
>> slightly confused as to what you mean. I do not intend to try and implement
>> this algorithm at the MIR level given that it exists in Machine Outliner.
>>
>>
>> Of course, I don’t expect you to do that :). What I meant is that the
>> claim that IR offers the better trade off is not based on hard evidences. I
>> actually don’t buy it.
>>
>> My point was to make sure I understand what you are trying to solve and
>> given you have mentioned the MachineOutliner, why you are not working on
>> improving it instead of suggesting a new framework.
>> Don’t take me wrong, maybe creating a new framework at the IR level is
>> the right thing to do, but I still didn’t get that from your comments.
>>
>> There are several comparison benchmarks given in the "More detailed
>> performance data" of the original RFC. It includes comparisons to the
>> Machine Outliner when possible(I can't build clang on Linux with Machine
>> Outliner). I welcome any and all discussion on the placement of the
>> outliner in LLVM.
>>
>>
>> My fear with a new framework is that we are going to split the effort for
>> pushing the outliner technology forward and I’d like to avoid that if at
>> all possible.
>>
>> Now, to be more concrete on your proposal, could you describe the cost
>> model for deciding what to outline? (Really the cost model, not the suffix
>> algo.)
>> Are outlined functions pushed into the list candidates for further
>> outlining?
>>
>> Cheers,
>> -Quentin
>>
>>  Thanks,
>> River Riddle
>>
>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com>
>> wrote:
>>
>>> Hi River,
>>>
>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi Jessica,
>>>  The comparison to the inliner is an interesting one but we think it's
>>> important to note the difference in the use of heuristics. The inliner is
>>> juggling many different tasks at the same time, execution speed, code size,
>>> etc. which can cause the parameters to be very sensitive depending on the
>>> benchmark/platform/etc. The outliners heuristics are focused solely on the
>>> potential code size savings from outlining, and is thus only sensitive to
>>> the current platform. This only creates a problem when we are over
>>> estimating the potential cost of a set of instructions for a particular
>>> target. The cost model parameters are only minimums: instruction sequence
>>> length, estimated benefit, occurrence amount. The heuristics themselves are
>>> conservative and based upon all of the target information available at the
>>> IR level, the parameters are just setting a lower bound to weed out any
>>> outliers. You are correct in that being at the machine level, before or
>>> after RA, will give the most accurate heuristics but we feel there's an
>>> advantage to being at the IR level. At the IR level we can do so many more
>>> things that are either too difficult/complex for the machine level(e.g
>>> parameterization/outputs/etc). Not only can we do these things but they are
>>> available on all targets immediately, without the need for target hooks.
>>> The caution on the use of heuristics is understandable, but there comes a
>>> point when trade offs need to be made. We made the trade off for a loss in
>>> exact cost modeling to gain flexibility, coverage, and potential for
>>> further features. This trade off is the same made for quite a few IR level
>>> optimizations, including inlining. As for the worry about code size
>>> regressions, so far the results seem to support our hypothesis.
>>>
>>>
>>> Would still be interesting to see how well this could perform on some
>>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
>>> choosing an implementation because it is simpler might not be the right
>>> long term solution.
>>> At the very least, to know what trade-off we are making, having
>>> prototypes with the different approaches sounds sensible.
>>> In particular, all the heuristics about cost for parameter passing
>>> (haven’t checked how you did it) sounds already complex enough and would
>>> require target hooks. Therefore, I am not seeing a clear win with an IR
>>> approach here.
>>>
>>> Finally, having heuristics solely focused on code size do not seem
>>> realistic. Indeed, I am guessing you have some thresholds to avoid
>>> outlining some piece of code too small that would end up adding a whole lot
>>> of indirections and I don’t like magic numbers in general :).
>>>
>>> To summarize, I wanted to point out that an IR approach is not as a
>>> clear win as you describe and would thus deserve more discussion.
>>>
>>> Cheers,
>>> -Quentin
>>>
>>>  Thanks,
>>> River Riddle
>>>
>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com>
>>> wrote:
>>>
>>>>
>>>> Hi River,
>>>>
>>>> I’m working on the MachineOutliner pass at the MIR level. Working at
>>>> the IR level sounds interesting! It also seems like our algorithms are
>>>> similar. I was thinking of taking the suffix array route with the
>>>> MachineOutliner in the future.
>>>>
>>>> Anyway, I’d like to ask about this:
>>>>
>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> The downside to having this type of transformation be at the IR level
>>>> is it means there will be less accuracy in the cost model -  we can
>>>> somewhat accurately model the cost per instruction but we can’t get
>>>> information on how a window of instructions may lower. This can cause
>>>> regressions depending on the platform/codebase, therefore to help alleviate
>>>> this there are several tunable parameters for the cost model.
>>>>
>>>>
>>>> The inliner is threshold-based and it can be rather unpredictable how
>>>> it will impact the code size of a program. Do you have any idea as to how
>>>> heuristics/thresholds/parameters could be tuned to prevent this? In my
>>>> experience, making good code size decisions with these sorts of passes
>>>> requires a lot of knowledge about what instructions you’re dealing with
>>>> exactly. I’ve seen the inliner cause some pretty serious code size
>>>> regressions in projects due to small changes to the cost model/parameters
>>>> which cause improvements in other projects. I’m a little worried that an
>>>> IR-level outliner for code size would be doomed to a similar fate.
>>>>
>>>> Perhaps it would be interesting to run this sort of pass pre-register
>>>> allocation? This would help pull you away from having to use heuristics,
>>>> but give you some more opportunities for finding repeated instruction
>>>> sequences. I’ve thought of doing something like this in the future with the
>>>> MachineOutliner and seeing how it goes.
>>>>
>>>> - Jessica
>>>>
>>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/b1814064/attachment.html>