[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Mon Jul 24 21:02:55 PDT 2017

On Mon, Jul 24, 2017 at 6:24 PM, Quentin Colombet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi River,
>
> Thanks for the detailed explanation.
>
> If people are okay for you to move forward, like I said to Andrey, I won’t
> oppose. I feel sad we have to split our effort on outlining technology, but
> I certainly don’t pretend to know what is best!
> The bottom line is if people are happy with that going in, the
> conversation on the details can continue in parallel.
>
> On Jul 24, 2017, at 4:56 PM, River Riddle <riddleriver at gmail.com> wrote:
>
> Hey Quentin,
>  Sorry I missed the last question. Currently it doesn't do continual
> outlining, but it's definitely a possibility.
>
>
> Ok, that means we probably won’t have the very bad runtime problems I had
> in mind with adding a lot of indirections.
>
> Thanks,
> River Riddle
>
> On Mon, Jul 24, 2017 at 4:36 PM, River Riddle <riddleriver at gmail.com>
> wrote:
>
>> Hi Quentin,
>>   I understand your points and I believe that some meaning is being lost
>> via email. For performance it's true that that cost isn't necessarily
>> modeled, there is currently only support for using profile data to avoid
>> mitigate that. I was working under the assumption, possibly incorrectly,
>> that at Oz we favor small code over anything else including runtime
>> performance. This is why we only run at Os if we have profile data, and
>> even then we are very conservative about what we outline from.
>>   I also understand that some target hooks may be required for certain
>> things, it happens for other IR level passes as well. I just want to
>> minimize that behavior as much as I can, though thats a personal choice.
>> As for a motivating reason to push for this, it actually solves some of
>> the problems that you brought up in the post for the original Machine
>> Outliner RFC. If I can quote you:
>>
>> "So basically, better cost model. That's one part of the story.
>>
>> The other part is at the LLVM IR level or before register allocation
>> identifying similar code sequence is much harder, at least with a suffix
>> tree like algorithm. Basically the problem is how do we name our
>> instructions such that we can match them.
>> Let me take an example.
>> foo() {
>> /* bunch of code */
>> a = add b, c;
>> d = add e, f;
>> }
>>
>> bar() {
>> d = add e, g;
>> f = add c, w;
>> }
>>
>> With proper renaming, we can outline both adds in one function. The
>> difficulty is to recognize that they are semantically equivalent to give
>> them the same identifier in the suffix tree. I won’t get into the details
>> but it gets tricky quickly. We were thinking of reusing GVN to have such
>> identifier if we wanted to do the outlining at IR level but solving this
>> problem is hard."
>>
>> This outliner will catch your case, it is actually one of the easiest
>> cases for it to catch. The outliner can recognize semantically equivalent
>> instructions and can be expanded to catch even more.
>>
>
> Interesting, could you explain how you do that?
> I didn’t see it in the original post.
>
>
>> As for the cost model it is quite simple:
>>  - We identify all of the external inputs into the sequence. For
>> estimating the benefit we constant fold and condense the inputs so that we
>> can get the set of *unique* inputs into the sequence.
>>
>
> Ok, those are your parameters. How do you account for the cost of setting
> up those parameters?
>
>  - We also identify outputs from the sequence, instructions that have
>> external references. We add the cost of storing/loading/passing output
>> parameter to the outlined function.
>>
>
> Ok, those are your return values (or your output parameter). I see the
> cost computation you do on those, but it still miss the general parameter
> setup cost.
>
>  - We identify one output to promote to a return value for the function.
>> This can end up reducing an output parameter that is passed to our outlined
>> function.
>>
>
> How do you choose that one?
>
>  - We estimate the cost of the sequence of instructions by mostly using
>> TTI.
>>
>
> Sounds sensible. (Although I am not a fan of the whole TTI thing :)).
>
>  - With the above we can estimate the cost per occurrence as well as the
>> cost for the new function. Of course these are mostly estimates, we lean
>> towards the conservative side to be safe.
>>
>
> Conservative in what sense? Put differently how do you know your cost is
> conservative?
>
>  - Finally we can compute an estimated benefit for the sequence taking
>> into account benefit per occurrence and the estimated cost of the new
>> function.
>>
>
> Make sense.
>
>
>> There is definitely room for improvement in the cost model. I do not
>> believe its the best but its shown to be somewhat reliable for most cases.
>> It has benefits and it has regressions, as does the machine outliner.
>>
>
> Regressions in what sense?
> Do you actually have functions that are bigger?
>
> To clarify, AFAIR, the machine outliner does not have such regressions per
> say. The outliner does perfectly what the cost model predicted: functions
> are individually smaller. Code size grow may happen because of padding in
> the object file (and getting in the way of some linker optimization).
>

The current MIR outliner doesn't take into account instruction encoding
length either. Considering that on x86 instructions can commonly be both 1
and 10+ bytes long, the variability from not modeling that is probably
comparable to the inaccuracy of estimating a fixed cost for an LLVM IR
instruction w.r.t. its lowering to machine code.

As a simple example, many commonly used x86 instructions encode as over 5
bytes, which is the size of a call instruction. So an outlined function
that consists of a single instruction can be profitable. IIRC there was
about 5% code size saving just from outlining single instructions (that
encode at >5 bytes) at machine code level on the test case I looked at (I
mentined it in a comment on one of Jessica's patches IIRC).

Do we have a way to get an instruction's exact encoded length for the MIR
outliner?

-- Sean Silva

>
> The reasoning for adding the new framework, is because I believe that this
>> transformation should exist at the IR level. Not only because it is the
>> simplest place to put it but also because it provides the greatest
>> opportunities for improvement with the largest coverage.
>>
>
> Again I don’t get those arguments. Machine passes can be target
> independent. Sure they may require the targets to adopt new hooks but
> that’s about it and I don’t see how this is different from having to add
> adopt hooks from TTI. I am guessing you can mostly achieve your goals with
> the existing TTI hooks, but the arguments passing ones, which is the part I
> missed in your explanation :).
>
> We can expand the framework to catch Danny's cases from the machine
>> outliner RFC, we can add region outlining, etc. We can do all of this and
>> make it available to every target automatically. The reason why I am for
>> this being at the IR level is not because I want to split up the technical
>> effort, but combine it.
>>
>
> I fail to see the combining part here :).
>
> To give you a bit context, I am afraid that, indeed, the implementation is
> going to be simpler (partly because a lot of people are more familiar with
> LLVM IR than MachineInstr) and for the most part TTI is going to be good
> enough. However, when we will want to go the extra mile and do crazy stuff,
> we will, like with the vectorizer IMHO, hit that wall of TTI  abstractions
> that we are going to work around in various ways. I’d like, if at all
> possible, to avoid repeating the history here.
>
> Cheers,
> -Quentin
>
>
>> Thanks,
>>   River Riddle
>>
>>
>> On Mon, Jul 24, 2017 at 4:14 PM, Quentin Colombet <qcolombet at apple.com>
>> wrote:
>>
>>> Hi River,
>>>
>>> On Jul 24, 2017, at 2:36 PM, River Riddle <riddleriver at gmail.com> wrote:
>>>
>>> Hi Quentin,
>>>  I appreciate the feedback. When I reference the cost of Target Hooks
>>> it's mainly for maintainability and cost on a target author. We want to
>>> keep the intrusion into target information minimized. The heuristics used
>>> for the outliner are the same used by any other IR level pass seeking
>>> target information, i.e TTI for the most part. I can see where you are
>>> coming from with "having heuristics solely focused on code size do not
>>> seem realistic", but I don't agree with that statement.
>>>
>>>
>>> If you only want code size I agree it makes sense, but I believe, even
>>> in Oz, we probably don’t want to slow the code by a big factor for a couple
>>> bytes. That’s what I wanted to say and what I wanted to point out is that
>>> you need to have some kind of model for the performance to avoid those
>>> worst cases. Unless we don’t care :).
>>>
>>> I think there is a disconnect on heuristics. The only user tunable
>>> parameters are the lower bound parameters(to the cost model), the actual
>>> analysis(heuristic calculation) is based upon TTI information.
>>>
>>>
>>> I don’t see how you can get around adding more hooks to know how a
>>> specific function prototype is going to be lowered (e.g., i64 needs to be
>>> split into two registers, fourth and onward parameters need to be pushed on
>>> the stack and so on). Those change the code size benefit.
>>>
>>>  When you say "Would still be interesting to see how well this could
>>> perform on some exact model (i.e., at the Machine level), IMO." I am
>>> slightly confused as to what you mean. I do not intend to try and implement
>>> this algorithm at the MIR level given that it exists in Machine Outliner.
>>>
>>>
>>> Of course, I don’t expect you to do that :). What I meant is that the
>>> claim that IR offers the better trade off is not based on hard evidences. I
>>> actually don’t buy it.
>>>
>>> My point was to make sure I understand what you are trying to solve and
>>> given you have mentioned the MachineOutliner, why you are not working on
>>> improving it instead of suggesting a new framework.
>>> Don’t take me wrong, maybe creating a new framework at the IR level is
>>> the right thing to do, but I still didn’t get that from your comments.
>>>
>>> There are several comparison benchmarks given in the "More detailed
>>> performance data" of the original RFC. It includes comparisons to the
>>> Machine Outliner when possible(I can't build clang on Linux with Machine
>>> Outliner). I welcome any and all discussion on the placement of the
>>> outliner in LLVM.
>>>
>>>
>>> My fear with a new framework is that we are going to split the effort
>>> for pushing the outliner technology forward and I’d like to avoid that if
>>> at all possible.
>>>
>>> Now, to be more concrete on your proposal, could you describe the cost
>>> model for deciding what to outline? (Really the cost model, not the suffix
>>> algo.)
>>> Are outlined functions pushed into the list candidates for further
>>> outlining?
>>>
>>> Cheers,
>>> -Quentin
>>>
>>>  Thanks,
>>> River Riddle
>>>
>>> On Mon, Jul 24, 2017 at 1:42 PM, Quentin Colombet <qcolombet at apple.com>
>>> wrote:
>>>
>>>> Hi River,
>>>>
>>>> On Jul 24, 2017, at 11:55 AM, River Riddle via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> Hi Jessica,
>>>>  The comparison to the inliner is an interesting one but we think it's
>>>> important to note the difference in the use of heuristics. The inliner is
>>>> juggling many different tasks at the same time, execution speed, code size,
>>>> etc. which can cause the parameters to be very sensitive depending on the
>>>> benchmark/platform/etc. The outliners heuristics are focused solely on the
>>>> potential code size savings from outlining, and is thus only sensitive to
>>>> the current platform. This only creates a problem when we are over
>>>> estimating the potential cost of a set of instructions for a particular
>>>> target. The cost model parameters are only minimums: instruction sequence
>>>> length, estimated benefit, occurrence amount. The heuristics themselves are
>>>> conservative and based upon all of the target information available at the
>>>> IR level, the parameters are just setting a lower bound to weed out any
>>>> outliers. You are correct in that being at the machine level, before or
>>>> after RA, will give the most accurate heuristics but we feel there's an
>>>> advantage to being at the IR level. At the IR level we can do so many more
>>>> things that are either too difficult/complex for the machine level(e.g
>>>> parameterization/outputs/etc). Not only can we do these things but they are
>>>> available on all targets immediately, without the need for target hooks.
>>>> The caution on the use of heuristics is understandable, but there comes a
>>>> point when trade offs need to be made. We made the trade off for a loss in
>>>> exact cost modeling to gain flexibility, coverage, and potential for
>>>> further features. This trade off is the same made for quite a few IR level
>>>> optimizations, including inlining. As for the worry about code size
>>>> regressions, so far the results seem to support our hypothesis.
>>>>
>>>>
>>>> Would still be interesting to see how well this could perform on some
>>>> exact model (i.e., at the Machine level), IMO. Target hooks are cheap and
>>>> choosing an implementation because it is simpler might not be the right
>>>> long term solution.
>>>> At the very least, to know what trade-off we are making, having
>>>> prototypes with the different approaches sounds sensible.
>>>> In particular, all the heuristics about cost for parameter passing
>>>> (haven’t checked how you did it) sounds already complex enough and would
>>>> require target hooks. Therefore, I am not seeing a clear win with an IR
>>>> approach here.
>>>>
>>>> Finally, having heuristics solely focused on code size do not seem
>>>> realistic. Indeed, I am guessing you have some thresholds to avoid
>>>> outlining some piece of code too small that would end up adding a whole lot
>>>> of indirections and I don’t like magic numbers in general :).
>>>>
>>>> To summarize, I wanted to point out that an IR approach is not as a
>>>> clear win as you describe and would thus deserve more discussion.
>>>>
>>>> Cheers,
>>>> -Quentin
>>>>
>>>>  Thanks,
>>>> River Riddle
>>>>
>>>> On Mon, Jul 24, 2017 at 11:12 AM, Jessica Paquette <jpaquette at apple.com
>>>> > wrote:
>>>>
>>>>>
>>>>> Hi River,
>>>>>
>>>>> I’m working on the MachineOutliner pass at the MIR level. Working at
>>>>> the IR level sounds interesting! It also seems like our algorithms are
>>>>> similar. I was thinking of taking the suffix array route with the
>>>>> MachineOutliner in the future.
>>>>>
>>>>> Anyway, I’d like to ask about this:
>>>>>
>>>>> On Jul 20, 2017, at 3:47 PM, River Riddle via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> The downside to having this type of transformation be at the IR level
>>>>> is it means there will be less accuracy in the cost model -  we can
>>>>> somewhat accurately model the cost per instruction but we can’t get
>>>>> information on how a window of instructions may lower. This can cause
>>>>> regressions depending on the platform/codebase, therefore to help alleviate
>>>>> this there are several tunable parameters for the cost model.
>>>>>
>>>>>
>>>>> The inliner is threshold-based and it can be rather unpredictable how
>>>>> it will impact the code size of a program. Do you have any idea as to how
>>>>> heuristics/thresholds/parameters could be tuned to prevent this? In
>>>>> my experience, making good code size decisions with these sorts of passes
>>>>> requires a lot of knowledge about what instructions you’re dealing with
>>>>> exactly. I’ve seen the inliner cause some pretty serious code size
>>>>> regressions in projects due to small changes to the cost model/parameters
>>>>> which cause improvements in other projects. I’m a little worried that an
>>>>> IR-level outliner for code size would be doomed to a similar fate.
>>>>>
>>>>> Perhaps it would be interesting to run this sort of pass pre-register
>>>>> allocation? This would help pull you away from having to use heuristics,
>>>>> but give you some more opportunities for finding repeated instruction
>>>>> sequences. I’ve thought of doing something like this in the future with the
>>>>> MachineOutliner and seeing how it goes.
>>>>>
>>>>> - Jessica
>>>>>
>>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/f2c35973/attachment.html>