[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

River Riddle via llvm-dev llvm-dev at lists.llvm.org
Tue Aug 15 16:26:59 PDT 2017


On Tue, Aug 15, 2017 at 4:22 PM Xinliang David Li <xinliangli at gmail.com>
wrote:

> On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hey Graham,
>>  I worked on pretty much this exact thing last year. I did something
>> similar to what you described: I traversed the CFG and built potentially
>> profitable regions from any given valid start node. At that point there
>> were several roadblocks that prevented it from really going anywhere (such
>> as the inliner not preserving PGO data), but those problems should be gone
>> by now.
>> Some thoughts:
>>
>> : In terms of cost modeling, you will want to consider the interaction
>> with loops (a rough sketch follows below):
>>  - What's the impact of outlining from inside of a loop?
>>  - What if the hot region does not include the entry/exit, i.e. a hot loop?
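>>
>> For example, a cold block inside a hot loop (a sketch; RareCond is
>> hypothetical). Outlining it puts the new call on a path whose frequency is
>> multiplied by the trip count, so any misjudgement of the profile is paid
>> on every offending iteration:
>>
>> bar(int N) {
>>   for (int i = 0; i < N; ++i) {
>>     ... // Hot loop body.
>>     if (RareCond)   // Cold inside a hot loop.
>>       ... // Outlining this region inserts a call here; the surrounding
>>           // loop stays in the caller, so the cost model has to weigh the
>>           // per-iteration call overhead against the size reduction.
>>   }
>> }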
>>
>> : Something that would also help, but would require more tuning, is
>> accounting for the relationship between size and "hotness". If, for
>> example, you have a function that looks like this:
>>
>> foo(bool B) {
>>   ... // Code that prevents early return.
>>
>>   if (B) // RegionA : Taken 50% of the time.
>>     ... // Very large amount of code.
>>   else   // RegionB : Taken 50% of the time.
>>     ... // Very small amount of code.
>> }
>>
>> In the above, you have two regions, neither of which is exactly cold.
>> Outlining RegionA will make foo profitable to inline, but only if you take
>> its large size into account. Using profile data as the sole cutoff will
>> miss this opportunity.
>>
>>
>
> Modelling the cost/benefit for this case can be very tricky, though.
>

Most definitely, but it's something to think about when moving forward.


>
>
>>
>> : You may also want to look into hot switch statement splitting, though I
>> don't know if it should really go into the partial inliner. This means
>> splitting a switch statement into multiple switch statements when only a
>> few of the cases are actually hot.
>>
>> SWITCH1:
>> switch(val) {
>>   HOT: ...
>>   COLD1: ...
>>   COLD2: ...
>>   ....
>>   COLDN: ...
>>   default: ...
>> }
>>
>> would get transformed into:
>>
>
>
>>
>> SWITCH1:
>> switch(val) {
>>   HOT: ...
>>   default:
>>     br SWITCH2
>> }
>>
>> SWITCH2:
>> switch(val) {
>>   COLD1: ...
>>   COLD2: ...
>>   ....
>>   COLDN: ...
>>   default: ...
>> }
>>
>>
>>
>
> This is profile-based switch peeling -- which is a separate optimization
> from partial inlining.
>
> David
>
>
Ah thanks, I wasn't sure of the name.
River Riddle


>
>
>
>> On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi Jessica,
>>>
>>> Thanks for the feedback.
>>>
>>> I believe the existing partial inliner pass does use some common
>>> utilities like the CodeExtractor to do the outlining.
>>>
>> AFAIK, the CodeExtractor is the only IR-level utility for outlining, but
>> it's not ideal and definitely needs some improvement. It currently
>> returns outputs via live-out pointer parameters, which means that outlined
>> functions can't be tagged with attributes like readonly.
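>>
>> A rough sketch of the issue (C-ish pseudocode, not actual CodeExtractor
>> output):
>>
>> // Original: x is defined in the cold region and used afterwards.
>> int x;
>> if (Cold) { ...; x = compute(); }
>> use(x);
>>
>> // After extraction, the output is returned through a pointer parameter:
>> if (Cold) foo_cold(&x);
>> use(x);
>>
>> void foo_cold(int *x_out) {
>>   ...;
>>   *x_out = compute(); // The store through the out-pointer is what keeps
>>                       // the outlined function from being readonly.
>> }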
>>
>> There were also some known problems with extracting in the presence of
>> critical edges, but I don't know if that's still the case.
>>
>>
>>> Is that what you're referring to? I don't recall seeing any other passes
>>> that do outlining other than the machine outliner, but I may have missed
>>> something.
>>>
>>> I briefly looked at River's RFC and it seems he's mainly utilizing
>>> outlining as a tool to reduce code size while maintaining code performance.
>>>
>>
>> The main point of the outliner I proposed in that RFC is benefiting code
>> size. The use of PGO data is simply to avoid degrading runtime
>> performance.
>>
>>
>>> Our proposal is to improve on the existing outlining scheme of the
>>> partial inliner to give us more opportunities to inline hot code. We will
>>> only do this type of general outlining in the presence of profiling or
>>> user data (such as __builtin_expect).
>>>
>> With that said, I am interested in how River plans to handle live ranges
>>>
>>
>>  It's a pretty difficult problem. I have an estimate for register usage
>> within an outlined region, using something similar to the register cost
>> calculation in the loop vectorizer. I don't currently have anything for the
>> actual call to the function itself. I haven't found a nice enough solution
>> that can work for a large number of candidates.
>>
>> and heuristics he will use with profiling data available.
>>>
>> As I said above, the use of profile data is currently just to avoid
>> regressing execution time. So the heuristics are basically "Is this too hot
>> to outline from?". That outliner also has a completely different goal/cost
>> model, which makes applying it to this problem difficult.
>>
>> Thanks,
>>  River Riddle
>>
>>
>>>
>>> I haven't looked at the machine outliner yet, but my understanding is it
>>> also prioritizes code size reduction over performance gains. Is my
>>> assumption correct? Do you think there's some tuning we can do in the
>>> machine outliner to tailor it for performance improvements?
>>>
>>> Graham Yiu
>>> LLVM Compiler Development
>>> IBM Toronto Software Lab
>>> Office: (905) 413-4077 C2-407/8200/Markham
>>> Email: gyiu at ca.ibm.com
>>>
>>>
>>> From: Jessica Paquette <jpaquette at apple.com>
>>> To: Graham Yiu <gyiu at ca.ibm.com>
>>> Cc: llvm-dev at lists.llvm.org
>>> Date: 08/15/2017 02:45 PM
>>> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
>>> general outlining scheme for cold blocks
>>> Sent by: jpaquette at apple.com
>>> ------------------------------
>>>
>>>
>>>
>>> Hi Graham,
>>>
>>> Have you looked at using the existing outliner framework to implement
>>> partial inlining? In the past, people have combined outlining + inlining to
>>> get partial inlining “for free” (see "Function outlining and partial
>>> inlining” by Zhao and Amaral).
>>>
>>> There is the MachineOutliner. River Riddle has been working on an
>>> IR-level outliner as well. The MachineOutliner doesn’t currently consider
>>> the heat of a block, but might be worth experimenting with. It might be
>>> good to experiment with River’s work and the existing outlining pass to see
>>> if this functionality can effectively fall out of what we have already.
>>>
>>> Here’s the recent IR-level outlining thread for reference w.r.t. River’s
>>> work: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html
>>>
>>> - Jessica
>>>
>>>    On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
>>>       llvm-dev at lists.llvm.org> wrote:
>>>
>>>       Hello,
>>>
>>>       My team and I are looking to make some enhancements to the partial
>>>       inliner in opt. We would appreciate any feedback that folks might have.
>>>
>>>       # Partial Inlining in LLVM opt
>>>
>>>       ## Summary
>>>
>>>       ### Background
>>>
>>>       Currently, the partial inliner searches the first few blocks of
>>>       the callee and looks for a branch to the return block (i.e., an early
>>>       return). If found, it attempts to outline the rest of the slow (or heavy)
>>>       code so the inliner will be able to inline the fast (or light) code. If no
>>>       early return is found, the partial inliner gives up. As far as I can tell,
>>>       BlockFrequency and BranchProbability information is only used when
>>>       attempting to inline the early-return code, and not used to determine
>>>       whether to outline the slow code.
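>>>
>>>       As a rough illustration of the shape the pass currently looks for
>>>       (a C-ish sketch, not taken from the pass itself):
>>>
>>>       int qux(node *p) {
>>>         if (p == 0)   // Early return found in the first few blocks:
>>>           return -1;  // this fast path is cheap enough to inline.
>>>         ...           // Heavy code: outlined so callers can inline the
>>>                       // check above.
>>>       }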
>>>
>>>       ### Proposed changes
>>>
>>>       In addition to looking for early returns, we should utilize
>>>       profile information to outline blocks that are considered cold. If we can
>>>       sufficiently reduce the size of the original function via this type of
>>>       outlining, the inliner should be able to inline the rest of the hot code.
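>>>
>>>       For example (a sketch; assume the branch is marked cold either by
>>>       PGO counts or by __builtin_expect):
>>>
>>>       void frob(int err) {
>>>         ... // Hot code.
>>>         if (__builtin_expect(err != 0, 0)) {
>>>           ... // Cold error handling: outline this region so the
>>>               // remaining hot code is small enough to inline.
>>>         }
>>>         ... // More hot code, with no early return for the current
>>>             // partial inliner to latch onto.
>>>       }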
>>>
>>>       ## Details
>>>
>>>       With profile information present, we have a view of which code is
>>>       infrequently executed and can make better decisions about what to outline.
>>>       Early-return blocks that are infrequently executed should still be included
>>>       as candidates for outlining, but will be treated just like any other cold
>>>       blocks. Without profiling information, however, we should remain
>>>       conservative and only partially inline in the presence of an early return
>>>       in the first few blocks of a function (i.e., peel the early return out of
>>>       the function).
>>>
>>>       To find cold regions to outline, we will traverse the CFG looking
>>>       for edges deemed 'cold' and consider the blocks dominated by the successor
>>>       node of such an edge. If, for some reason, that successor has more than one
>>>       predecessor, we will skip this candidate, as there should be a dominating
>>>       node for this successor that has a single entry point. The last node in the
>>>       dominance vector should also have a single successor. If it does not, then
>>>       further investigation of the CFG is necessary to see when/how this
>>>       situation occurs. A sketch of this search is given below.
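>>>
>>>       A sketch of that candidate search (pseudocode with hypothetical
>>>       helper names, not existing LLVM API):
>>>
>>>       for (each edge Pred -> Succ in the CFG) {
>>>         if (!isColdEdge(Pred, Succ))       // Based on branch weights.
>>>           continue;
>>>         if (numPredecessors(Succ) > 1)     // Skip; a dominating node with a
>>>           continue;                        // single entry should exist instead.
>>>         Region = blocksDominatedBy(Succ);  // Candidate cold region.
>>>         if (numSuccessors(lastBlock(Region)) > 1)
>>>           continue;                        // Needs further investigation.
>>>         Candidates.push_back(Region);
>>>       }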
>>>
>>>       We will need several heuristics to make sure we only outline in
>>>       cases where we are confident it will result in a performance gain. Things
>>>       such as the threshold for when a branch is considered cold, the minimum
>>>       number of times the predecessor node has to be executed for an edge to be
>>>       considered at all (a confidence factor), and the minimum size of the region
>>>       to be outlined (we can use the inlining cost analysis, as we currently do)
>>>       will require some level of tuning.
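>>>
>>>       The gating might look roughly like this (hypothetical names; the
>>>       concrete threshold values are exactly what needs tuning):
>>>
>>>       bool shouldOutline(const Region &R) {
>>>         if (edgeProbability(R.EntryEdge) > ColdBranchThreshold)
>>>           return false;  // Branch into the region isn't cold enough.
>>>         if (executionCount(R.PredBlock) < MinPredecessorCount)
>>>           return false;  // Confidence factor: too few samples to trust.
>>>         if (inliningCost(R.Blocks) < MinOutlinedRegionCost)
>>>           return false;  // Region too small to pay for a call.
>>>         return true;
>>>       }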
>>>
>>>       Similar to the current implementation, we will attempt to inline
>>>       the leftover (hot) parts of the code, and if for some reason we cannot,
>>>       we will discard the modified function and its outlined code.
>>>
>>>       ### Code changes
>>>
>>>       The current Partial Inlining code first clones the function of
>>>       interest and looks for a single set of blocks to outline. It then creates a
>>>       function from that set of blocks, and saves the outlined function and
>>>       outlined call-site information as part of the function-cloning container.
>>>       In order to outline multiple regions of the function, we will need to
>>>       change these containers to keep track of a list of regions to outline. We
>>>       will also need to update the cost analysis to take into account multiple
>>>       outlined functions. A possible shape for this change is sketched below.
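>>>
>>>       One possible shape for the container change (hypothetical names,
>>>       not the existing PartialInlining structures):
>>>
>>>       struct OutlinedRegionInfo {
>>>         std::vector<BasicBlock *> Blocks; // Cold blocks to extract.
>>>         Function *OutlinedFn = nullptr;   // Result of extraction.
>>>         CallInst *OutlinedCall = nullptr; // Call to the outlined code.
>>>       };
>>>
>>>       struct FunctionCloningInfo {
>>>         Function *ClonedFn;
>>>         // Previously a single outlined function/call-site pair; now a
>>>         // list so multiple cold regions can be outlined from one clone.
>>>         SmallVector<OutlinedRegionInfo, 4> OutlinedRegions;
>>>       };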
>>>
>>>       When a ProfileSummary is available, we should skip the code that
>>>       looks for early returns and go into new code that looks for cold regions
>>>       to outline. When a ProfileSummary is not available, we can fall back to
>>>       the existing code and look for early returns only.
>>>
>>>       ### Tuning
>>>
>>>       - The outlining heuristics will need to determine if a set of
>>>       cold blocks is large enough to warrant the overhead of a function call. We
>>>       also don't want the inliner to attempt to inline the outlined code later.
>>>       - The threshold for determining whether a block is cold will also
>>>       need to be tuned. If the profiling information is not accurate, we will
>>>       pay the price of the additional call overhead when the supposedly cold
>>>       code is actually executed.
>>>       - The confidence factor, which can be viewed as the minimum
>>>       number of times the predecessor has to be executed in order for an edge to
>>>       be considered cold, should also be taken into account to avoid outlining
>>>       code paths we have little information on.
>>>
>>>       Graham Yiu
>>>       LLVM Compiler Development
>>>       IBM Toronto Software Lab
>>>       Office: (905) 413-4077 C2-407/8200/Markham
>>>       Email: gyiu at ca.ibm.com
>>>
>>>
>>>
>>
>>
>> --
Thank you,
River Riddle

