[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Tue Aug 29 10:44:12 PDT 2017

On Tue, Aug 29, 2017 at 9:15 AM, keita abdoul-kader via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I second the fact that a way to outline specific function regions
> independently of the partial inliner sound very useful. I am not sure
> however if we would want a mode within the partialInliner or something
> completely independent.
>
> As a general question,   does anybody has a clear idea of what are the
> constraints on the region CodeExtractor is currently able to handle ?
> Going through the code, it looks like the only requirement is for the
> header to dominate all the BB in the region ;
>

It has some additional constraints on the types of instructions that it
will actually outline, see isBlockValidForExtraction in CodeExtractor.cpp.
For example, it prevents outlining things like invokes, allocas, invokes,
etc. There is also a restriction that the user needs to handle PHI nodes
that have inputs from more than one extracted block. This is handled in
PartialInliner.cpp currently but it applies to anything that uses it. The
comment in PartialInliner.cpp:

 // Special hackery is needed with PHI nodes that have inputs from more than
  // one extracted block.  For simplicity, just split the PHIs into a
two-level
  // sequence of PHIs, some of which will go in the extracted region, and
some
  // of which will go outside.

 Thanks,
 River Riddle

>
> On Sat, Aug 26, 2017 at 9:52 AM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com> wrote:
>>
>>> Hi David,
>>>
>>> The only reason I can see to use the 'pattern matching' part as a
>>> fall-back is in case we cannot inline the (what I'm assuming would be) a
>>> much bigger hot-path-only cloned function for whatever reason. What I'm
>>> assuming here is that after cold-region outlining, we may still have a
>>> large portion of the original function body to attempt to inline, whereas
>>> the pattern matching method will only contain a few basic blocks, giving a
>>> better chance to inline something.
>>>
>>
>> With profile data, the overhead of outlining a cold region can be
>> estimated more accurately. (With the new PM), the threshold of inlining a
>> hot callsite is also much higher. Without profile, the pattern matching
>> method won't work too well in general even though it can enable more more
>> inlining because the call overhead introduced to call the outlined function
>> may outweigh the benefit of inlining the caller.
>>
>> What ever region that can be found by the pattern matching method should
>> be identified by the new method as well. If there are multiple (but
>> mutually exclusive) candidate regions found, the cost analysis heuristic
>> should pick the best candidate region for outlining .
>>
>>
>>>
>>>
>>> For your (2) point, I think we'll have to be careful here. Without a
>>> sense of how 'likely' we're going to inline the new function, we'll have to
>>> make sure our outlining of cold regions will not degrade the performance of
>>> the function in 99.xx% of the cases, as it's unclear how much performance
>>> we'll gain from just outlining (without inlining to increase the odds of
>>> some performance gain). My initial thought was to ditch the new function
>>> and its outlined children if we cannot immediately inline it.
>>>
>>
>> The outlining only mode is useful to enable more aggressive inlining for
>> the regular inlining pass. Slightly different heuristics can be applied
>> here. For instance it can prefer largest candidate region (to maximiize the
>> chance to inline the caller). The outlined region does not need to be super
>> cold and leave it to the inliner to do more deeper analysis and decide to
>> inline it right back in.
>>
>> David
>>
>>
>>
>>>
>>>
>>> Graham Yiu
>>> LLVM Compiler Development
>>> IBM Toronto Software Lab
>>> Office: (905) 413-4077 C2-707/8200/Markham
>>> Email: gyiu at ca.ibm.com
>>>
>>> [image: Inactive hide details for Xinliang David Li ---08/24/2017
>>> 03:05:06 PM---On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.i]Xinliang
>>> David Li ---08/24/2017 03:05:06 PM---On Thu, Aug 24, 2017 at 10:40 AM,
>>> Graham Yiu <gyiu at ca.ibm.com> wrote: > Hi David,
>>>
>>> From: Xinliang David Li <xinliangli at gmail.com>
>>> To: Graham Yiu <gyiu at ca.ibm.com>
>>> Cc: llvm-dev <llvm-dev at lists.llvm.org>
>>> Date: 08/24/2017 03:05 PM
>>>
>>> Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
>>> general outlining scheme for cold blocks
>>> ------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <*gyiu at ca.ibm.com*
>>> <gyiu at ca.ibm.com>> wrote:
>>>
>>>    Hi David,
>>>
>>>    So I've began doing some implementation on the outlining portion of
>>>    the code. Currently, I got the partial inliner to outline cold regions
>>>    (single entry, single exit) of the code, based solely on the existence of
>>>    ProfileSummaryInfo (ie. profiling data). However, I have some concerns on
>>>    how this will co-exist with the existing code that peels early returns.
>>>
>>>    The control flow looks something like this:
>>>
>>>    // New Code: find cold regions to outline
>>>    if (!computeOutliningInfoForColdRegions()) {
>>>    // If we can't find any cold regions, then fall-back to early return
>>>    peeling
>>>    if (!computeOutliningInfo) {
>>>    return nullptr;
>>>    }
>>>    }
>>>    // Try to outline the identified regions
>>>    // Then try to inline the cloned function
>>>
>>>    My concern is during inlining, if we fail to inline the cloned
>>>    function, we give up and discard all cloned and outlined functions. But
>>>    with these two types of outlining we're doing, it's possible to attempt to
>>>    inline the cloned function that has outlined cold regions, and if we cannot
>>>    do so, try to inline a different clone that has peeled early returns (ie.
>>>    the way we have it today). This would require us to clone the original
>>>    function twice and modify one based on cold region outlining and the other
>>>    early return peeling, with the latter being our fall-back option if we fail
>>>    to inline the first clone.
>>>
>>>    What are your thoughts?
>>>
>>>
>>>
>>> I expect  computeOutliningInfoForColdRegions can produce a super set of
>>> outlinable regions to the current 'pattern matching' approach. In other
>>> words, most of the cases currently caught by 'computeOutlineInfo' should be
>>> caught by the new algorithm, so why not ditching the current
>>> 'computeOutlningInfo' completely?
>>>
>>> My suggestion was to enhance the pass to 1) support outlining multiple
>>> regions; and 2) add a mode to do function outlining only (not the inlining
>>> part).  The second is important can be used before the regular inliner
>>> pass.   With the new pass manager and profile aware inlining, the inliner
>>> won't undo the outline decision, but in meantime becomes more powerful due
>>> to the reduced hot function size.
>>>
>>> David
>>>
>>>
>>>
>>>    Graham Yiu
>>>    LLVM Compiler Development
>>>    IBM Toronto Software Lab
>>>    Office: *(905) 413-4077* <(905)%20413-4077> C2-707/8200/Markham
>>>    Email: *gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>>>
>>>    [image: Inactive hide details for Graham Yiu---08/15/2017 08:04:28
>>>    PM---Hey David, Yes, we'll need to consider the effect on live range]Graham
>>>    Yiu---08/15/2017 08:04:28 PM---Hey David, Yes, we'll need to consider the
>>>    effect on live ranges for regions we want to outline. In
>>>
>>>    From: Graham Yiu/Toronto/IBM
>>>    To: Xinliang David Li <*xinliangli at gmail.com* <xinliangli at gmail.com>>
>>>    Cc: llvm-dev <*llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>>
>>>    Date: 08/15/2017 08:04 PM
>>>    Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
>>>    general outlining scheme for cold blocks
>>>    ------------------------------
>>>
>>>
>>>    Hey David,
>>>
>>>    Yes, we'll need to consider the effect on live ranges for regions we
>>>    want to outline. In my experience, outlining live-exit regions seem to
>>>    cause the most harm as we ruin chances to keep data in registers as you
>>>    were alluding to. It's unclear, however, what the exact effect of outlining
>>>    regions with live-entries would be.
>>>
>>>    I'll probably try to avoid regions that are not single entry &
>>>    single exit at least initially, to simplify the transformation and
>>>    analysis. Are multi-exit regions common in your experience?
>>>
>>>    And of course, I agree, we should reuse as much of the current
>>>    partial inlining infrastructure as possible. I'll likely run some ideas by
>>>    you as I begin to make changes.
>>>
>>>    Cheers,
>>>
>>>    Graham Yiu
>>>    LLVM Compiler Development
>>>    IBM Toronto Software Lab
>>>    Office: *(905) 413-4077* <(905)%20413-4077> C2-407/8200/Markham
>>>    Email: *gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>>>
>>>
>>>    [image: Inactive hide details for Xinliang David Li ---08/15/2017
>>>    05:36:07 PM---Hi Graham, Making partial inlining more general is some]Xinliang
>>>    David Li ---08/15/2017 05:36:07 PM---Hi Graham, Making partial inlining
>>>    more general is something worth doing. Regarding your implementat
>>>
>>>    From: Xinliang David Li <*xinliangli at gmail.com*
>>>    <xinliangli at gmail.com>>
>>>    To: Graham Yiu <*gyiu at ca.ibm.com* <gyiu at ca.ibm.com>>
>>>    Cc: llvm-dev <*llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>>
>>>    Date: 08/15/2017 05:36 PM
>>>    Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
>>>    general outlining scheme for cold blocks
>>>    ------------------------------
>>>
>>>
>>>
>>>
>>>    Hi Graham, Making partial inlining more general is something worth
>>>    doing.  Regarding your implementation plan, I have some suggestions here:
>>>
>>>    *) Function outlining introduces additional runtime cost: passing of
>>>    live in values, returning of live out values (via memory), glue code in the
>>>    caller to handle regions without a single exit block etc.  The cost
>>>    analysis needs to factor in those carefully
>>>    *) Remove the limitation that there is only *one* outlined routine.
>>>    Instead, the algorithm can compute multiple single-entry/single exit or
>>>    single entry/multiple exit regions (cold ones) in the routine, and outline
>>>    each region into its own function. The benefit include
>>>       1) simplify the design and implementation and most of the
>>>    existing code can be reused;
>>>       2) provide more flexibility to allow most effective outlining;
>>>       3) reduced runtime overhead of making calls to the outline
>>>    functions.
>>>
>>>    thanks,
>>>
>>>    David
>>>
>>>    On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
>>>    *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>> wrote:
>>>       Hello,
>>>
>>>          My team and I are looking to do some enhancements in the
>>>          partial inliner in opt. Would appreciate any feedback that folks might have.
>>>
>>>          # Partial Inlining in LLVM opt
>>>
>>>          ## Summary
>>>
>>>          ### Background
>>>
>>>          Currently, the partial inliner searches the first few blocks
>>>          of the callee and looks for a branch to the return block (ie. early
>>>          return). If found, it attempts to outline the rest of the slow (or heavy)
>>>          code so the inliner will be able to inline the fast (or light) code. If no
>>>          early returns are found, the partial inliner will give up. As far as I can
>>>          tell, BlockFrequency and BranchProbability information is only used when
>>>          attempting to inline the early return code, and not used to determine
>>>          whether to outline the slow code.
>>>
>>>          ### Proposed changes
>>>
>>>          In addition to looking for early returns, we should utilize
>>>          profile information to outline blocks that are considered cold. If we can
>>>          sufficiently reduce the size of the original function via this type of
>>>          outlining, inlining should be able to inline the rest of the hot code.
>>>
>>>          ## Details
>>>
>>>          With the presence of profile information, we have a view of
>>>          what code is infrequently executed and make better decisions on what to
>>>          outline. Early return blocks that are infrequently executed should still be
>>>          included as candidates for outlining, but will be treated just like any
>>>          other cold blocks. Without profiling information, however, we should remain
>>>          conservative and only partial inline in the presence of an early return in
>>>          the first few blocks of a function (ie. peel the early return out of the
>>>          function).
>>>
>>>          To find cold regions to outline, we will traverse the CFG to
>>>          find edges deemed 'cold' and look at the blocks dominated by the successor
>>>          node. If, for some reason, that block has more than one predecessor, then
>>>          we will skip this candidate as there should be a node that dominates this
>>>          successor that has a single entry point. The last node in the dominance
>>>          vector should also have a single successor. If it does not, then further
>>>          investigation of the CFG is necessary to see when/how this situation occurs.
>>>
>>>          We will need several heuristics to make sure we only outline
>>>          in cases where we are confident it will result in a performance gain.
>>>          Things such as threshold on when a branch is considered cold, the minimum
>>>          number of times the predecessor node has to be executed in order for an
>>>          edge to be considered (confidence factor), and the minimum size of the
>>>          region to be outlined (can use inlining cost analysis like we currently do)
>>>          will require some level of tuning.
>>>
>>>          Similar to the current implementation, we will attempt to
>>>          inline the leftover (hot) parts of the code, and if for some reason we
>>>          cannot then we discard the modified function and its outlined code.
>>>
>>>          ### Code changes
>>>
>>>          The current Partial Inlining code first clones the function of
>>>          interest and looks for a single set of blocks to outline. It then creates a
>>>          function with the set the blocks, and saves the outlined function and
>>>          outline callsite information as part of the function cloning container. In
>>>          order to outline multiple regions of the function, we will need to change
>>>          these containers to keep track of a list of regions to outline. We will
>>>          also need to update the cost analysis to take into account multiple
>>>          outlined functions.
>>>
>>>          When a ProfileSummary is available, then we should skip the
>>>          code that looks for early returns and go into new code that looks for cold
>>>          regions to outline. When ProfileSummary is not available, then we can fall
>>>          back to the existing code and look for early returns only.
>>>
>>>          ### Tuning
>>>
>>>          - The outlining heuristics will need to determine if a set of
>>>          cold blocks is large enough to warrant the overhead of a function call. We
>>>          also don't want the inliner to attempt to inline the outlined code later.
>>>          - The threshold for determining whether a block is cold will
>>>          also need to be tuned. In the case that profiling information is not
>>>          accurate, we will pay the price of the additional call overhead for
>>>          executing cold code.
>>>          - The confidence factor, which can be viewed as the minimum
>>>          number of times the predecessor has to be executed in order for an edge to
>>>          be considered cold, should also be taken into account to avoid outlining
>>>          code paths we have little information on.
>>>
>>>          Graham Yiu
>>>          LLVM Compiler Development
>>>          IBM Toronto Software Lab
>>>          Office: *(905) 413-4077* <(905)%20413-4077> C2-407/8200/Markham
>>>          Email: *gyiu at ca.ibm.com* <gyiu at ca.ibm.com>
>>>
>>>          _______________________________________________
>>>          LLVM Developers mailing list
>>> *llvm-dev at lists.llvm.org* <llvm-dev at lists.llvm.org>
>>> *http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev*
>>>          <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=4ST7e3kMd0GTi3w9ByK5Cw&m=rbfPPnRP9weVvtwCT5LyhMrn3TeP6-HaVUUkv-DHQ5I&s=0NPYoALj0vvVlLnq4AKtctnM_tHFxPY6SsX5mv2LMUE&e=>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/cd430273/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/cd430273/attachment.gif>