[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Tue Aug 15 11:22:42 PDT 2017

Hello,

 My team and I are looking to do some enhancements in the partial inliner
 in opt.  Would appreciate any feedback that folks might have.

 # Partial Inlining in LLVM opt

 ## Summary

 ### Background

 Currently, the partial inliner searches the first few blocks of the callee
 and looks for a branch to the return block (ie. early return).  If found,
 it attempts to outline the rest of the slow (or heavy) code so the inliner
 will be able to inline the fast (or light) code.  If no early returns are
 found, the partial inliner will give up.  As far as I can tell,
 BlockFrequency and BranchProbability information is only used when
 attempting to inline the early return code, and not used to determine
 whether to outline the slow code.

 ### Proposed changes

 In addition to looking for early returns, we should utilize profile
 information to outline blocks that are considered cold.  If we can
 sufficiently reduce the size of the original function via this type of
 outlining, inlining should be able to inline the rest of the hot code.

 ## Details

 With the presence of profile information, we have a view of what code is
 infrequently executed and make better decisions on what to outline.  Early
 return blocks that are infrequently executed should still be included as
 candidates for outlining, but will be treated just like any other cold
 blocks.  Without profiling information, however, we should remain
 conservative and only partial inline in the presence of an early return in
 the first few blocks of a function (ie. peel the early return out of the
 function).

 To find cold regions to outline, we will traverse the CFG to find edges
 deemed 'cold' and look at the blocks dominated by the successor node.  If,
 for some reason, that block has more than one predecessor, then we will
 skip this candidate as there should be a node that dominates this
 successor that has a single entry point.  The last node in the dominance
 vector should also have a single successor.  If it does not, then further
 investigation of the CFG is necessary to see when/how this situation
 occurs.

 We will need several heuristics to make sure we only outline in cases
 where we are confident it will result in a performance gain.  Things such
 as threshold on when a branch is considered cold, the minimum number of
 times the predecessor node has to be executed in order for an edge to be
 considered (confidence factor), and the minimum size of the region to be
 outlined (can use inlining cost analysis like we currently do) will
 require some level of tuning.

 Similar to the current implementation, we will attempt to inline the
 leftover (hot) parts of the code, and if for some reason we cannot then we
 discard the modified function and its outlined code.

 ### Code changes

 The current Partial Inlining code first clones the function of interest
 and looks for a single set of blocks to outline.  It then creates a
 function with the set the blocks, and saves the outlined function and
 outline callsite information as part of the function cloning container.
 In order to outline multiple regions of the function, we will need to
 change these containers to keep track of a list of regions to outline.  We
 will also need to update the cost analysis to take into account multiple
 outlined functions.

 When a ProfileSummary is available, then we should skip the code that
 looks for early returns and go into new code that looks for cold regions
 to outline.  When ProfileSummary is not available, then we can fall back
 to the existing code and look for early returns only.

 ### Tuning

 - The outlining heuristics will need to determine if a set of cold blocks
 is large enough to warrant the overhead of a function call.  We also don't
 want the inliner to attempt to inline the outlined code later.
 - The threshold for determining whether a block is cold will also need to
 be tuned.  In the case that profiling information is not accurate, we will
 pay the price of the additional call overhead for executing cold code.
 - The confidence factor, which can be viewed as the minimum number of
 times the predecessor has to be executed in order for an edge to be
 considered cold, should also be taken into account to avoid outlining code
 paths we have little information on.

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/17fc26c9/attachment.html>