[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Tue Aug 15 17:04:28 PDT 2017

Hey David,

Yes, we'll need to consider the effect on live ranges for regions we want
to outline.  In my experience, outlining live-exit regions seem to cause
the most harm as we ruin chances to keep data in registers as you were
alluding to.  It's unclear, however, what the exact effect of outlining
regions with live-entries would be.

I'll probably try to avoid regions that are not single entry & single exit
at least initially, to simplify the transformation and analysis.  Are
multi-exit regions common in your experience?

And of course, I agree, we should reuse as much of the current partial
inlining infrastructure as possible.  I'll likely run some ideas by you as
I begin to make changes.

Cheers,

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-407/8200/Markham
Email: gyiu at ca.ibm.com

From:	Xinliang David Li <xinliangli at gmail.com>
To:	Graham Yiu <gyiu at ca.ibm.com>
Cc:	llvm-dev <llvm-dev at lists.llvm.org>
Date:	08/15/2017 05:36 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks

Hi Graham, Making partial inlining more general is something worth doing.
Regarding your implementation plan, I have some suggestions here:

*) Function outlining introduces additional runtime cost: passing of live
in values, returning of live out values (via memory), glue code in the
caller to handle regions without a single exit block etc.  The cost
analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single exit or
single entry/multiple exit regions (cold ones) in the routine, and outline
each region into its own function. The benefit include
   1) simplify the design and implementation and most of the existing code
can be reused;
   2) provide more flexibility to allow most effective outlining;
   3) reduced runtime overhead of making calls to the outline functions.

thanks,

David

On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
  Hello,

  My team and I are looking to do some enhancements in the partial inliner
  in opt. Would appreciate any feedback that folks might have.

  # Partial Inlining in LLVM opt

  ## Summary

  ### Background

  Currently, the partial inliner searches the first few blocks of the
  callee and looks for a branch to the return block (ie. early return). If
  found, it attempts to outline the rest of the slow (or heavy) code so the
  inliner will be able to inline the fast (or light) code. If no early
  returns are found, the partial inliner will give up. As far as I can
  tell, BlockFrequency and BranchProbability information is only used when
  attempting to inline the early return code, and not used to determine
  whether to outline the slow code.

  ### Proposed changes

  In addition to looking for early returns, we should utilize profile
  information to outline blocks that are considered cold. If we can
  sufficiently reduce the size of the original function via this type of
  outlining, inlining should be able to inline the rest of the hot code.

  ## Details

  With the presence of profile information, we have a view of what code is
  infrequently executed and make better decisions on what to outline. Early
  return blocks that are infrequently executed should still be included as
  candidates for outlining, but will be treated just like any other cold
  blocks. Without profiling information, however, we should remain
  conservative and only partial inline in the presence of an early return
  in the first few blocks of a function (ie. peel the early return out of
  the function).

  To find cold regions to outline, we will traverse the CFG to find edges
  deemed 'cold' and look at the blocks dominated by the successor node. If,
  for some reason, that block has more than one predecessor, then we will
  skip this candidate as there should be a node that dominates this
  successor that has a single entry point. The last node in the dominance
  vector should also have a single successor. If it does not, then further
  investigation of the CFG is necessary to see when/how this situation
  occurs.

  We will need several heuristics to make sure we only outline in cases
  where we are confident it will result in a performance gain. Things such
  as threshold on when a branch is considered cold, the minimum number of
  times the predecessor node has to be executed in order for an edge to be
  considered (confidence factor), and the minimum size of the region to be
  outlined (can use inlining cost analysis like we currently do) will
  require some level of tuning.

  Similar to the current implementation, we will attempt to inline the
  leftover (hot) parts of the code, and if for some reason we cannot then
  we discard the modified function and its outlined code.

  ### Code changes

  The current Partial Inlining code first clones the function of interest
  and looks for a single set of blocks to outline. It then creates a
  function with the set the blocks, and saves the outlined function and
  outline callsite information as part of the function cloning container.
  In order to outline multiple regions of the function, we will need to
  change these containers to keep track of a list of regions to outline. We
  will also need to update the cost analysis to take into account multiple
  outlined functions.

  When a ProfileSummary is available, then we should skip the code that
  looks for early returns and go into new code that looks for cold regions
  to outline. When ProfileSummary is not available, then we can fall back
  to the existing code and look for early returns only.

  ### Tuning

  - The outlining heuristics will need to determine if a set of cold blocks
  is large enough to warrant the overhead of a function call. We also don't
  want the inliner to attempt to inline the outlined code later.
  - The threshold for determining whether a block is cold will also need to
  be tuned. In the case that profiling information is not accurate, we will
  pay the price of the additional call overhead for executing cold code.
  - The confidence factor, which can be viewed as the minimum number of
  times the predecessor has to be executed in order for an edge to be
  considered cold, should also be taken into account to avoid outlining
  code paths we have little information on.

  Graham Yiu
  LLVM Compiler Development
  IBM Toronto Software Lab
  Office: (905) 413-4077 C2-407/8200/Markham
  Email: gyiu at ca.ibm.com

  _______________________________________________
  LLVM Developers mailing list
  llvm-dev at lists.llvm.org
  http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/b663c474/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/b663c474/attachment.gif>