[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks

Tue Aug 29 10:24:17 PDT 2017

Hi Kader,

I agree with you, if we were going to only do outlining for some functions
and not immediately attempt to inline, it should be an independent pass.
The partial inliner should do what its name suggests and attempt to inline
or bail.

I haven't looked through the CodeExtractor at all.  I imagine I'll have to
go through it at some point.  I'd also be interested in something that does
an analysis before code extraction that tells me how many live ranges I'm
going to be killing or how many symbols I'm going  to be taking the address
of by extracting a specific region of code.  Not sure if that currently
exists.

Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077      C2-707/8200/Markham
Email: gyiu at ca.ibm.com

From:	keita abdoul-kader <abdoulk.keita at gmail.com>
To:	Xinliang David Li <xinliangli at gmail.com>
Cc:	Graham Yiu <gyiu at ca.ibm.com>, llvm-dev
            <llvm-dev at lists.llvm.org>
Date:	08/29/2017 12:15 PM
Subject:	Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
            outlining scheme for cold blocks

I second the fact that a way to outline specific function regions
independently of the partial inliner sound very useful. I am not sure
however if we would want a mode within the partialInliner or something
completely independent.

As a general question,   does anybody has a clear idea of what are the
constraints on the region CodeExtractor is currently able to handle ?
Going through the code, it looks like the only requirement is for the
header to dominate all the BB in the region ;

On Sat, Aug 26, 2017 at 9:52 AM, Xinliang David Li via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

  On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com> wrote:
   Hi David,

   The only reason I can see to use the 'pattern matching' part as a
   fall-back is in case we cannot inline the (what I'm assuming would be) a
   much bigger hot-path-only cloned function for whatever reason. What I'm
   assuming here is that after cold-region outlining, we may still have a
   large portion of the original function body to attempt to inline,
   whereas the pattern matching method will only contain a few basic
   blocks, giving a better chance to inline something.

  With profile data, the overhead of outlining a cold region can be
  estimated more accurately. (With the new PM), the threshold of inlining a
  hot callsite is also much higher. Without profile, the pattern matching
  method won't work too well in general even though it can enable more more
  inlining because the call overhead introduced to call the outlined
  function may outweigh the benefit of inlining the caller.

  What ever region that can be found by the pattern matching method should
  be identified by the new method as well. If there are multiple (but
  mutually exclusive) candidate regions found, the cost analysis heuristic
  should pick the best candidate region for outlining .

   For your (2) point, I think we'll have to be careful here. Without a
   sense of how 'likely' we're going to inline the new function, we'll have
   to make sure our outlining of cold regions will not degrade the
   performance of the function in 99.xx% of the cases, as it's unclear how
   much performance we'll gain from just outlining (without inlining to
   increase the odds of some performance gain). My initial thought was to
   ditch the new function and its outlined children if we cannot
   immediately inline it.

  The outlining only mode is useful to enable more aggressive inlining for
  the regular inlining pass. Slightly different heuristics can be applied
  here. For instance it can prefer largest candidate region (to maximiize
  the chance to inline the caller). The outlined region does not need to be
  super cold and leave it to the inliner to do more deeper analysis and
  decide to inline it right back in.

  David

   Graham Yiu
   LLVM Compiler Development
   IBM Toronto Software Lab
   Office: (905) 413-4077 C2-707/8200/Markham
   Email: gyiu at ca.ibm.com

   Inactive hide details for Xinliang David Li ---08/24/2017 03:05:06
   PM---On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.iXinliang
   David Li ---08/24/2017 03:05:06 PM---On Thu, Aug 24, 2017 at 10:40 AM,
   Graham Yiu <gyiu at ca.ibm.com> wrote: > Hi David,

   From: Xinliang David Li <xinliangli at gmail.com>
   To: Graham Yiu <gyiu at ca.ibm.com>
   Cc: llvm-dev <llvm-dev at lists.llvm.org>
   Date: 08/24/2017 03:05 PM

   Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
   outlining scheme for cold blocks

   On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
         Hi David,

         So I've began doing some implementation on the outlining portion
         of the code. Currently, I got the partial inliner to outline cold
         regions (single entry, single exit) of the code, based solely on
         the existence of ProfileSummaryInfo (ie. profiling data). However,
         I have some concerns on how this will co-exist with the existing
         code that peels early returns.

         The control flow looks something like this:

         // New Code: find cold regions to outline
         if (!computeOutliningInfoForColdRegions()) {
         // If we can't find any cold regions, then fall-back to early
         return peeling
         if (!computeOutliningInfo) {
         return nullptr;
         }
         }
         // Try to outline the identified regions
         // Then try to inline the cloned function

         My concern is during inlining, if we fail to inline the cloned
         function, we give up and discard all cloned and outlined
         functions. But with these two types of outlining we're doing, it's
         possible to attempt to inline the cloned function that has
         outlined cold regions, and if we cannot do so, try to inline a
         different clone that has peeled early returns (ie. the way we have
         it today). This would require us to clone the original function
         twice and modify one based on cold region outlining and the other
         early return peeling, with the latter being our fall-back option
         if we fail to inline the first clone.

         What are your thoughts?

   I expect  computeOutliningInfoForColdRegions can produce a super set of
   outlinable regions to the current 'pattern matching' approach. In other
   words, most of the cases currently caught by 'computeOutlineInfo' should
   be caught by the new algorithm, so why not ditching the current
   'computeOutlningInfo' completely?

   My suggestion was to enhance the pass to 1) support outlining multiple
   regions; and 2) add a mode to do function outlining only (not the
   inlining part).  The second is important can be used before the regular
   inliner pass.   With the new pass manager and profile aware inlining,
   the inliner won't undo the outline decision, but in meantime becomes
   more powerful due to the reduced hot function size.

   David

         Graham Yiu
         LLVM Compiler Development
         IBM Toronto Software Lab
         Office: (905) 413-4077 C2-707/8200/Markham
         Email: gyiu at ca.ibm.com

         Inactive hide details for Graham Yiu---08/15/2017 08:04:28
         PM---Hey David, Yes, we'll need to consider the effect on live
         rangeGraham Yiu---08/15/2017 08:04:28 PM---Hey David, Yes, we'll
         need to consider the effect on live ranges for regions we want to
         outline. In

         From: Graham Yiu/Toronto/IBM
         To: Xinliang David Li <xinliangli at gmail.com>
         Cc: llvm-dev <llvm-dev at lists.llvm.org>
         Date: 08/15/2017 08:04 PM
         Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
         general outlining scheme for cold blocks

         Hey David,

         Yes, we'll need to consider the effect on live ranges for regions
         we want to outline. In my experience, outlining live-exit regions
         seem to cause the most harm as we ruin chances to keep data in
         registers as you were alluding to. It's unclear, however, what the
         exact effect of outlining regions with live-entries would be.

         I'll probably try to avoid regions that are not single entry &
         single exit at least initially, to simplify the transformation and
         analysis. Are multi-exit regions common in your experience?

         And of course, I agree, we should reuse as much of the current
         partial inlining infrastructure as possible. I'll likely run some
         ideas by you as I begin to make changes.

         Cheers,

         Graham Yiu
         LLVM Compiler Development
         IBM Toronto Software Lab
         Office: (905) 413-4077 C2-407/8200/Markham
         Email: gyiu at ca.ibm.com

         Inactive hide details for Xinliang David Li ---08/15/2017 05:36:07
         PM---Hi Graham, Making partial inlining more general is some
         Xinliang David Li ---08/15/2017 05:36:07 PM---Hi Graham, Making
         partial inlining more general is something worth doing. Regarding
         your implementat

         From: Xinliang David Li <xinliangli at gmail.com>
         To: Graham Yiu <gyiu at ca.ibm.com>
         Cc: llvm-dev <llvm-dev at lists.llvm.org>
         Date: 08/15/2017 05:36 PM
         Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
         general outlining scheme for cold blocks

         Hi Graham, Making partial inlining more general is something worth
         doing.  Regarding your implementation plan, I have some
         suggestions here:

         *) Function outlining introduces additional runtime cost: passing
         of live in values, returning of live out values (via memory), glue
         code in the caller to handle regions without a single exit block
         etc.  The cost analysis needs to factor in those carefully
         *) Remove the limitation that there is only *one* outlined
         routine. Instead, the algorithm can compute multiple
         single-entry/single exit or single entry/multiple exit regions
         (cold ones) in the routine, and outline each region into its own
         function. The benefit include
            1) simplify the design and implementation and most of the
         existing code can be reused;
            2) provide more flexibility to allow most effective outlining;
            3) reduced runtime overhead of making calls to the outline
         functions.

         thanks,

         David

         On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
         llvm-dev at lists.llvm.org> wrote:
                     Hello,

                     My team and I are looking to do some enhancements in
                     the partial inliner in opt. Would appreciate any
                     feedback that folks might have.

                     # Partial Inlining in LLVM opt

                     ## Summary

                     ### Background

                     Currently, the partial inliner searches the first few
                     blocks of the callee and looks for a branch to the
                     return block (ie. early return). If found, it attempts
                     to outline the rest of the slow (or heavy) code so the
                     inliner will be able to inline the fast (or light)
                     code. If no early returns are found, the partial
                     inliner will give up. As far as I can tell,
                     BlockFrequency and BranchProbability information is
                     only used when attempting to inline the early return
                     code, and not used to determine whether to outline the
                     slow code.

                     ### Proposed changes

                     In addition to looking for early returns, we should
                     utilize profile information to outline blocks that are
                     considered cold. If we can sufficiently reduce the
                     size of the original function via this type of
                     outlining, inlining should be able to inline the rest
                     of the hot code.

                     ## Details

                     With the presence of profile information, we have a
                     view of what code is infrequently executed and make
                     better decisions on what to outline. Early return
                     blocks that are infrequently executed should still be
                     included as candidates for outlining, but will be
                     treated just like any other cold blocks. Without
                     profiling information, however, we should remain
                     conservative and only partial inline in the presence
                     of an early return in the first few blocks of a
                     function (ie. peel the early return out of the
                     function).

                     To find cold regions to outline, we will traverse the
                     CFG to find edges deemed 'cold' and look at the blocks
                     dominated by the successor node. If, for some reason,
                     that block has more than one predecessor, then we will
                     skip this candidate as there should be a node that
                     dominates this successor that has a single entry
                     point. The last node in the dominance vector should
                     also have a single successor. If it does not, then
                     further investigation of the CFG is necessary to see
                     when/how this situation occurs.

                     We will need several heuristics to make sure we only
                     outline in cases where we are confident it will result
                     in a performance gain. Things such as threshold on
                     when a branch is considered cold, the minimum number
                     of times the predecessor node has to be executed in
                     order for an edge to be considered (confidence
                     factor), and the minimum size of the region to be
                     outlined (can use inlining cost analysis like we
                     currently do) will require some level of tuning.

                     Similar to the current implementation, we will attempt
                     to inline the leftover (hot) parts of the code, and if
                     for some reason we cannot then we discard the modified
                     function and its outlined code.

                     ### Code changes

                     The current Partial Inlining code first clones the
                     function of interest and looks for a single set of
                     blocks to outline. It then creates a function with the
                     set the blocks, and saves the outlined function and
                     outline callsite information as part of the function
                     cloning container. In order to outline multiple
                     regions of the function, we will need to change these
                     containers to keep track of a list of regions to
                     outline. We will also need to update the cost analysis
                     to take into account multiple outlined functions.

                     When a ProfileSummary is available, then we should
                     skip the code that looks for early returns and go into
                     new code that looks for cold regions to outline. When
                     ProfileSummary is not available, then we can fall back
                     to the existing code and look for early returns only.

                     ### Tuning

                     - The outlining heuristics will need to determine if a
                     set of cold blocks is large enough to warrant the
                     overhead of a function call. We also don't want the
                     inliner to attempt to inline the outlined code later.
                     - The threshold for determining whether a block is
                     cold will also need to be tuned. In the case that
                     profiling information is not accurate, we will pay the
                     price of the additional call overhead for executing
                     cold code.
                     - The confidence factor, which can be viewed as the
                     minimum number of times the predecessor has to be
                     executed in order for an edge to be considered cold,
                     should also be taken into account to avoid outlining
                     code paths we have little information on.

                     Graham Yiu
                     LLVM Compiler Development
                     IBM Toronto Software Lab
                     Office: (905) 413-4077 C2-407/8200/Markham
                     Email: gyiu at ca.ibm.com

                     _______________________________________________
                     LLVM Developers mailing list
                     llvm-dev at lists.llvm.org
                     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

  _______________________________________________
  LLVM Developers mailing list
  llvm-dev at lists.llvm.org
  http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/78e34ca7/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/78e34ca7/attachment.gif>