[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Graham Yiu via llvm-dev
llvm-dev at lists.llvm.org
Thu Aug 24 12:47:09 PDT 2017
Hi David,
The only reason I can see to use the 'pattern matching' part as a fall-back
is in case we cannot inline the (what I'm assuming would be) a much bigger
hot-path-only cloned function for whatever reason. What I'm assuming here
is that after cold-region outlining, we may still have a large portion of
the original function body to attempt to inline, whereas the pattern
matching method will only contain a few basic blocks, giving a better
chance to inline something.
For your (2) point, I think we'll have to be careful here. Without a sense
of how 'likely' we're going to inline the new function, we'll have to make
sure our outlining of cold regions will not degrade the performance of the
function in 99.xx% of the cases, as it's unclear how much performance we'll
gain from just outlining (without inlining to increase the odds of some
performance gain). My initial thought was to ditch the new function and
its outlined children if we cannot immediately inline it.
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu at ca.ibm.com
From: Xinliang David Li <xinliangli at gmail.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/24/2017 03:05 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
Hi David,
So I've began doing some implementation on the outlining portion of the
code. Currently, I got the partial inliner to outline cold regions
(single entry, single exit) of the code, based solely on the existence of
ProfileSummaryInfo (ie. profiling data). However, I have some concerns on
how this will co-exist with the existing code that peels early returns.
The control flow looks something like this:
// New Code: find cold regions to outline
if (!computeOutliningInfoForColdRegions()) {
// If we can't find any cold regions, then fall-back to early return
peeling
if (!computeOutliningInfo) {
return nullptr;
}
}
// Try to outline the identified regions
// Then try to inline the cloned function
My concern is during inlining, if we fail to inline the cloned function,
we give up and discard all cloned and outlined functions. But with these
two types of outlining we're doing, it's possible to attempt to inline
the cloned function that has outlined cold regions, and if we cannot do
so, try to inline a different clone that has peeled early returns (ie.
the way we have it today). This would require us to clone the original
function twice and modify one based on cold region outlining and the
other early return peeling, with the latter being our fall-back option if
we fail to inline the first clone.
What are your thoughts?
I expect computeOutliningInfoForColdRegions can produce a super set of
outlinable regions to the current 'pattern matching' approach. In other
words, most of the cases currently caught by 'computeOutlineInfo' should be
caught by the new algorithm, so why not ditching the current
'computeOutlningInfo' completely?
My suggestion was to enhance the pass to 1) support outlining multiple
regions; and 2) add a mode to do function outlining only (not the inlining
part). The second is important can be used before the regular inliner
pass. With the new pass manager and profile aware inlining, the inliner
won't undo the outline decision, but in meantime becomes more powerful due
to the reduced hot function size.
David
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Graham Yiu---08/15/2017 08:04:28 PM---Hey
David, Yes, we'll need to consider the effect on live rangeGraham
Yiu---08/15/2017 08:04:28 PM---Hey David, Yes, we'll need to consider the
effect on live ranges for regions we want to outline. In
From: Graham Yiu/Toronto/IBM
To: Xinliang David Li <xinliangli at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/15/2017 08:04 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
Hey David,
Yes, we'll need to consider the effect on live ranges for regions we want
to outline. In my experience, outlining live-exit regions seem to cause
the most harm as we ruin chances to keep data in registers as you were
alluding to. It's unclear, however, what the exact effect of outlining
regions with live-entries would be.
I'll probably try to avoid regions that are not single entry & single
exit at least initially, to simplify the transformation and analysis. Are
multi-exit regions common in your experience?
And of course, I agree, we should reuse as much of the current partial
inlining infrastructure as possible. I'll likely run some ideas by you as
I begin to make changes.
Cheers,
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Xinliang David Li ---08/15/2017 05:36:07
PM---Hi Graham, Making partial inlining more general is someXinliang
David Li ---08/15/2017 05:36:07 PM---Hi Graham, Making partial inlining
more general is something worth doing. Regarding your implementat
From: Xinliang David Li <xinliangli at gmail.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/15/2017 05:36 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
Hi Graham, Making partial inlining more general is something worth
doing. Regarding your implementation plan, I have some suggestions here:
*) Function outlining introduces additional runtime cost: passing of live
in values, returning of live out values (via memory), glue code in the
caller to handle regions without a single exit block etc. The cost
analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single exit or
single entry/multiple exit regions (cold ones) in the routine, and
outline each region into its own function. The benefit include
1) simplify the design and implementation and most of the existing
code can be reused;
2) provide more flexibility to allow most effective outlining;
3) reduced runtime overhead of making calls to the outline functions.
thanks,
David
On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hello,
My team and I are looking to do some enhancements in the partial
inliner in opt. Would appreciate any feedback that folks might
have.
# Partial Inlining in LLVM opt
## Summary
### Background
Currently, the partial inliner searches the first few blocks of the
callee and looks for a branch to the return block (ie. early
return). If found, it attempts to outline the rest of the slow (or
heavy) code so the inliner will be able to inline the fast (or
light) code. If no early returns are found, the partial inliner
will give up. As far as I can tell, BlockFrequency and
BranchProbability information is only used when attempting to
inline the early return code, and not used to determine whether to
outline the slow code.
### Proposed changes
In addition to looking for early returns, we should utilize profile
information to outline blocks that are considered cold. If we can
sufficiently reduce the size of the original function via this type
of outlining, inlining should be able to inline the rest of the hot
code.
## Details
With the presence of profile information, we have a view of what
code is infrequently executed and make better decisions on what to
outline. Early return blocks that are infrequently executed should
still be included as candidates for outlining, but will be treated
just like any other cold blocks. Without profiling information,
however, we should remain conservative and only partial inline in
the presence of an early return in the first few blocks of a
function (ie. peel the early return out of the function).
To find cold regions to outline, we will traverse the CFG to find
edges deemed 'cold' and look at the blocks dominated by the
successor node. If, for some reason, that block has more than one
predecessor, then we will skip this candidate as there should be a
node that dominates this successor that has a single entry point.
The last node in the dominance vector should also have a single
successor. If it does not, then further investigation of the CFG is
necessary to see when/how this situation occurs.
We will need several heuristics to make sure we only outline in
cases where we are confident it will result in a performance gain.
Things such as threshold on when a branch is considered cold, the
minimum number of times the predecessor node has to be executed in
order for an edge to be considered (confidence factor), and the
minimum size of the region to be outlined (can use inlining cost
analysis like we currently do) will require some level of tuning.
Similar to the current implementation, we will attempt to inline
the leftover (hot) parts of the code, and if for some reason we
cannot then we discard the modified function and its outlined code.
### Code changes
The current Partial Inlining code first clones the function of
interest and looks for a single set of blocks to outline. It then
creates a function with the set the blocks, and saves the outlined
function and outline callsite information as part of the function
cloning container. In order to outline multiple regions of the
function, we will need to change these containers to keep track of
a list of regions to outline. We will also need to update the cost
analysis to take into account multiple outlined functions.
When a ProfileSummary is available, then we should skip the code
that looks for early returns and go into new code that looks for
cold regions to outline. When ProfileSummary is not available, then
we can fall back to the existing code and look for early returns
only.
### Tuning
- The outlining heuristics will need to determine if a set of cold
blocks is large enough to warrant the overhead of a function call.
We also don't want the inliner to attempt to inline the outlined
code later.
- The threshold for determining whether a block is cold will also
need to be tuned. In the case that profiling information is not
accurate, we will pay the price of the additional call overhead for
executing cold code.
- The confidence factor, which can be viewed as the minimum number
of times the predecessor has to be executed in order for an edge to
be considered cold, should also be taken into account to avoid
outlining code paths we have little information on.
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/3dbca29d/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170824/3dbca29d/attachment.gif>
More information about the llvm-dev
mailing list