[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Graham Yiu via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 29 10:19:10 PDT 2017
Hi David,
"Without profile, the pattern matching method won't work too well in
general even though it can enable more more inlining because the call
overhead introduced to call the outlined function may outweigh the benefit
of inlining the caller"
- I'm not sure I understand why or how there'd be any additional call
overhead if we partially inlined one or more early returns. The call
overhead of not inlining the function at all vs. call overhead of the
outlined function should be similar, right?
- i.e. if we partially inlined bar into foo, we'd have some early return
checks and then a function call to bar_outlined, but if we didn't inline
bar into foo at all, we'd still have the call overhead of calling bar.
"The outlining only mode is useful to enable more aggressive inlining for
the regular inlining pass. Slightly different heuristics can be applied
here. For instance it can prefer largest candidate region (to maximiize the
chance to inline the caller). The outlined region does not need to be super
cold and leave it to the inliner to do more deeper analysis and decide to
inline it right back in."
- Interesting point, I never thought of that. The partial inliner would
then be a 'function splitter' rather than a partial inliner at that
point. Maybe worthwhile to create a separate pass for this so we don't
have the partial inliner trying to do too many things.
- So I did some digging and it seems like the 'regular' inlining pass in
opt is ran before the partial inliner, which makes sense to me since we
want only the candidates left over from inlining which it could not
inline (maybe due to code size). Is there another inlining pass
downstream that I may have missed? Or perhaps you're referring to
inlining with ThinLTO?
Cheers,
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu at ca.ibm.com
From: Xinliang David Li <xinliangli at gmail.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/26/2017 12:53 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
On Thu, Aug 24, 2017 at 12:47 PM, Graham Yiu <gyiu at ca.ibm.com> wrote:
Hi David,
The only reason I can see to use the 'pattern matching' part as a
fall-back is in case we cannot inline the (what I'm assuming would be) a
much bigger hot-path-only cloned function for whatever reason. What I'm
assuming here is that after cold-region outlining, we may still have a
large portion of the original function body to attempt to inline, whereas
the pattern matching method will only contain a few basic blocks, giving
a better chance to inline something.
With profile data, the overhead of outlining a cold region can be estimated
more accurately. (With the new PM), the threshold of inlining a hot
callsite is also much higher. Without profile, the pattern matching method
won't work too well in general even though it can enable more more inlining
because the call overhead introduced to call the outlined function may
outweigh the benefit of inlining the caller.
What ever region that can be found by the pattern matching method should be
identified by the new method as well. If there are multiple (but mutually
exclusive) candidate regions found, the cost analysis heuristic should pick
the best candidate region for outlining .
For your (2) point, I think we'll have to be careful here. Without a
sense of how 'likely' we're going to inline the new function, we'll have
to make sure our outlining of cold regions will not degrade the
performance of the function in 99.xx% of the cases, as it's unclear how
much performance we'll gain from just outlining (without inlining to
increase the odds of some performance gain). My initial thought was to
ditch the new function and its outlined children if we cannot immediately
inline it.
The outlining only mode is useful to enable more aggressive inlining for
the regular inlining pass. Slightly different heuristics can be applied
here. For instance it can prefer largest candidate region (to maximiize the
chance to inline the caller). The outlined region does not need to be super
cold and leave it to the inliner to do more deeper analysis and decide to
inline it right back in.
David
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Xinliang David Li ---08/24/2017 03:05:06
PM---On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.iXinliang
David Li ---08/24/2017 03:05:06 PM---On Thu, Aug 24, 2017 at 10:40 AM,
Graham Yiu <gyiu at ca.ibm.com> wrote: > Hi David,
From: Xinliang David Li <xinliangli at gmail.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/24/2017 03:05 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
On Thu, Aug 24, 2017 at 10:40 AM, Graham Yiu <gyiu at ca.ibm.com> wrote:
Hi David,
So I've began doing some implementation on the outlining portion of
the code. Currently, I got the partial inliner to outline cold
regions (single entry, single exit) of the code, based solely on
the existence of ProfileSummaryInfo (ie. profiling data). However,
I have some concerns on how this will co-exist with the existing
code that peels early returns.
The control flow looks something like this:
// New Code: find cold regions to outline
if (!computeOutliningInfoForColdRegions()) {
// If we can't find any cold regions, then fall-back to early
return peeling
if (!computeOutliningInfo) {
return nullptr;
}
}
// Try to outline the identified regions
// Then try to inline the cloned function
My concern is during inlining, if we fail to inline the cloned
function, we give up and discard all cloned and outlined functions.
But with these two types of outlining we're doing, it's possible to
attempt to inline the cloned function that has outlined cold
regions, and if we cannot do so, try to inline a different clone
that has peeled early returns (ie. the way we have it today). This
would require us to clone the original function twice and modify
one based on cold region outlining and the other early return
peeling, with the latter being our fall-back option if we fail to
inline the first clone.
What are your thoughts?
I expect computeOutliningInfoForColdRegions can produce a super set of
outlinable regions to the current 'pattern matching' approach. In other
words, most of the cases currently caught by 'computeOutlineInfo' should
be caught by the new algorithm, so why not ditching the current
'computeOutlningInfo' completely?
My suggestion was to enhance the pass to 1) support outlining multiple
regions; and 2) add a mode to do function outlining only (not the
inlining part). The second is important can be used before the regular
inliner pass. With the new pass manager and profile aware inlining, the
inliner won't undo the outline decision, but in meantime becomes more
powerful due to the reduced hot function size.
David
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-707/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Graham Yiu---08/15/2017 08:04:28 PM---Hey
David, Yes, we'll need to consider the effect on live rangeGraham
Yiu---08/15/2017 08:04:28 PM---Hey David, Yes, we'll need to
consider the effect on live ranges for regions we want to outline.
In
From: Graham Yiu/Toronto/IBM
To: Xinliang David Li <xinliangli at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/15/2017 08:04 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
general outlining scheme for cold blocks
Hey David,
Yes, we'll need to consider the effect on live ranges for regions
we want to outline. In my experience, outlining live-exit regions
seem to cause the most harm as we ruin chances to keep data in
registers as you were alluding to. It's unclear, however, what the
exact effect of outlining regions with live-entries would be.
I'll probably try to avoid regions that are not single entry &
single exit at least initially, to simplify the transformation and
analysis. Are multi-exit regions common in your experience?
And of course, I agree, we should reuse as much of the current
partial inlining infrastructure as possible. I'll likely run some
ideas by you as I begin to make changes.
Cheers,
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Xinliang David Li ---08/15/2017 05:36:07
PM---Hi Graham, Making partial inlining more general is some
Xinliang David Li ---08/15/2017 05:36:07 PM---Hi Graham, Making
partial inlining more general is something worth doing. Regarding
your implementat
From: Xinliang David Li <xinliangli at gmail.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Date: 08/15/2017 05:36 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
general outlining scheme for cold blocks
Hi Graham, Making partial inlining more general is something worth
doing. Regarding your implementation plan, I have some suggestions
here:
*) Function outlining introduces additional runtime cost: passing
of live in values, returning of live out values (via memory), glue
code in the caller to handle regions without a single exit block
etc. The cost analysis needs to factor in those carefully
*) Remove the limitation that there is only *one* outlined routine.
Instead, the algorithm can compute multiple single-entry/single
exit or single entry/multiple exit regions (cold ones) in the
routine, and outline each region into its own function. The benefit
include
1) simplify the design and implementation and most of the
existing code can be reused;
2) provide more flexibility to allow most effective outlining;
3) reduced runtime overhead of making calls to the outline
functions.
thanks,
David
On Tue, Aug 15, 2017 at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hello,
My team and I are looking to do some enhancements in
the partial inliner in opt. Would appreciate any
feedback that folks might have.
# Partial Inlining in LLVM opt
## Summary
### Background
Currently, the partial inliner searches the first few
blocks of the callee and looks for a branch to the
return block (ie. early return). If found, it attempts
to outline the rest of the slow (or heavy) code so the
inliner will be able to inline the fast (or light)
code. If no early returns are found, the partial
inliner will give up. As far as I can tell,
BlockFrequency and BranchProbability information is
only used when attempting to inline the early return
code, and not used to determine whether to outline the
slow code.
### Proposed changes
In addition to looking for early returns, we should
utilize profile information to outline blocks that are
considered cold. If we can sufficiently reduce the size
of the original function via this type of outlining,
inlining should be able to inline the rest of the hot
code.
## Details
With the presence of profile information, we have a
view of what code is infrequently executed and make
better decisions on what to outline. Early return
blocks that are infrequently executed should still be
included as candidates for outlining, but will be
treated just like any other cold blocks. Without
profiling information, however, we should remain
conservative and only partial inline in the presence of
an early return in the first few blocks of a function
(ie. peel the early return out of the function).
To find cold regions to outline, we will traverse the
CFG to find edges deemed 'cold' and look at the blocks
dominated by the successor node. If, for some reason,
that block has more than one predecessor, then we will
skip this candidate as there should be a node that
dominates this successor that has a single entry point.
The last node in the dominance vector should also have
a single successor. If it does not, then further
investigation of the CFG is necessary to see when/how
this situation occurs.
We will need several heuristics to make sure we only
outline in cases where we are confident it will result
in a performance gain. Things such as threshold on when
a branch is considered cold, the minimum number of
times the predecessor node has to be executed in order
for an edge to be considered (confidence factor), and
the minimum size of the region to be outlined (can use
inlining cost analysis like we currently do) will
require some level of tuning.
Similar to the current implementation, we will attempt
to inline the leftover (hot) parts of the code, and if
for some reason we cannot then we discard the modified
function and its outlined code.
### Code changes
The current Partial Inlining code first clones the
function of interest and looks for a single set of
blocks to outline. It then creates a function with the
set the blocks, and saves the outlined function and
outline callsite information as part of the function
cloning container. In order to outline multiple regions
of the function, we will need to change these
containers to keep track of a list of regions to
outline. We will also need to update the cost analysis
to take into account multiple outlined functions.
When a ProfileSummary is available, then we should skip
the code that looks for early returns and go into new
code that looks for cold regions to outline. When
ProfileSummary is not available, then we can fall back
to the existing code and look for early returns only.
### Tuning
- The outlining heuristics will need to determine if a
set of cold blocks is large enough to warrant the
overhead of a function call. We also don't want the
inliner to attempt to inline the outlined code later.
- The threshold for determining whether a block is cold
will also need to be tuned. In the case that profiling
information is not accurate, we will pay the price of
the additional call overhead for executing cold code.
- The confidence factor, which can be viewed as the
minimum number of times the predecessor has to be
executed in order for an edge to be considered cold,
should also be taken into account to avoid outlining
code paths we have little information on.
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/4894bf7c/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170829/4894bf7c/attachment-0001.gif>
More information about the llvm-dev
mailing list