[llvm-dev] [RFC] Enhance Partial Inliner by using a general outlining scheme for cold blocks
Graham Yiu via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 15 17:21:44 PDT 2017
Hey River,
It's good to know that you have been down this path before, hopefully I can
share with you any roadblocks I'll encounter along the way.
Your first example illustrates a very good point, where ideally we should
take into account a coldness-to-size ratio. However, this type of case
requires a lot of additional analysis to get right, and may not give you
the most 'bang-for-your-buck'. With that said, I think it's worthwhile to
build in some flexibility as I begin implementation in case we want to
tackle something like this in the future.
Second example is probably outside of the scope of what I'm trying to do,
at least in the immediate term, but interesting nonetheless.
Cheers,
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
From: River Riddle <riddleriver at gmail.com>
To: Xinliang David Li <xinliangli at gmail.com>
Cc: Graham Yiu <gyiu at ca.ibm.com>, llvm-dev
<llvm-dev at lists.llvm.org>
Date: 08/15/2017 07:27 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a general
outlining scheme for cold blocks
On Tue, Aug 15, 2017 at 4:22 PM Xinliang David Li <xinliangli at gmail.com>
wrote:
On Tue, Aug 15, 2017 at 4:14 PM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hey Graham,
I worked on pretty much this exact thing last year. I did something
similar to what you described, I traversed the CFG and built potentially
profitable regions from any given valid start node. At that point there
were several road blocks that prevented it from really going anywhere
( like the inliner not preserving pgo data) but those problems should be
gone by now.
Some thoughts:
: In terms of cost modeling, you will want to consider the interaction
with loops:
- Whats the impact of outlining from inside of a loop.
- What if the hot region does not include the entry/exit, i.e hot loop.
: Something that would also benefit, but would require more tuning, is
the relationship between size and "hotness". If for example you have a
function that looks like this:
foo(bool B) {
... // Code that prevents early return.
if(B) // RegionA : Taken 50% of the time.
... Very large amount of code.
else // RegionB : Taken 50% of the time.
... Very small amount of code.
}
In the above you have two regions that aren't exactly cold. Outlining
RegionA will make foo profitable to inline, but only if you take into
account its large size. Using profile data as the sole cutoff will miss
this opportunity.
Modelling the cost/benefit for this case can be very tricky though.
Most definitely, but it's something to think about when moving forward.
: You may also want to look into hot switch statement splitting, though
I don't know if it should really go into the partial inliner. This is
splitting a switch statement into multiple switch statements if there
are only a few cases that are actually hot.
SWITCH1:
switch(val) {
HOT: ...
COLD1: ...
COLD2: ...
....
COLDN: ...
default: ...
}
would get transformed into:
SWITCH1:
switch(val) {
HOT: ...
default:
br SWITCH2:
}
SWITCH2:
switch(val) {
COLD1: ...
COLD2: ...
....
COLDN: ...
default: ...
}
This is profile based switch peeling -- which is a separate optimization
from partial inlining.
David
Ah thanks, I wasn't sure of the name.
River Riddle
On Tue, Aug 15, 2017 at 1:14 PM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hi Jessica,
Thanks for the feedback.
I believe the existing partial inliner pass does use some common
utilities like the code extractor to do outlining.
AFAIK, Code extractor is the only IR level utility for outlining, but
it's not the best and definitely needs some improvement. It currently
returns outputs via live out pointer parameters, which means that
outlined functions can't be tagged with attributes like readonly.
There were also some known problems with extracting in the presence of
critical edges, but I don't know if thats still the case.
Is that what you're referring to? I don't recall seeing any other
passes that has outlining other than the machine outliner, but I may
have missed something.
I briefly looked at River's RFC and it seems he's mainly utilizing
outlining as a tool to reduce code size while maintaining code
performance.
The main point of the outliner I proposed in that RFC is benefiting code
size. The use of PGO data is to simply avoid detrimenting runtime
performance.
Our proposal is to improve on the existing outlining scheme of the
partial inliner to give us more opportunities to inline hot code. We
will only do this type of general outlining with the presence of
profiling or user data (such as builtin_expect).
With that said, I am interested in how River plans to handle live
ranges
It's a pretty difficult problem. I have estimation for register usage
within an outlined region, using something similar to the register cost
calculation in the loop vectorizer. I don't currently have anything for
the actual call to the function itself. I haven't found a nice enough
solution that can work for a large number of candidates.
and heuristics he will use with profiling data available.
As I said above, the use of profile data, currently, is to avoid
regressing execution time. So the heuristics are basically "Is this too
hot to outline from". That outliner also has a completely different
goal/cost model, which makes applying it to this problem difficult.
Thanks,
River Riddle
I haven't looked at the machine outliner yet, but my understanding is
it also prioritizes code size reduction over performance gains. Is my
assumption correct? Do you think there's some tuning we can do in the
machine outliner to tailor for performance improvements?
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
Inactive hide details for Jessica Paquette ---08/15/2017 02:45:06
PM---Hi Graham, Have you looked at using the existing outlineJessica
Paquette ---08/15/2017 02:45:06 PM---Hi Graham, Have you looked at
using the existing outliner framework to implement partial inlining? I
From: Jessica Paquette <jpaquette at apple.com>
To: Graham Yiu <gyiu at ca.ibm.com>
Cc: llvm-dev at lists.llvm.org
Date: 08/15/2017 02:45 PM
Subject: Re: [llvm-dev] [RFC] Enhance Partial Inliner by using a
general outlining scheme for cold blocks
Sent by: jpaquette at apple.com
Hi Graham,
Have you looked at using the existing outliner framework to implement
partial inlining? In the past, people have combined outlining +
inlining to get partial inlining “for free” (see "Function outlining
and partial inlining” by Zhao and Amaral).
There is the MachineOutliner. River Riddle has been working on an
IR-level outliner as well. The MachineOutliner doesn’t currently
consider the heat of a block, but might be worth experimenting with.
It might be good to experiment with River’s work and the existing
outlining pass to see if this functionality can effectively fall out
of what we have already.
Here’s the recent IR-level outlining thread for reference wrt River’s
work: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html
- Jessica
On Aug 15, 2017, at 11:22 AM, Graham Yiu via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
Hello,
My team and I are looking to do some enhancements in the
partial inliner in opt. Would appreciate any feedback that
folks might have.
# Partial Inlining in LLVM opt
## Summary
### Background
Currently, the partial inliner searches the first few
blocks of the callee and looks for a branch to the return
block (ie. early return). If found, it attempts to outline
the rest of the slow (or heavy) code so the inliner will
be able to inline the fast (or light) code. If no early
returns are found, the partial inliner will give up. As
far as I can tell, BlockFrequency and BranchProbability
information is only used when attempting to inline the
early return code, and not used to determine whether to
outline the slow code.
### Proposed changes
In addition to looking for early returns, we should
utilize profile information to outline blocks that are
considered cold. If we can sufficiently reduce the size of
the original function via this type of outlining, inlining
should be able to inline the rest of the hot code.
## Details
With the presence of profile information, we have a view
of what code is infrequently executed and make better
decisions on what to outline. Early return blocks that are
infrequently executed should still be included as
candidates for outlining, but will be treated just like
any other cold blocks. Without profiling information,
however, we should remain conservative and only partial
inline in the presence of an early return in the first few
blocks of a function (ie. peel the early return out of the
function).
To find cold regions to outline, we will traverse the CFG
to find edges deemed 'cold' and look at the blocks
dominated by the successor node. If, for some reason, that
block has more than one predecessor, then we will skip
this candidate as there should be a node that dominates
this successor that has a single entry point. The last
node in the dominance vector should also have a single
successor. If it does not, then further investigation of
the CFG is necessary to see when/how this situation
occurs.
We will need several heuristics to make sure we only
outline in cases where we are confident it will result in
a performance gain. Things such as threshold on when a
branch is considered cold, the minimum number of times the
predecessor node has to be executed in order for an edge
to be considered (confidence factor), and the minimum size
of the region to be outlined (can use inlining cost
analysis like we currently do) will require some level of
tuning.
Similar to the current implementation, we will attempt to
inline the leftover (hot) parts of the code, and if for
some reason we cannot then we discard the modified
function and its outlined code.
### Code changes
The current Partial Inlining code first clones the
function of interest and looks for a single set of blocks
to outline. It then creates a function with the set the
blocks, and saves the outlined function and outline
callsite information as part of the function cloning
container. In order to outline multiple regions of the
function, we will need to change these containers to keep
track of a list of regions to outline. We will also need
to update the cost analysis to take into account multiple
outlined functions.
When a ProfileSummary is available, then we should skip
the code that looks for early returns and go into new code
that looks for cold regions to outline. When
ProfileSummary is not available, then we can fall back to
the existing code and look for early returns only.
### Tuning
- The outlining heuristics will need to determine if a set
of cold blocks is large enough to warrant the overhead of
a function call. We also don't want the inliner to attempt
to inline the outlined code later.
- The threshold for determining whether a block is cold
will also need to be tuned. In the case that profiling
information is not accurate, we will pay the price of the
additional call overhead for executing cold code.
- The confidence factor, which can be viewed as the
minimum number of times the predecessor has to be executed
in order for an edge to be considered cold, should also be
taken into account to avoid outlining code paths we have
little information on.
Graham Yiu
LLVM Compiler Development
IBM Toronto Software Lab
Office: (905) 413-4077 C2-407/8200/Markham
Email: gyiu at ca.ibm.com
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
--
Thank you,
River Riddle
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/41523463/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170815/41523463/attachment-0001.gif>
More information about the llvm-dev
mailing list