[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
River Riddle via llvm-dev
llvm-dev at lists.llvm.org
Mon Jul 24 10:53:19 PDT 2017
The code hasn't been posted to Phabricator yet, we wanted to get feedback
on the current design and approach first. The code is forked on GitHub
currently and a link was included in the original email. As for the inliner
having an effect on the potential benefits, I just rebuilt llvm-tblgen with
"-fno-inline -fno-inline-functions" to see if there was any change in
- The inlining enabled Oz binary was ~20% smaller than the no-inline
- LO still gave ~2% benefit(down from 2.5)
- E+LO gave ~2.5% benefit(down from 3.82)
During testing I have noticed that E+LO functions often have other
functions inlined into them, which may account for some of the observed
benefit. I haven't really played around with the
inlining heuristics myself, so I likely won't be able to help very much
further in that regard.
On Mon, Jul 24, 2017 at 5:38 AM, Evgeny Astigeevich <
Evgeny.Astigeevich at arm.com> wrote:
> Hi River,
> Thank you for information. Do you have your code in Phabricator?
> As I understand your pass always runs after the Inliner. At the moment
> the Inliner is too aggressive at Os/Oz. I am trying to tune Inliner
> heuristics for Oz. It would be interesting to see if there are any code
> size improvements when the Inliner is disabled. This will demonstrate
> whether the Outliner could improve code size when duplicated code is not
> due to the Inliner. Another problem I have is that C++ programs get code
> size regressions due to my changes in the inline heuristics. Have you seen
> this problem?
> Evgeny Astigeevich
> *From:* River Riddle <riddleriver at gmail.com>
> *Sent:* Friday, July 21, 2017 6:01:56 PM
> *To:* Evgeny Astigeevich
> *Cc:* llvm-dev at lists.llvm.org; nd; Kristof Beyls; gaborb at inf.u-szeged.hu
> *Subject:* Re: [llvm-dev] [RFC] Add IR level interprocedural outliner for
> code size.
> Hi Evgeny,
> I know of the current machine outliner in LLVM. If you look at the "More
> detailed performance data" link in the end section, it includes performance
> comparisons to the machine outliner.
> As for the algorithmic approach they are kind of similar.
> Machine Outliner:
> - Builds a suffix tree based on identical equivalence between machine
> instructions.
> - Uses target specific cost model for benefit estimation.
> - Uses a greedy algorithm for candidate pruning.
> - Currently supports X86, AArch64 targets.
> - Requires NoRedZone on supported targets.
> IR Outliner:
> - Builds a suffix array based upon relaxed equivalence between IR
> instructions.
> - Allows for inputs/outputs from an outlined function.
> - Complex verification of outlining candidates due to above two
> points (relaxed equivalence, inputs/outputs).
> - Tunable generic target independent cost model (estimates constant
> folded inputs, output condensing, etc.).
> - Note it uses the target info available (TargetTransformInfo)
> - Uses a greedy interval based algorithm for candidate pruning.
> - Supports profile data.
> - Supports emitting opt remarks.
> - Target independent, thus no required ABI constraints or hooks.
> Though some parts of the approach are similar, the implementation is very
> different. Also of note: the IR outliner can be expanded to also outline
> from multiple basic blocks. I've already prototyped the infrastructure for
> outlining regions, so it's possible.
> River Riddle
> On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <
> Evgeny.Astigeevich at arm.com> wrote:
>> Hi River,
>> Do you know LLVM has the MachineOutliner pass?
>> It would be interesting to compare your approach with it.
>> Evgeny Astigeevich
>> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of River
>> Riddle via llvm-dev <llvm-dev at lists.llvm.org>
>> *Sent:* Thursday, July 20, 2017 11:47:57 PM
>> *To:* llvm-dev at lists.llvm.org
>> *Subject:* [llvm-dev] [RFC] Add IR level interprocedural outliner for
>> code size.
>> I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve
>> been working on an interprocedural outlining (code folding) pass for code
>> size improvement at the IR level. We hit a couple of use cases that the
>> current code size solutions didn’t handle well enough. Outlining is one of
>> the avenues that seemed potentially beneficial.
>> -- Algorithmic Approach --
>> The general implementation can be explained in stages:
>> * Candidate Selection:
>> Each instruction in the module is mapped to a congruence class based upon
>> relaxed equivalence constraints. For most instructions this is simply the
>> type, opcode, and operand types. The constraints are tightened for
>> instructions with special state that require more exact equivalence (e.g.
>> ShuffleVector requires a constant mask vector). Candidates are then found
>> by constructing a suffix array and LCP (longest common prefix) array.
>> Walking the LCP array gives us congruent chains of instructions within the
>> module.
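The selection step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the pass's actual code: the small integers stand in for congruence-class IDs, and the suffix-array construction is deliberately naive.

```python
# Illustrative sketch of candidate selection (not the actual pass code).
# Instructions are mapped to congruence-class IDs; a suffix array over the
# ID sequence plus an LCP array exposes repeated (congruent) chains.

def build_suffix_array(seq):
    # Naive O(n^2 log n) construction; real implementations are faster.
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def build_lcp(seq, sa):
    # lcp[i] = common-prefix length of the suffixes at sa[i-1] and sa[i].
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(seq) and b + n < len(seq) and seq[a + n] == seq[b + n]:
            n += 1
        lcp[i] = n
    return lcp

# Toy "module": the class-ID chain [1, 2, 3] occurs twice.
ids = [1, 2, 3, 9, 1, 2, 3, 8]
sa = build_suffix_array(ids)
lcp = build_lcp(ids, sa)
best = max(lcp)  # length of the longest repeated chain (3 here)
```

Adjacent entries in the sorted suffix order that share a long common prefix mark congruent instruction chains, which become outlining candidates.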
>> * Candidate Analysis:
>> A verification step splits candidates that have different internal input
>> sequences or incompatible parent function attributes between occurrences.
>> An example of incompatible internal input sequences is:
>> X = W + 6;      vs      X = W + 6;
>> Y = X + 4;              Y = W + 4;
>> The above two occurrences would need special control flow to exist within
>> the same outlined function.
>> During analysis candidates have their inputs and outputs computed along
>> with an estimated benefit from extraction. During input calculation a
>> constant folding step removes all inputs that are the same amongst all
>> occurrences.
>> * Candidate Pruning:
>> Overlapping candidates are pruned with a generic greedy algorithm that
>> picks occurrences starting from the most beneficial candidates.
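A minimal sketch of that greedy pruning idea (illustrative only; the candidate representation and the interval-overlap test are assumptions, not the pass's actual data structures):

```python
# Hypothetical sketch of greedy candidate pruning: candidates carry an
# estimated benefit and a list of occurrence intervals over the module.
# Candidates are taken best-first; an occurrence is kept only if it does
# not overlap an interval already claimed by a selected candidate.

def prune(candidates):
    # candidates: list of (benefit, [(start, end), ...]) tuples.
    taken = []      # intervals claimed by already-selected occurrences
    selected = []
    for benefit, occs in sorted(candidates, key=lambda c: -c[0]):
        kept = [(s, e) for s, e in occs
                if all(e < ts or s > te for ts, te in taken)]
        if len(kept) >= 2:      # outlining needs at least two occurrences
            taken.extend(kept)
            selected.append((benefit, kept))
    return selected

cands = [
    (10, [(0, 4), (20, 24)]),    # most beneficial, claimed first
    (6,  [(2, 6), (30, 34)]),    # (2, 6) overlaps (0, 4) -> only one left
    (4,  [(40, 44), (50, 54)]),  # disjoint, still selected
]
result = prune(cands)
```

The middle candidate loses one occurrence to the overlap and is dropped, since a single remaining occurrence gives no folding benefit.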
>> * Candidate Outlining:
>> Non-pruned candidates are then outlined. Outputs from a candidate are
>> returned via an output parameter, except for one output that is promoted to
>> a return value. During outlining the inputs into the candidate are
>> condensed by computing the equivalencies between the arguments at each
>> occurrence. An example of this is:
>> outlinedFn(1,6,1); -> outlinedFn(1,6);
>> outlinedFn(2,4,2); -> outlinedFn(2,4);
>> In the above, parameters 1 and 3 were found to be equivalent for all
>> occurrences, thus the number of inputs was reduced to 2.
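The condensing step can be sketched as follows (an illustrative reconstruction built around the example above, not the actual implementation):

```python
# Hypothetical sketch of input condensing: given the argument tuple at each
# occurrence, drop any parameter position that is always identical to an
# earlier kept position across all occurrences.

def condense(call_args):
    # call_args: one argument tuple per occurrence, all the same length.
    n = len(call_args[0])
    keep = []
    for i in range(n):
        # Drop position i if some kept position j matches it at every call.
        if not any(all(args[i] == args[j] for args in call_args)
                   for j in keep):
            keep.append(i)
    return [tuple(args[i] for i in keep) for args in call_args]

# The example from the text: parameters 1 and 3 agree at every occurrence.
condensed = condense([(1, 6, 1), (2, 4, 2)])
# -> [(1, 6), (2, 4)]
```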
>> * Debug Info:
>> Debug information is preserved for the calls to functions which have been
>> outlined, but all debug info from the original outlined portions is removed,
>> making them harder to debug.
>> * Profile Info:
>> If the pass is running at Os, the outliner will only consider cold blocks,
>> whereas Oz considers all blocks that are not marked as hot.
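That gating policy amounts to something like the following (a hypothetical sketch of the described behavior; the function and value names are illustrative):

```python
# Hypothetical sketch of the profile gating described above: at -Os only
# blocks known to be cold are considered; at -Oz any block not known to be
# hot is fair game (including blocks with no profile data).

def should_consider(block_hotness, opt_level):
    # block_hotness: "hot", "cold", or "unknown" (no profile data)
    if opt_level == "Os":
        return block_hotness == "cold"
    return block_hotness != "hot"   # Oz
```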
>> * Location in Pipeline:
>> The pass is currently configured to run very late in the optimization
>> pipeline. It is intended to run at Oz but will also run at Os if there is
>> profile data available. The pass can optionally be run twice, once before
>> function simplification and then again at the default location. This run is
>> optional because it trades the potential benefits of redundancy
>> elimination against the potential benefits of function simplification. This
>> can lead to large benefits or regressions depending on the benchmark
>> (though the benefits tend to outnumber the regressions). The performance
>> section contains data for both on a variety of benchmarks.
>> -- Why at the IR level --
>> The decision to place the outliner at the IR level comes from a couple of
>> major factors:
>> - Desire to be target independent
>> - Most opportunities for congruency
>> The downside to having this type of transformation be at the IR level is
>> it means there will be less accuracy in the cost model - we can somewhat
>> accurately model the cost per instruction but we can’t get information on
>> how a window of instructions may lower. This can cause regressions
>> depending on the platform/codebase; to help alleviate this, there
>> are several tunable parameters for the cost model.
>> -- Performance --
>> More results including clang, llvm-tblgen, and more specific numbers
>> about benefits/regressions can be found in the notes section below.
>> * Size Reduction:
>> - Test Suite(X86_64):
>> - Early+Late outlining provides a geomean of 10.5% reduction over
>> clang Oz, with a largest improvement of ~67% and largest regression of
>> - Late outlining provides a geomean of 4.65% reduction, with a largest
>> improvement of ~51% and largest regression of ~6.4%.
>> - SPEC 2006 (X86_64):
>> - Early+Late outlining provides a geomean reduction of 2.08%.
>> - Late outlining provides 2.09%.
>> - CSiBE (AArch64):
>> - Early+Late outlining shows a geomean reduction of around 3.5%.
>> - Late outlining shows 3.1%.
>> * Compile Time:
>> Compile time was tested under test-suite with a multisample of 5.
>> - Early+Late outlining
>> - Many improvements with > 40% compile time reduction.
>> - Few regressions.
>> - Late outlining
>> - Greatest improvement is ~7.8%
>> - Greatest regression is ~4% with a difference of <0.02s
>> Our explanation for the compile time reduction with early outlining is
>> that redundancy elimination early in the optimization pipeline reduces the
>> amount of instruction processing during the rest of the compilation.
>> * Execution Time:
>> Ran with a multisample of 5.
>> - Test Suite:
>> - Early+Late outlining has many regressions up to 97%. The greatest
>> improvement was around 7.5%.
>> - Late outlining also has several regressions up to 44% and a greatest
>> improvement of around 5.3%.
>> - Spec:
>> - Early+Late has a geomean regression of 3.5%.
>> - Late outlining has a geomean regression of 1.25%.
>> The execution time results are to be expected given that the outliner,
>> without profile data, will extract from whatever region it deems
>> profitable. Extracting from the hot path can lead to a noticeable
>> performance regression on any platform, which can be somewhat avoided by
>> providing profile data during outlining.
>> -- Tested Improvements --
>> * MergeFunctions:
>> - No noticeable benefit.
>> * LTO:
>> - LTO doesn’t have a code size pipeline, but the percentage reductions
>> over LTO are comparable to non-LTO.
>> * Input/Output Partitioning:
>> - This identifies inputs/outputs that may be folded by splitting a
>> candidate. The benefit is minimal relative to the computation required.
>> * Similar Candidate Merging:
>> - The benefit to be gained is currently not worth the large complexity
>> required to catch the desired cases.
>> -- Potential Improvements --
>> * Suffix & LCP array construction: The pass currently uses a very basic
>> implementation that could be improved. There are definitely faster
>> algorithms and some can construct both simultaneously. We will investigate
>> this as a potential benefit for compile time in the future.
>> * Os performance tuning: Under -Os the pass currently only runs on cold
>> blocks. Ideally we could expand this to be a little more lax on less
>> frequently executed blocks that aren’t cold.
>> * Candidate Selection: The algorithm currently focuses on the longest
>> common sequences. More checks could be added to see if shortening the
>> sequence length produces a larger benefit (e.g. fewer inputs/outputs). This
>> would likely lead to an increase in compile time but should remain less
>> than the baseline.
>> * Non Exact Functions: The outliner currently ignores functions that do
>> not have an exact definition.
>> -- Notes --
>> * CSiBE(Code Size Benchmark):
>> * More detailed performance data:
>> * Implementation: