[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 24 21:17:46 PDT 2017


On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
> working on an interprocedural outlining (code folding) pass for code size
> improvement at the IR level. We hit a couple of use cases that the current
> code size solutions didn’t handle well enough. Outlining is one of the
> avenues that seemed potentially beneficial.
>
> -- Algorithmic Approach --
>
> The general implementation can be explained in stages:
>
> * Candidate Selection:
>
> Each instruction in the module is mapped to a congruence class based upon
> relaxed equivalency constraints. For most instructions this is simply the
> type, opcode, and operand types. The constraints are tightened for
> instructions with special state that require more exact equivalence (e.g.
> ShuffleVector requires a constant mask vector). Candidates are then found
> by constructing a suffix array and an LCP (longest common prefix) array.
> Walking the LCP array gives us congruent chains of instructions within the
> module.
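>
> As a rough illustration of that selection step (a standalone sketch, not
> the pass's actual code), the example below assumes the module has already
> been flattened into a string of congruence-class ids; it builds a naive
> suffix array, computes the LCP array, and reports repeated chains. A real
> implementation would use a faster construction and deduplicate sub-chains.
>
>   #include <algorithm>
>   #include <cstdio>
>   #include <vector>
>
>   // Each instruction has already been hashed to a congruence-class id.
>   using IdString = std::vector<unsigned>;
>
>   // Naive suffix array: sort suffix start offsets lexicographically.
>   static std::vector<unsigned> buildSuffixArray(const IdString &S) {
>     std::vector<unsigned> SA(S.size());
>     for (unsigned i = 0; i < S.size(); ++i)
>       SA[i] = i;
>     std::sort(SA.begin(), SA.end(), [&](unsigned A, unsigned B) {
>       return std::lexicographical_compare(S.begin() + A, S.end(),
>                                           S.begin() + B, S.end());
>     });
>     return SA;
>   }
>
>   // LCP[i] = common prefix length of the suffixes at SA[i-1] and SA[i].
>   static std::vector<unsigned> buildLCP(const IdString &S,
>                                         const std::vector<unsigned> &SA) {
>     std::vector<unsigned> LCP(S.size(), 0);
>     for (unsigned i = 1; i < SA.size(); ++i) {
>       unsigned A = SA[i - 1], B = SA[i], L = 0;
>       while (A + L < S.size() && B + L < S.size() && S[A + L] == S[B + L])
>         ++L;
>       LCP[i] = L;
>     }
>     return LCP;
>   }
>
>   int main() {
>     // Toy module: the chain 1,2,3 repeats at offsets 0 and 5.
>     IdString Ids = {1, 2, 3, 4, 7, 1, 2, 3, 9};
>     std::vector<unsigned> SA = buildSuffixArray(Ids);
>     std::vector<unsigned> LCP = buildLCP(Ids, SA);
>     // Walking the LCP array: adjacent sorted suffixes with a shared
>     // prefix of length >= 2 are congruent chains of instructions.
>     for (unsigned i = 1; i < SA.size(); ++i)
>       if (LCP[i] >= 2)
>         std::printf("chain of length %u at offsets %u and %u\n", LCP[i],
>                     SA[i - 1], SA[i]);
>   }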
>
> * Candidate Analysis:
>
> A verification step splits candidates that have different internal input
> sequences or incompatible parent function attributes between occurrences.
> An example of incompatible internal input sequences is:
>
> X = W + 6;   vs    X = W + 6;
>
> Y = X + 4;            Y = W + 4;
>
> The above two occurrences would need special control flow to exist within
> the same outlined function.
>
> During analysis, candidates have their inputs and outputs computed, along
> with an estimated benefit from extraction. During input calculation, a
> constant folding step removes all inputs that are the same amongst all
> occurrences.
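>
> A small standalone sketch of that folding step (illustrative only; inputs
> are modeled here as plain strings rather than IR values):
>
>   #include <cstdio>
>   #include <string>
>   #include <vector>
>
>   // One row per occurrence, one column per external input of the
>   // candidate.
>   using InputTable = std::vector<std::vector<std::string>>;
>
>   // A column is folded away when every occurrence feeds it the exact
>   // same value; the surviving columns become parameters.
>   static std::vector<unsigned> liveInputColumns(const InputTable &In) {
>     std::vector<unsigned> Live;
>     if (In.empty())
>       return Live;
>     for (unsigned Col = 0; Col < In[0].size(); ++Col) {
>       bool AllSame = true;
>       for (unsigned Row = 1; Row < In.size(); ++Row)
>         if (In[Row][Col] != In[0][Col]) {
>           AllSame = false;
>           break;
>         }
>       if (!AllSame)
>         Live.push_back(Col);
>     }
>     return Live;
>   }
>
>   int main() {
>     // Two occurrences with three inputs; the middle input is the
>     // constant 6 in both, so it can be materialized inside the body.
>     InputTable In = {{"%a", "6", "%p"}, {"%b", "6", "%q"}};
>     for (unsigned Col : liveInputColumns(In))
>       std::printf("input %u stays a parameter\n", Col);
>   }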
>
> * Candidate Pruning:
>
> Overlapping candidates are pruned with a generic greedy algorithm that
> picks occurrences starting from the most beneficial candidates.
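>
> A minimal sketch of that greedy pruning (names and data layout invented
> for the example; the real pass works over IR instructions, not indices):
>
>   #include <algorithm>
>   #include <cstdio>
>   #include <utility>
>   #include <vector>
>
>   // An occurrence covers instruction indices [Start, Start + Len).
>   struct Occurrence { unsigned Start, Len; };
>   struct Candidate {
>     unsigned Benefit; // estimated size saving
>     std::vector<Occurrence> Occs;
>   };
>
>   // Visit candidates from most to least beneficial and keep only the
>   // occurrences whose instructions have not already been claimed.
>   static void pruneOverlaps(std::vector<Candidate> &Cands,
>                             unsigned NumInsts) {
>     std::vector<bool> Claimed(NumInsts, false);
>     std::sort(Cands.begin(), Cands.end(),
>               [](const Candidate &A, const Candidate &B) {
>                 return A.Benefit > B.Benefit;
>               });
>     for (Candidate &C : Cands) {
>       std::vector<Occurrence> Kept;
>       for (const Occurrence &O : C.Occs) {
>         bool Overlaps = false;
>         for (unsigned i = O.Start; i < O.Start + O.Len; ++i)
>           if (Claimed[i]) {
>             Overlaps = true;
>             break;
>           }
>         if (Overlaps)
>           continue;
>         for (unsigned i = O.Start; i < O.Start + O.Len; ++i)
>           Claimed[i] = true;
>         Kept.push_back(O);
>       }
>       C.Occs = std::move(Kept);
>     }
>   }
>
>   int main() {
>     std::vector<Candidate> Cands = {
>         {10, {{0, 4}, {20, 4}}}, // claimed first
>         {4, {{2, 3}, {30, 3}}}}; // first occurrence overlaps, dropped
>     pruneOverlaps(Cands, 40);
>     for (const Candidate &C : Cands)
>       std::printf("benefit %u keeps %zu occurrence(s)\n", C.Benefit,
>                   C.Occs.size());
>   }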
>
> * Candidate Outlining:
>
> Non-pruned candidates are then outlined. Outputs from a candidate are
> returned via output parameters, except for one output that is promoted to
> a return value. During outlining, the inputs into the candidate are
> condensed by computing the equivalences between the arguments at each
> occurrence. An example of this is:
>
> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>
> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>
> In the above, parameters 1 and 3 were found to be equivalent for all
> occurrences, so the number of inputs was reduced to 2.
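>
> A standalone sketch of that condensing step (arguments are modeled as
> strings; the real pass compares the IR values at each call site):
>
>   #include <cstdio>
>   #include <string>
>   #include <vector>
>
>   // One row per occurrence, one column per input of the outlined body.
>   using ArgTable = std::vector<std::vector<std::string>>;
>
>   // Map each parameter to the earliest parameter whose argument it
>   // equals at every occurrence; parameters mapping to themselves stay.
>   static std::vector<unsigned> condenseParams(const ArgTable &Args) {
>     unsigned NumParams = Args.empty() ? 0 : Args[0].size();
>     std::vector<unsigned> RemapTo(NumParams);
>     for (unsigned J = 0; J < NumParams; ++J) {
>       RemapTo[J] = J;
>       for (unsigned I = 0; I < J; ++I) {
>         bool AlwaysEqual = true;
>         for (const std::vector<std::string> &Row : Args)
>           if (Row[I] != Row[J]) {
>             AlwaysEqual = false;
>             break;
>           }
>         if (AlwaysEqual) {
>           RemapTo[J] = I;
>           break;
>         }
>       }
>     }
>     return RemapTo;
>   }
>
>   int main() {
>     // outlinedFn(1,6,1) and outlinedFn(2,4,2): the third argument always
>     // equals the first, so only two parameters are needed.
>     ArgTable Args = {{"1", "6", "1"}, {"2", "4", "2"}};
>     std::vector<unsigned> RemapTo = condenseParams(Args);
>     for (unsigned J = 0; J < RemapTo.size(); ++J)
>       std::printf("param %u -> param %u\n", J, RemapTo[J]);
>   }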
>
> * Debug Info:
>
> Debug information is preserved for the calls to functions which have been
> outlined, but all debug info from the original outlined portions is
> removed, making those regions harder to debug.
>
> * Profile Info:
>
> If the pass is running at Os, the outliner will only consider cold blocks,
> whereas at Oz it considers all blocks that are not marked as hot.
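>
> A minimal sketch of how that policy could be queried, assuming a
> ProfileSummaryInfo/BlockFrequencyInfo pair is available (this is not the
> pass's actual code, and the helper name is invented):
>
>   #include "llvm/Analysis/BlockFrequencyInfo.h"
>   #include "llvm/Analysis/ProfileSummaryInfo.h"
>   #include "llvm/IR/BasicBlock.h"
>
>   using namespace llvm;
>
>   // OptForMinSize corresponds to -Oz; otherwise -Os is assumed here.
>   static bool mayOutlineBlock(const BasicBlock &BB, bool OptForMinSize,
>                               ProfileSummaryInfo *PSI,
>                               BlockFrequencyInfo *BFI) {
>     if (!PSI || !PSI->hasProfileSummary())
>       return OptForMinSize; // without profile data, only -Oz outlines
>     if (OptForMinSize)
>       return !PSI->isHotBlock(&BB, BFI); // -Oz: anything not hot
>     return PSI->isColdBlock(&BB, BFI);   // -Os: cold blocks only
>   }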
>
> * Location in Pipeline:
>
> The pass is currently configured to run very late in the optimization
> pipeline. It is intended to run at Oz but will also run at Os if there is
> profile data available. The pass can optionally be run twice, once before
> function simplification and then again at the default location. The early
> run is optional because it trades the potential benefits of early
> redundancy elimination against the potential benefits of function
> simplification. This can lead to large benefits or regressions depending on
> the benchmark (though the benefits tend to outnumber the regressions). The
> performance section contains data for both configurations on a variety of
> benchmarks.
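>
> For readers wondering how such a dual placement could be wired up, here is
> a minimal sketch using the legacy PassManagerBuilder extension points
> (EP_ModuleOptimizerEarly roughly corresponds to "before function
> simplification"); the factory name is hypothetical, not the pass's real
> interface:
>
>   #include "llvm/IR/LegacyPassManager.h"
>   #include "llvm/Pass.h"
>   #include "llvm/Transforms/IPO/PassManagerBuilder.h"
>
>   using namespace llvm;
>
>   // Hypothetical factory for the outliner pass.
>   ModulePass *createCodeSizeOutlinerPass();
>
>   static void registerOutliner(PassManagerBuilder &PMB, bool RunEarly) {
>     if (RunEarly) // optional run before function simplification
>       PMB.addExtension(PassManagerBuilder::EP_ModuleOptimizerEarly,
>                        [](const PassManagerBuilder &,
>                           legacy::PassManagerBase &PM) {
>                          PM.add(createCodeSizeOutlinerPass());
>                        });
>     // Default placement: very late in the module pipeline.
>     PMB.addExtension(PassManagerBuilder::EP_OptimizerLast,
>                      [](const PassManagerBuilder &,
>                         legacy::PassManagerBase &PM) {
>                        PM.add(createCodeSizeOutlinerPass());
>                      });
>   }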
>
> -- Why at the IR level --
>
> The decision to place the outliner at the IR level comes from a couple of
> major factors:
>
> - Desire to be target independent
>
> - Most opportunities for congruency
>
> The downside of having this type of transformation at the IR level is a
> less accurate cost model: we can model the cost per instruction reasonably
> well, but we can't get information on how a window of instructions will
> lower. This can cause regressions depending on the platform/codebase, so to
> help alleviate this the cost model exposes several tunable parameters.
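>
> To give a feel for the kind of tunables involved (the flag names below are
> invented for illustration and are not the pass's actual options), a
> stripped-down size model might look like:
>
>   #include "llvm/Support/CommandLine.h"
>
>   using namespace llvm;
>
>   static cl::opt<unsigned>
>       CallCost("outliner-call-cost", cl::init(1),
>                cl::desc("Assumed size cost of a call to an outlined "
>                         "function"));
>   static cl::opt<unsigned>
>       MinBenefit("outliner-min-benefit", cl::init(1),
>                  cl::desc("Minimum estimated size saving required to "
>                           "outline"));
>
>   // Rough model: outlining a sequence of estimated size SeqCost that
>   // occurs NumOccurrences times replaces every occurrence with a call
>   // and adds one copy of the body (function overhead ignored here).
>   static bool worthOutlining(unsigned SeqCost, unsigned NumOccurrences) {
>     unsigned SizeBefore = SeqCost * NumOccurrences;
>     unsigned SizeAfter = CallCost * NumOccurrences + SeqCost;
>     return SizeBefore > SizeAfter + MinBenefit;
>   }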
>
> -- Performance --
>
> More results including clang, llvm-tblgen, and more specific numbers about
> benefits/regressions can be found in the notes section below.
>
> * Size Reduction:
>
> - Test Suite (X86_64):
>
>    - Early+Late outlining provides a geomean reduction of 10.5% over clang
> Oz, with the largest improvement ~67% and the largest regression ~7.5%.
>
>    - Late outlining provides a geomean reduction of 4.65%, with the largest
> improvement ~51% and the largest regression ~6.4%.
>
> - Spec 2006 (X86_64)
>
>    - Early+Late outlining provides a geomean reduction of 2.08%.
>
>    - Late outlining provides 2.09%.
>
> - CSiBE (AArch64)
>
>    - Early+Late outlining shows geomean reduction of around 3.5%.
>
>    - Late outlining shows 3.1%.
>


It would be good to visualize these results. Maybe a bar chart like
https://goo.gl/qN2HqA from
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html for
SPEC?


>
> * Compile Time:
>
> Compile time was tested under test-suite with a multisample of 5.
>
> - Early+Late outlining
>
>    - Many improvements with > 40% compile time reduction.
>
>    - Few regressions.
>
> - Late outlining
>
>    - Greatest improvement is ~7.8%
>
>    - Greatest regression is ~4% with a difference of <0.02s
>
> Our explanation for the compile time reduction with early outlining is
> that removing redundancy early in the optimization pipeline leaves fewer
> instructions to process during the rest of the compilation.
>
> * Execution Time:
>
> Ran with a multisample of 5.
>
> - Test Suite:
>
>    - Early+Late outlining has many regressions, the largest being ~97%.
> The greatest improvement was around 7.5%.
>
>    - Late outlining also has several regressions, the largest being ~44%,
> and a greatest improvement of around 5.3%.
>
> - Spec:
>
>    - Early+Late has a geomean regression of 3.5%.
>
>    - Late outlining has a geomean regression of 1.25%.
>
> The execution time results are to be expected given that the outliner,
> without profile data, will extract from whatever region it deems
> profitable. Extracting from the hot path can lead to a noticeable
> performance regression on any platform, which can be somewhat avoided by
> providing profile data during outlining.
>
> -- Tested Improvements --
>
> * MergeFunctions:
>
>    - No noticeable benefit.
>
> * LTO:
>
>    - LTO doesn’t have a code size pipeline, but the % reductions over LTO
> are comparable to those without LTO.
>

-Os/-Oz are communicated through the optsize and minsize attributes. There
isn't a specific code size pipeline per se (I think this is true for per-TU
compilation as well, though I would have to check).

Also, can you clarify what you mean by "LTO"? I assume this means that the
outliner did not run during per-TU compilation but did run on the combined
FullLTO bitcode, but want to check to be sure.

-- Sean Silva


> * Input/Output Partitioning:
>
>    - This identifies inputs/outputs that may be folded by splitting a
> candidate. The benefit is minimal relative to the computation required.
>
> * Similar Candidate Merging:
>
>    - The benefit to be gained is currently not worth the large complexity
> required to catch the desired cases.
>
> -- Potential Improvements --
>
> * Suffix & LCP array construction: The pass currently uses a very basic
> implementation that could be improved. There are definitely faster
> algorithms and some can construct both simultaneously. We will investigate
> this as a potential benefit for compile time in the future.
>
> * Os performance tuning: Under -Os the pass currently only runs on cold
> blocks. Ideally we could expand this to be a little more lax on less
> frequently executed blocks that aren’t cold.
>
> * Candidate Selection: The algorithm currently focuses on the longest
> common sequences. More checks could be added to see if shortening the
> sequence length produces a larger benefit (e.g. fewer inputs/outputs). This
> would likely lead to an increase in compile time but should remain less
> than the baseline.
>
> * Non Exact Functions: The outliner currently ignores functions that do
> not have an exact definition.
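>
> For reference, a check like the following (helper name invented) is the
> kind of guard implied; hasExactDefinition() is part of the existing
> GlobalValue API:
>
>   #include "llvm/IR/Function.h"
>
>   // Skip bodies that the linker may replace (declarations, weak or
>   // otherwise interposable definitions).
>   static bool isSafeToOutlineFrom(const llvm::Function &F) {
>     return F.hasExactDefinition();
>   }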
>
> -- Notes --
>
> * CSiBE(Code Size Benchmark):
>
> www.csibe.org
>
> * More detailed performance data:
>
> goo.gl/5k6wsP
>
> * Implementation:
>
> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp
>
>