[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Thu Jul 20 15:47:57 PDT 2017

 I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been
working on an interprocedural outlining (code folding) pass for code size
improvement at the IR level. We hit a couple of use cases that the current
code size solutions didn’t handle well enough. Outlining is one of the
avenues that seemed potentially beneficial.

-- Algorithmic Approach --

The general implementation can be explained in stages:

* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon
relaxed equivalency constraints. For most instructions this is simply: The
type, opcode, and operand types. The constraints are tightened for
instructions with special state that require more exact equivalence (e.g.
ShuffleVector requires a constant mask vector). Candidates are then found
by constructing a suffix array and lcp(longest common prefix) array.
Walking the lcp array gives us congruent chains of instructions within the
module.

* Candidate Analysis:

A verification step splits candidates that have different internal input
sequences or incompatible parent function attributes between occurrences.
An example of incompatible internal inputs sequences is:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within
the same outlined function.

During analysis candidates have their inputs and outputs computed along
with an estimated benefit from extraction. During input calculation a
constant folding step removes all inputs that are the same amongst all
occurrences.

* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that
picks occurrences starting from the most beneficial candidates.

* Candidate Outlining:

Non pruned candidates are then outlined. Outputs from a candidate are
returned via an output parameter, except for one output that is promoted to
a return value. During outlining the inputs into the candidate are
condensed by computing the equivalencies between the arguments at each
occurrence. An example of this is:

outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);

In the above, parameters 1 and 3 were found to be equivalent for all
occurrences, thus the amount of inputs was reduced to 2.

* Debug Info:

Debug information is preserved for the calls to functions which have been
outlined but all debug info from the original outlined portions is removed,
making them harder to debug.

* Profile Info:

If the pass is running at Os the outliner will only consider cold blocks,
whereas Oz considers all blocks that are not marked as hot.

* Location in Pipeline:

The pass is currently configured to run very late in the optimization
pipeline. It is intended to run at Oz but will also run at Os if there is
profile data available. The pass can optionally be run twice, once before
function simplification and then again at the default location. This run is
optional because you are gambling the potential benefits of redundancy
elimination vs the potential benefits from function simplification. This
can lead to large benefits or regressions depending on the benchmark
(though the benefits tend to outnumber the regressions). The performance
section contains data for both on a variety of benchmarks.

-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of
major factors:

- Desire to be target independent

- Most opportunities for congruency

The downside to having this type of transformation be at the IR level is it
means there will be less accuracy in the cost model -  we can somewhat
accurately model the cost per instruction but we can’t get information on
how a window of instructions may lower. This can cause regressions
depending on the platform/codebase, therefore to help alleviate this there
are several tunable parameters for the cost model.

-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about
benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang
Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest
improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.

* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction during early outlining is
that due to the amount of redundancy reduction early in the optimization
pipeline there is a reduction in the amount of instruction processing
during the rest of the compilation.

* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest
improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest
improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner,
without profile data, will extract from whatever region it deems
profitable. Extracting from the hot path can lead to a noticeable
performance regression on any platform, which can be somewhat avoided by
providing profile data during outlining.

-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are
comparable to non LTO.

* Input/Output Partitioning:

   -This identifies inputs/outputs that may be folded by splitting a
candidate. The benefit is minimal for the computations required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity
required to catch the desired cases.

-- Potential Improvements --

* Suffix&LCP array construction: The pass currently uses a very basic
implementation that could be improved. There are definitely faster
algorithms and some can construct both simultaneously. We will investigate
this as a potential benefit for compile time in the future.

* Os performance tuning: Under -Os the pass currently only runs on cold
blocks. Ideally we could expand this to be a little more lax on less
frequently executed blocks that aren’t cold.

* Candidate Selection: The algorithm currently focuses on the longest
common sequences. More checks could be added to see if shortening the
sequence length produces a larger benefit(e.g less inputs/outputs). This
would likely lead to an increase in compile time but should remain less
than the baseline.

* Non Exact Functions: The outliner currently ignores functions that do not
have an exact definition.

-- --

* CSiBE(Code Size Benchmark):

www.csibe.org

* More detailed performance data:

goo.gl/5k6wsP

* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170720/8733497e/attachment.html>