[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Mon Jul 24 05:38:16 PDT 2017

Hi River,

Thank you for information. Do you have your code in Phabricator?

As I understand your pass always runs after the Inliner. At the moment the Inliner is too aggressive at Os/Oz. I am trying to tune Inliner heuristics for Oz. It would be interesting to see if there are any code size improvements when the Inliner is disabled. This will demonstrate whether the Outliner could improve code size when duplicated code is not due to the Inliner. Another problem I have is that C++ programs get code size regressions due my changes in the inline heuristics. Have  you seen this problem?

Thanks,

Evgeny Astigeevich

________________________________
From: River Riddle <riddleriver at gmail.com>
Sent: Friday, July 21, 2017 6:01:56 PM
To: Evgeny Astigeevich
Cc: llvm-dev at lists.llvm.org; nd; Kristof Beyls; gaborb at inf.u-szeged.hu
Subject: Re: [llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Hi Evgeny,
 I know of the current machine outliner in LLVM. If you look in the "More detailed performance data"  in the end section it includes performance comparisons to the machine outliner.
 As for the algorithmic approach they are kind of similar.
Machine Outliner:
  - Builds a suffix tree based on identical equivalence between machine instrs.
  - Uses target specific cost model for benefit estimation.
  - Uses a greedy algorithm for candidate pruning.
  - Currently supports X86, AArch64 targets.
  - Requires NoRedZone on supported targets.

IR Outliner:
  - Builds a suffix array based upon relaxed equivalence between ir instrs.
  - Allows for inputs/outputs from an outlined function.
  - Complex verification of outlining candidates due to above two points(relaxed equivalency, inputs/outputs).
  - Tunable generic target independent cost model (estimates constant folded inputs, output condensing, etc.).
       - Note it uses the target info available (TargetTransformInfo)
  - Uses a greedy interval based algorithm for candidate pruning.
  - Supports profile data.
  - Supports emitting opt remarks.
  - Target independent, thus no required ABI constraints or hooks.
Though some parts of the approach are similar the implementation is very different. Also to note, the IR outliner can be expanded to also outline from multiple basic blocks. I've already prototyped the infrastructure for outlining regions, so its possible.

Thanks,
  River Riddle

On Fri, Jul 21, 2017 at 2:24 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com<mailto:Evgeny.Astigeevich at arm.com>> wrote:

Hi River,

Do you know LLVM has the MachineOutliner pass (lib/CodeGen/MachineOutliner.cpp)?

It would be interesting to compare your approach with it.

Thanks,

Evgeny Astigeevich

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces at lists.llvm.org>> on behalf of River Riddle via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Sent: Thursday, July 20, 2017 11:47:57 PM
To: llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

 I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve been working on an interprocedural outlining (code folding) pass for code size improvement at the IR level. We hit a couple of use cases that the current code size solutions didn’t handle well enough. Outlining is one of the avenues that seemed potentially beneficial.

-- Algorithmic Approach --

The general implementation can be explained in stages:

* Candidate Selection:

Each instruction in the module is mapped to a congruence class based upon relaxed equivalency constraints. For most instructions this is simply: The type, opcode, and operand types. The constraints are tightened for instructions with special state that require more exact equivalence (e.g. ShuffleVector requires a constant mask vector). Candidates are then found by constructing a suffix array and lcp(longest common prefix) array. Walking the lcp array gives us congruent chains of instructions within the module.

* Candidate Analysis:

A verification step splits candidates that have different internal input sequences or incompatible parent function attributes between occurrences. An example of incompatible internal inputs sequences is:

X = W + 6;   vs    X = W + 6;

Y = X + 4;            Y = W + 4;

The above two occurrences would need special control flow to exist within the same outlined function.

During analysis candidates have their inputs and outputs computed along with an estimated benefit from extraction. During input calculation a constant folding step removes all inputs that are the same amongst all occurrences.

* Candidate Pruning:

Overlapping candidates are pruned with a generic greedy algorithm that picks occurrences starting from the most beneficial candidates.

* Candidate Outlining:

Non pruned candidates are then outlined. Outputs from a candidate are returned via an output parameter, except for one output that is promoted to a return value. During outlining the inputs into the candidate are condensed by computing the equivalencies between the arguments at each occurrence. An example of this is:

outlinedFn(1,6,1);  ->  outlinedFn(1,6);

outlinedFn(2,4,2);  ->  outlinedFn(2,4);

In the above, parameters 1 and 3 were found to be equivalent for all occurrences, thus the amount of inputs was reduced to 2.

* Debug Info:

Debug information is preserved for the calls to functions which have been outlined but all debug info from the original outlined portions is removed, making them harder to debug.

* Profile Info:

If the pass is running at Os the outliner will only consider cold blocks, whereas Oz considers all blocks that are not marked as hot.

* Location in Pipeline:

The pass is currently configured to run very late in the optimization pipeline. It is intended to run at Oz but will also run at Os if there is profile data available. The pass can optionally be run twice, once before function simplification and then again at the default location. This run is optional because you are gambling the potential benefits of redundancy elimination vs the potential benefits from function simplification. This can lead to large benefits or regressions depending on the benchmark (though the benefits tend to outnumber the regressions). The performance section contains data for both on a variety of benchmarks.

-- Why at the IR level --

The decision to place the outliner at the IR level comes from a couple of major factors:

- Desire to be target independent

- Most opportunities for congruency

The downside to having this type of transformation be at the IR level is it means there will be less accuracy in the cost model -  we can somewhat accurately model the cost per instruction but we can’t get information on how a window of instructions may lower. This can cause regressions depending on the platform/codebase, therefore to help alleviate this there are several tunable parameters for the cost model.

-- Performance --

More results including clang, llvm-tblgen, and more specific numbers about benefits/regressions can be found in the notes section below.

* Size Reduction:

- Test Suite(X86_64):

   - Early+Late outlining provides a geomean of 10.5% reduction over clang Oz, with a largest improvement of ~67% and largest regression of ~7.5%.

   - Late outlining provides a geomean of 4.65% reduction, with a largest improvement of ~51% and largest regression of ~6.4%.

- Spec 2006(X86_64)

   - Early+Late outlining provides a geomean reduction of 2.08%.

   - Late outlining provides 2.09%.

- CSiBE(AArch64)

   - Early+Late outlining shows geomean reduction of around 3.5%.

   - Late outlining shows 3.1%.

* Compile Time:

Compile time was tested under test-suite with a multisample of 5.

- Early+Late outlining

   - Many improvements with > 40% compile time reduction.

   - Few regressions.

- Late outlining

   - Greatest improvement is ~7.8%

   - Greatest regression is ~4% with a difference of <0.02s

Our explanation for the compile time reduction during early outlining is that due to the amount of redundancy reduction early in the optimization pipeline there is a reduction in the amount of instruction processing during the rest of the compilation.

* Execution Time:

Ran with a multisample of 5.

- Test Suite:

   - Early+Late outlining has many regressions up to 97%. The greatest improvement was around 7.5%.

   - Late outlining also has several regressions up to 44% and a greatest improvement of around 5.3%.

- Spec:

   - Early+Late has a geomean regression of 3.5%.

   - Late outlining has a geomean regression of 1.25%.

The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.

-- Tested Improvements --

* MergeFunctions:

   - No noticeable benefit.

* LTO:

   - LTO doesn’t have a code size pipeline, but %reductions over LTO are comparable to non LTO.

* Input/Output Partitioning:

   -This identifies inputs/outputs that may be folded by splitting a candidate. The benefit is minimal for the computations required.

* Similar Candidate Merging:

   - The benefit to be gained is currently not worth the large complexity required to catch the desired cases.

-- Potential Improvements --

* Suffix&LCP array construction: The pass currently uses a very basic implementation that could be improved. There are definitely faster algorithms and some can construct both simultaneously. We will investigate this as a potential benefit for compile time in the future.

* Os performance tuning: Under -Os the pass currently only runs on cold blocks. Ideally we could expand this to be a little more lax on less frequently executed blocks that aren’t cold.

* Candidate Selection: The algorithm currently focuses on the longest common sequences. More checks could be added to see if shortening the sequence length produces a larger benefit(e.g less inputs/outputs). This would likely lead to an increase in compile time but should remain less than the baseline.

* Non Exact Functions: The outliner currently ignores functions that do not have an exact definition.

-- --

* CSiBE(Code Size Benchmark):

www.csibe.org<http://www.csibe.org/>

* More detailed performance data:

goo.gl/5k6wsP<http://goo.gl/5k6wsP>

* Implementation:

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/feb6370d/attachment-0001.html>