[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

River Riddle via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 24 21:25:45 PDT 2017


Hey Sean,
  The bit about the attributes is good to know. When LTO is enabled the
early run will still run per-TU but the late run will be shifted to work on
the full LTO bitcode. Also, I don't have specific numbers on how often
parameterization is utilized, but I can assure you that it's the majority of
the time. I'll look into transforming the data into a visual format since
it's in Google Docs anyway.
Thanks,
 River Riddle
On Mon, Jul 24, 2017 at 9:17 PM Sean Silva <chisophugis at gmail.com> wrote:

> On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve
>> been working on an interprocedural outlining (code folding) pass for code
>> size improvement at the IR level. We hit a couple of use cases that the
>> current code size solutions didn’t handle well enough. Outlining is one of
>> the avenues that seemed potentially beneficial.
>>
>> -- Algorithmic Approach --
>>
>> The general implementation can be explained in stages:
>>
>> * Candidate Selection:
>>
>> Each instruction in the module is mapped to a congruence class based upon
>> relaxed equivalency constraints. For most instructions this is simply the
>> type, opcode, and operand types. The constraints are tightened for
>> instructions with special state that require more exact equivalence (e.g.
>> ShuffleVector requires a constant mask vector). Candidates are then found
>> by constructing a suffix array and an LCP (longest common prefix) array.
>> Walking the LCP array gives us congruent chains of instructions within the
>> module.
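The selection scheme above can be sketched in a few lines. This is a toy illustration over hypothetical congruence-class IDs with a naive suffix-array construction, not the pass's actual code:

```python
def build_suffix_array(seq):
    # Naive O(n^2 log n) construction for illustration; real
    # implementations use O(n log n) or linear-time algorithms.
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def build_lcp(seq, sa):
    # lcp[i] = length of the longest common prefix of the suffixes
    # starting at sa[i-1] and sa[i].
    def common(a, b):
        n = 0
        while a + n < len(seq) and b + n < len(seq) and seq[a + n] == seq[b + n]:
            n += 1
        return n
    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))]

# Congruence-class IDs for a toy module: the chain [1, 2, 3] repeats.
ids = [1, 2, 3, 9, 1, 2, 3, 8]
sa = build_suffix_array(ids)
lcp = build_lcp(ids, sa)
# Any lcp[i] >= 2 marks two suffixes sharing a congruent chain of that
# length; walking the LCP array yields the candidate chains.
candidates = [(sa[i - 1], sa[i], lcp[i]) for i in range(1, len(sa)) if lcp[i] >= 2]
```

Here `candidates` recovers the two occurrences of the length-3 chain starting at positions 0 and 4, plus the shorter overlapping chains.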
>>
>> * Candidate Analysis:
>>
>> A verification step splits candidates that have different internal input
>> sequences or incompatible parent-function attributes between occurrences.
>> An example of incompatible internal input sequences is:
>>
>> X = W + 6;   vs    X = W + 6;
>>
>> Y = X + 4;            Y = W + 4;
>>
>> The above two occurrences would need special control flow to exist within
>> the same outlined function.
>>
>> During analysis, candidates have their inputs and outputs computed along
>> with an estimated benefit from extraction. During input calculation, a
>> constant-folding step removes all inputs that are the same amongst all
>> occurrences.
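As a toy sketch of that constant-folding step (hypothetical helper, not the actual pass code), any input position whose value agrees across every occurrence can be folded into the outlined body rather than becoming a parameter:

```python
def fold_constant_inputs(occurrence_inputs):
    # occurrence_inputs: one ordered input list per occurrence,
    # all of the same length.
    columns = list(zip(*occurrence_inputs))
    # Keep only positions where the occurrences disagree.
    keep = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    # Positions where they agree fold to their common value.
    folded = {i: columns[i][0] for i in range(len(columns)) if i not in keep}
    pruned = [[inputs[i] for i in keep] for inputs in occurrence_inputs]
    return pruned, folded

# Input 0 is always 'W' and can be folded; inputs 1 and 2 differ.
pruned, folded = fold_constant_inputs([['W', 6, 'A'], ['W', 4, 'B']])
```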
>>
>> * Candidate Pruning:
>>
>> Overlapping candidates are pruned with a generic greedy algorithm that
>> picks occurrences starting from the most beneficial candidates.
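A minimal sketch of such a greedy pruning scheme, under assumed data structures (candidates as benefit plus half-open instruction ranges), not the pass's actual code:

```python
def prune(candidates):
    # candidates: list of (benefit, [(start, end), ...]) tuples, where
    # each (start, end) is a half-open instruction range.
    claimed = []
    kept = []
    # Visit candidates from most to least beneficial.
    for benefit, occurrences in sorted(candidates, key=lambda c: -c[0]):
        # Keep only occurrences that don't overlap an already claimed range.
        free = [r for r in occurrences
                if all(r[1] <= s or r[0] >= e for s, e in claimed)]
        # A candidate only pays for itself with at least two occurrences.
        if len(free) >= 2:
            claimed.extend(free)
            kept.append((benefit, free))
    return kept

kept = prune([(10, [(0, 5), (20, 25)]),
              (4,  [(3, 8), (30, 35), (40, 45)])])
```

The lower-benefit candidate loses its occurrence at (3, 8) to the overlapping, more beneficial one, but survives with its two remaining occurrences.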
>>
>> * Candidate Outlining:
>>
>> Non-pruned candidates are then outlined. Outputs from a candidate are
>> each returned via an output parameter, except for one output that is
>> promoted to a return value. During outlining, the inputs into the candidate
>> are condensed by computing the equivalencies between the arguments at each
>> occurrence. An example of this is:
>>
>> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>>
>> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>>
>> In the above, parameters 1 and 3 were found to be equivalent for all
>> occurrences, thus the number of inputs was reduced to 2.
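The condensing step above can be sketched as follows (hypothetical helper, a toy illustration rather than the actual implementation): parameter positions whose argument columns coincide at every call site collapse into one parameter.

```python
def condense_arguments(call_args):
    # call_args: one argument list per occurrence (call site).
    columns = list(zip(*call_args))
    seen = {}     # argument column -> index of surviving parameter
    keep = []     # surviving original parameter positions
    mapping = []  # old parameter index -> new parameter index
    for i, col in enumerate(columns):
        if col not in seen:
            seen[col] = len(keep)
            keep.append(i)
        mapping.append(seen[col])
    condensed = [[args[i] for i in keep] for args in call_args]
    return condensed, mapping

# outlinedFn(1,6,1) / outlinedFn(2,4,2): parameters 1 and 3 carry the
# same column of values, so they collapse into a single parameter.
condensed, mapping = condense_arguments([[1, 6, 1], [2, 4, 2]])
```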
>>
>> * Debug Info:
>>
>> Debug information is preserved for the calls to functions which have been
>> outlined, but all debug info from the original outlined portions is
>> removed, making them harder to debug.
>>
>> * Profile Info:
>>
>> If the pass is running at Os, the outliner will only consider cold blocks,
>> whereas at Oz it considers all blocks that are not marked as hot.
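A minimal sketch of that filtering rule, assuming a simple three-way hotness classification (hypothetical helper, not the pass's actual code):

```python
def should_consider_block(opt_level, block_hotness):
    # block_hotness: 'hot', 'cold', or 'unknown' (no profile data).
    if opt_level == 'Os':
        # At Os, only provably cold blocks are candidates.
        return block_hotness == 'cold'
    if opt_level == 'Oz':
        # At Oz, everything not marked hot is fair game.
        return block_hotness != 'hot'
    return False
```

Note that without profile data every block is 'unknown', so at Oz the outliner effectively considers the whole function while at Os it considers nothing.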
>>
>> * Location in Pipeline:
>>
>> The pass is currently configured to run very late in the optimization
>> pipeline. It is intended to run at Oz but will also run at Os if there is
>> profile data available. The pass can optionally be run twice, once before
>> function simplification and then again at the default location. This run is
>> optional because it gambles the potential benefits of redundancy
>> elimination against the potential benefits of function simplification. This
>> can lead to large benefits or regressions depending on the benchmark
>> (though the benefits tend to outnumber the regressions). The performance
>> section contains data for both on a variety of benchmarks.
>>
>> -- Why at the IR level --
>>
>> The decision to place the outliner at the IR level comes from a couple of
>> major factors:
>>
>> - Desire to be target independent
>>
>> - Most opportunities for congruency
>>
>> The downside to having this type of transformation at the IR level is
>> that the cost model is less accurate - we can somewhat accurately model the
>> cost per instruction, but we can’t get information on how a window of
>> instructions may lower. This can cause regressions depending on the
>> platform/codebase; to help alleviate this, there are several tunable
>> parameters for the cost model.
>>
>> -- Performance --
>>
>> More results including clang, llvm-tblgen, and more specific numbers
>> about benefits/regressions can be found in the notes section below.
>>
>> * Size Reduction:
>>
>> - Test Suite (X86_64):
>>
>>    - Early+Late outlining provides a geomean of 10.5% reduction over
>> clang Oz, with a largest improvement of ~67% and largest regression of
>> ~7.5%.
>>
>>    - Late outlining provides a geomean of 4.65% reduction, with a largest
>> improvement of ~51% and largest regression of ~6.4%.
>>
>> - SPEC 2006 (X86_64)
>>
>>    - Early+Late outlining provides a geomean reduction of 2.08%.
>>
>>    - Late outlining provides 2.09%.
>>
>> - CSiBE (AArch64)
>>
>>    - Early+Late outlining shows geomean reduction of around 3.5%.
>>
>>    - Late outlining shows 3.1%.
>>
>
>
> It would be good to visualize these results. Maybe a bar chart like
> https://goo.gl/qN2HqA from http://blog.llvm.org/
> 2016/06/thinlto-scalable-and-incremental-lto.html for SPEC?
>
>
>>
>> * Compile Time:
>>
>> Compile time was tested under test-suite with a multisample of 5.
>>
>> - Early+Late outlining
>>
>>    - Many improvements with > 40% compile time reduction.
>>
>>    - Few regressions.
>>
>> - Late outlining
>>
>>    - Greatest improvement is ~7.8%
>>
>>    - Greatest regression is ~4% with a difference of <0.02s
>>
>> Our explanation for the compile-time reduction during early outlining is
>> that removing redundancy early in the optimization pipeline reduces the
>> amount of instruction processing during the rest of the compilation.
>>
>> * Execution Time:
>>
>> Ran with a multisample of 5.
>>
>> - Test Suite:
>>
>>    - Early+Late outlining has many regressions up to 97%. The greatest
>> improvement was around 7.5%.
>>
>>    - Late outlining also has several regressions up to 44% and a greatest
>> improvement of around 5.3%.
>>
>> - SPEC:
>>
>>    - Early+Late has a geomean regression of 3.5%.
>>
>>    - Late outlining has a geomean regression of 1.25%.
>>
>> The execution time results are to be expected given that the outliner,
>> without profile data, will extract from whatever region it deems
>> profitable. Extracting from the hot path can lead to a noticeable
>> performance regression on any platform, which can be somewhat avoided by
>> providing profile data during outlining.
>>
>> -- Tested Improvements --
>>
>> * MergeFunctions:
>>
>>    - No noticeable benefit.
>>
>> * LTO:
>>
>>    - LTO doesn’t have a code size pipeline, but percent reductions over
>> LTO are comparable to non-LTO.
>>
>
> -Os/-Oz are communicated through the optsize and minsize attributes. There
> isn't a specific code size pipeline per se (I think this is true for per-TU
> compilation as well, though I would have to check).
>
> Also, can you clarify what you mean by "LTO"? I assume this means that the
> outliner did not run during per-TU compilation but did run on the combined
> FullLTO bitcode, but want to check to be sure.
>
> -- Sean Silva
>
>
>> * Input/Output Partitioning:
>>
>>    - This identifies inputs/outputs that may be folded by splitting a
>> candidate. The benefit is minimal for the computations required.
>>
>> * Similar Candidate Merging:
>>
>>    - The benefit to be gained is currently not worth the large complexity
>> required to catch the desired cases.
>>
>> -- Potential Improvements --
>>
>> * Suffix & LCP array construction: The pass currently uses a very basic
>> implementation that could be improved. There are definitely faster
>> algorithms, and some can construct both arrays simultaneously. We will
>> investigate this as a potential compile-time benefit in the future.
>>
>> * Os performance tuning: Under -Os the pass currently only runs on cold
>> blocks. Ideally we could expand this to be a little more lax on less
>> frequently executed blocks that aren’t cold.
>>
>> * Candidate Selection: The algorithm currently focuses on the longest
>> common sequences. More checks could be added to see if shortening the
>> sequence length produces a larger benefit (e.g. fewer inputs/outputs). This
>> would likely lead to an increase in compile time but should remain less
>> than the baseline.
>>
>> * Non Exact Functions: The outliner currently ignores functions that do
>> not have an exact definition.
>>
>> -- Notes --
>>
>> * CSiBE(Code Size Benchmark):
>>
>> www.csibe.org
>>
>> * More detailed performance data:
>>
>> goo.gl/5k6wsP
>>
>> * Implementation:
>>
>> https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/
>> CodeSizeOutliner.cpp
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>