[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.

Mon Jul 24 21:43:24 PDT 2017

On Mon, Jul 24, 2017 at 9:25 PM, River Riddle <riddleriver at gmail.com> wrote:

> Hey Sean,
>   The bit about the attributes is good to know. When LTO is enabled the
> early run will still run per-TU but the late run will be shifted to work on
> the full LTO bitcode. Also I don't have specific numbers on how often
> parameterization is utilized but I can assure you that it's a majority of
> the time.
>

Interesting. It would be good to have some specific data on this (is
"majority" actually 65% or 99%? How does it vary across benchmarks?),
because this is something that can't be done in the current post-RA machine
outliner (even in principle).

A pre-RA MIR-level outliner would have similar issues to your LLVM IR
outliner w.r.t. estimating instruction sizes.
For example, pre-RA you might see a virtual register but it might actually
end up spilled so there will be spill/fill code size that is unaccounted
for (in fact, I wouldn't be surprised if this inaccuracy of the cost model
was actually greater than the inaccuracy from assuming a fixed cost per
instruction either at pre-RA MIR level or LLVM IR level; I don't have data
supporting one way or the other though).
Also, outlining pre-RA inherently constrains the register allocator, so
there will be indirect effects on code size. It's not clear that doing
things at MIR level pre-RA will really allow avoiding this any more than
marking LLVM IR level outlined functions using an appropriate calling
convention (and maybe a target hook or two).

-- Sean Silva

> I'll look into transforming the data into a visual format since its in
> google docs anyways.
> Thanks,
>  River Riddle
>
> On Mon, Jul 24, 2017 at 9:17 PM Sean Silva <chisophugis at gmail.com> wrote:
>
>> On Thu, Jul 20, 2017 at 3:47 PM, River Riddle via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>  I’m River and I’m a compiler engineer at PlayStation. Recently, I’ve
>>> been working on an interprocedural outlining (code folding) pass for code
>>> size improvement at the IR level. We hit a couple of use cases that the
>>> current code size solutions didn’t handle well enough. Outlining is one of
>>> the avenues that seemed potentially beneficial.
>>>
>>> -- Algorithmic Approach --
>>>
>>> The general implementation can be explained in stages:
>>>
>>> * Candidate Selection:
>>>
>>> Each instruction in the module is mapped to a congruence class based
>>> upon relaxed equivalency constraints. For most instructions this is simply:
>>> The type, opcode, and operand types. The constraints are tightened for
>>> instructions with special state that require more exact equivalence (e.g.
>>> ShuffleVector requires a constant mask vector). Candidates are then found
>>> by constructing a suffix array and lcp(longest common prefix) array.
>>> Walking the lcp array gives us congruent chains of instructions within the
>>> module.
>>>
>>> * Candidate Analysis:
>>>
>>> A verification step splits candidates that have different internal input
>>> sequences or incompatible parent function attributes between occurrences.
>>> An example of incompatible internal inputs sequences is:
>>>
>>> X = W + 6;   vs    X = W + 6;
>>>
>>> Y = X + 4;            Y = W + 4;
>>>
>>> The above two occurrences would need special control flow to exist
>>> within the same outlined function.
>>>
>>> During analysis candidates have their inputs and outputs computed along
>>> with an estimated benefit from extraction. During input calculation a
>>> constant folding step removes all inputs that are the same amongst all
>>> occurrences.
>>>
>>> * Candidate Pruning:
>>>
>>> Overlapping candidates are pruned with a generic greedy algorithm that
>>> picks occurrences starting from the most beneficial candidates.
>>>
>>> * Candidate Outlining:
>>>
>>> Non pruned candidates are then outlined. Outputs from a candidate are
>>> returned via an output parameter, except for one output that is promoted to
>>> a return value. During outlining the inputs into the candidate are
>>> condensed by computing the equivalencies between the arguments at each
>>> occurrence. An example of this is:
>>>
>>> outlinedFn(1,6,1);  ->  outlinedFn(1,6);
>>>
>>> outlinedFn(2,4,2);  ->  outlinedFn(2,4);
>>>
>>> In the above, parameters 1 and 3 were found to be equivalent for all
>>> occurrences, thus the amount of inputs was reduced to 2.
>>>
>>> * Debug Info:
>>>
>>> Debug information is preserved for the calls to functions which have
>>> been outlined but all debug info from the original outlined portions is
>>> removed, making them harder to debug.
>>>
>>> * Profile Info:
>>>
>>> If the pass is running at Os the outliner will only consider cold
>>> blocks, whereas Oz considers all blocks that are not marked as hot.
>>>
>>> * Location in Pipeline:
>>>
>>> The pass is currently configured to run very late in the optimization
>>> pipeline. It is intended to run at Oz but will also run at Os if there is
>>> profile data available. The pass can optionally be run twice, once before
>>> function simplification and then again at the default location. This run is
>>> optional because you are gambling the potential benefits of redundancy
>>> elimination vs the potential benefits from function simplification. This
>>> can lead to large benefits or regressions depending on the benchmark
>>> (though the benefits tend to outnumber the regressions). The performance
>>> section contains data for both on a variety of benchmarks.
>>>
>>> -- Why at the IR level --
>>>
>>> The decision to place the outliner at the IR level comes from a couple
>>> of major factors:
>>>
>>> - Desire to be target independent
>>>
>>> - Most opportunities for congruency
>>>
>>> The downside to having this type of transformation be at the IR level is
>>> it means there will be less accuracy in the cost model -  we can somewhat
>>> accurately model the cost per instruction but we can’t get information on
>>> how a window of instructions may lower. This can cause regressions
>>> depending on the platform/codebase, therefore to help alleviate this there
>>> are several tunable parameters for the cost model.
>>>
>>> -- Performance --
>>>
>>> More results including clang, llvm-tblgen, and more specific numbers
>>> about benefits/regressions can be found in the notes section below.
>>>
>>> * Size Reduction:
>>>
>>> - Test Suite(X86_64):
>>>
>>>    - Early+Late outlining provides a geomean of 10.5% reduction over
>>> clang Oz, with a largest improvement of ~67% and largest regression of
>>> ~7.5%.
>>>
>>>    - Late outlining provides a geomean of 4.65% reduction, with a
>>> largest improvement of ~51% and largest regression of ~6.4%.
>>>
>>> - Spec 2006(X86_64)
>>>
>>>    - Early+Late outlining provides a geomean reduction of 2.08%.
>>>
>>>    - Late outlining provides 2.09%.
>>>
>>> - CSiBE(AArch64)
>>>
>>>    - Early+Late outlining shows geomean reduction of around 3.5%.
>>>
>>>    - Late outlining shows 3.1%.
>>>
>>
>>
>> It would be good to visualize these results. Maybe a bar chart like
>> https://goo.gl/qN2HqA from http://blog.llvm.org/2016
>> /06/thinlto-scalable-and-incremental-lto.html for SPEC?
>>
>>
>>>
>>> * Compile Time:
>>>
>>> Compile time was tested under test-suite with a multisample of 5.
>>>
>>> - Early+Late outlining
>>>
>>>    - Many improvements with > 40% compile time reduction.
>>>
>>>    - Few regressions.
>>>
>>> - Late outlining
>>>
>>>    - Greatest improvement is ~7.8%
>>>
>>>    - Greatest regression is ~4% with a difference of <0.02s
>>>
>>> Our explanation for the compile time reduction during early outlining is
>>> that due to the amount of redundancy reduction early in the optimization
>>> pipeline there is a reduction in the amount of instruction processing
>>> during the rest of the compilation.
>>>
>>> * Execution Time:
>>>
>>> Ran with a multisample of 5.
>>>
>>> - Test Suite:
>>>
>>>    - Early+Late outlining has many regressions up to 97%. The greatest
>>> improvement was around 7.5%.
>>>
>>>    - Late outlining also has several regressions up to 44% and a
>>> greatest improvement of around 5.3%.
>>>
>>> - Spec:
>>>
>>>    - Early+Late has a geomean regression of 3.5%.
>>>
>>>    - Late outlining has a geomean regression of 1.25%.
>>>
>>> The execution time results are to be expected given that the outliner,
>>> without profile data, will extract from whatever region it deems
>>> profitable. Extracting from the hot path can lead to a noticeable
>>> performance regression on any platform, which can be somewhat avoided by
>>> providing profile data during outlining.
>>>
>>> -- Tested Improvements --
>>>
>>> * MergeFunctions:
>>>
>>>    - No noticeable benefit.
>>>
>>> * LTO:
>>>
>>>    - LTO doesn’t have a code size pipeline, but %reductions over LTO are
>>> comparable to non LTO.
>>>
>>
>> -Os/-Oz are communicated through the optsize and minsize attributes.
>> There isn't a specific code size pipeline per se (I think this is true for
>> per-TU compilation as well, though I would have to check).
>>
>> Also, can you clarify what you mean by "LTO"? I assume this means that
>> the outliner did not run during per-TU compilation but did run on the
>> combined FullLTO bitcode, but want to check to be sure.
>>
>> -- Sean Silva
>>
>>
>>> * Input/Output Partitioning:
>>>
>>>    -This identifies inputs/outputs that may be folded by splitting a
>>> candidate. The benefit is minimal for the computations required.
>>>
>>> * Similar Candidate Merging:
>>>
>>>    - The benefit to be gained is currently not worth the large
>>> complexity required to catch the desired cases.
>>>
>>> -- Potential Improvements --
>>>
>>> * Suffix&LCP array construction: The pass currently uses a very basic
>>> implementation that could be improved. There are definitely faster
>>> algorithms and some can construct both simultaneously. We will investigate
>>> this as a potential benefit for compile time in the future.
>>>
>>> * Os performance tuning: Under -Os the pass currently only runs on cold
>>> blocks. Ideally we could expand this to be a little more lax on less
>>> frequently executed blocks that aren’t cold.
>>>
>>> * Candidate Selection: The algorithm currently focuses on the longest
>>> common sequences. More checks could be added to see if shortening the
>>> sequence length produces a larger benefit(e.g less inputs/outputs). This
>>> would likely lead to an increase in compile time but should remain less
>>> than the baseline.
>>>
>>> * Non Exact Functions: The outliner currently ignores functions that do
>>> not have an exact definition.
>>>
>>> -- --
>>>
>>> * CSiBE(Code Size Benchmark):
>>>
>>> www.csibe.org
>>>
>>> * More detailed performance data:
>>>
>>> goo.gl/5k6wsP
>>>
>>> * Implementation:
>>>
>>> https://github.com/River707/llvm/blob/outliner/lib/Transform
>>> s/IPO/CodeSizeOutliner.cpp
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170724/14a0d836/attachment-0001.html>