[llvm-dev] [RFC] PT.2 Add IR level interprocedural outliner for code size.

River Riddle via llvm-dev <llvm-dev at lists.llvm.org>
Tue Sep 5 16:16:38 PDT 2017


Hey Everybody,

A little while ago I posted an RFC
(http://lists.llvm.org/pipermail/llvm-dev/2017-July/115666.html) proposing a
new outliner at the IR level. That proposal raised some confusion and many
questions, which I'd like to address here:

Note about nomenclature:

   Candidate: A repeated sequence of instructions within a module.

   Occurrence: One instance of a candidate sequence.

-- Accompanying Graph Data --

Graph data is referenced in the sections below; any reference to
Graph[*Number*] refers to the correspondingly numbered graph in the following
document:

https://goo.gl/QDiVHU

---- Performance ----

I have tested the IR outliner and the current Machine Outliner on a wide
variety of benchmarks. The results include the total % reduction in both
geomean and total size, as well as individual results for each test in each
benchmark.

The configurations tested are:

·       Early+Late IR outlining

·       Late IR outlining

·       Machine outlining

·       Early+Late+Machine outlining

·       Late+Machine outlining

NOTE: For fair comparison with the Machine Outliner, all IR outliner runs
also use -mno-red-zone and allow outlining from linkonce_odr/weak_odr
functions.

The code size benchmarking results provided are:

* LLVM Test Suite
   - X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

* Spec 2006
   - X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*

* Clang
   - X86_64 (Mac OS)

* llvm-tblgen
   - X86_64 (Mac OS)

* CSiBE
   - AArch64

* The machine outliner currently only supports X86_64 and AArch64.

Full Code Size Results:

https://goo.gl/ZBjHCG

---- Algorithmic Differences with the Machine Outliner ----

          There was a lot of confusion about how exactly the algorithm I am
proposing differs from what is available in the Machine Outliner. The two
outliners are similar in their use of a string matching algorithm and
candidate pruning: the first step of my algorithm is essentially the same
common substring / pruning step that the post-RA Machine Outliner uses, but
with a specially chosen congruence relation. I'd like to delve into the
differences between the two:

Congruence Detection:

-        Machine Outliner

  The machine outliner has the advantage of having this problem already
taken care of by the register allocator: it simply checks whether two machine
instructions are identical.

-        IR Outliner

  In the IR outliner we work on semantic equivalence, i.e. we care that the
operations being performed are equivalent, not that the values are. This
creates a need to verify exact equivalence in the cases that require it,
e.g. ShuffleVector's shuffle mask, not taking the address of inline asm,
etc.

A quick example of semantic equivalence:

%1 = add i32 1, 2

%2 = add i32 2, 3

These two instructions are not identical because the values of the operands
are not identical. They are, however, semantically equivalent because they
both perform the add operation.

This can be seen by simply removing the operand values used in the
calculations:

%1 = add i32 ?, ?

%2 = add i32 ?, ?

Occurrence Verification:

-        Machine Outliner

  At the post-RA level you don't need any special verification for candidate
occurrences, because you don't have to deal with the concept of inputs.

-        IR Outliner

 At the IR/pre-RA level we need to do complex verification to make sure that
the occurrences within a candidate have the same internal inputs. If two
occurrences have different internal inputs, then we need some form of control
flow to maintain correctness. By internal inputs I mean the operands of
instructions that come from an instruction within the occurrence, e.g.

%2 = ...

; Start of outlined occurrence.

%3 = ...

%4 = sub i32 %3, %2 ; %3 is an internal input, %2 is external.

If there is any confusion about why we need control flow for internal
inputs, the sketch below illustrates the issue; I'm more than happy to
provide more detailed explanations.
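As a hypothetical sketch (the value names and the select-based merging here
are mine, purely for illustration): suppose two occurrences of the sequence
[mul, add] differ in where the add's first operand comes from:

; Occurrence 1: the add consumes the internal mul result.
%a1 = mul i32 %x, %y
%b1 = add i32 %a1, %z

; Occurrence 2: the add consumes an external value instead.
%a2 = mul i32 %x, %y
%b2 = add i32 %w, %z

A single outlined body would have to pick the correct operand per call site,
e.g. with an extra flag parameter and a select (or a branch):

define internal i32 @outlined_sketch(i32 %x, i32 %y, i32 %z, i32 %other,
                                     i1 %use_internal) {
  %a = mul i32 %x, %y
  ; Choose between the internal value and the external parameter.
  %op = select i1 %use_internal, i32 %a, i32 %other
  %b = add i32 %op, %z
  ret i32 %b
}

That extra control flow has a cost that the outliner has to account for,
which is why mismatched internal inputs require verification.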

Aside from internal inputs we also need to verify that the functions we are
outlining from have compatible attributes.

Cost Modeling:

-        Machine Outliner

At the MIR level the cost information is extremely accurate, so cost
modeling effectively amounts to counting the number of instructions and
adding some frame/setup cost.

-        IR Outliner

At the IR level we are working with estimates for the costs of certain
instructions. We try to match the IR cost to the MIR cost as closely as
possible, and in practice we get fairly close (Graph[1]).

  Taking this a step further, we need to estimate the cost/setup of having x
parameters and y outputs, as well as the register pressure from both the call
and the potentially outlined function.
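As a rough sketch of the kind of estimate involved (the numbers here are
purely illustrative, not the actual cost model):

  benefit ~= occurrences * sequence_cost      ; code removed from each site
           - occurrences * call_cost          ; calls inserted in its place
           - (sequence_cost + frame_cost
              + parameter/output setup cost)  ; the new outlined function

For example, a 10-unit sequence occurring 4 times, with 2-unit calls and a
14-unit outlined function, would save roughly 4*10 - 4*2 - 14 = 18 units.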

Parameterization Optimizations:

-        Machine Outliner

The Machine outliner uses exact equivalence, which does not allow for any
form of parameterization.

-        IR Outliner

          Being at the IR level requires us to tackle parameterization,
which then brings several optimizations to help lower the cost of
parameterizing a sequence.

* Constant Folding

          The IR outliner will identify constant inputs and fold them.
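A hypothetical illustration: if every occurrence passes the same constant
for a given input, that input never needs to become a parameter; the
constant can be folded directly into the outlined body:

; Every occurrence adds the constant 42, so instead of
;   define internal i32 @outlined_fold(i32 %a, i32 %b)
; we can emit:
define internal i32 @outlined_fold(i32 %a) {
  %r = add i32 %a, 42 ; the constant input is folded, saving a parameter
  ret i32 %r
}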

* Congruent Input Condensing

          The outliner identifies the congruent sets of parameters for a
function. Example:

          void fn(int, int);  ->  void fn(int);

          fn(1, 1);           ->  fn(1);

          fn(%1, %1);         ->  fn(%1);

Parameters 1 and 2 were found to be the same at each call site of the
function, so we condensed the congruent parameters.

* Input Partitioning

          The outliner partitions candidates that have a parameter that can
be constant folded. Example:

          fn(1);

          fn(1);

          fn(%1);

Occurrences 1 and 2 in the above candidate can have parameter 1 folded. We
create a new candidate containing just occurrences 1 and 2 as it may be
more profitable than the original candidate.

* Constant int condensing

          The outliner identifies constant int parameters and checks to see
if, for each occurrence, they are an equal distance from other constant int
parameters. If so, it removes all but one of the parameters and represents
the others as an add from the base. Example:

          void fn(int a, int b);

          fn(1, 2);

          fn(3, 4);

In the above, parameters 1 and 2 are always a distance of 1 apart. We can
redefine our function as:

          void fn(int a) {

           int b = a + 1;

           …

          }

Register Usage:

-        Machine Outliner

The Machine Outliner works post-RA with exact equivalence, so the most it
computes is whether it needs to save the link register on AArch64.

-        IR Outliner

The IR outliner needs to compute register usage for the new outlined
function as well as the usage after generating a function call with x
parameters and y outputs at each program point z.

Outlining:

-        Machine Outliner

          At the MIR level we clone the outlined instructions into a new
function, create some prologue/epilogue for the function, and then generate
a call.

-        IR Outliner

          At the IR level we also have to handle the parameters/outputs of
the candidate. Here we need to merge all of the metadata of the outlined
instructions/outlined functions. We also need to identify congruent sets of
parameters between call sites and then reduce the number of parameters
needed for the call.
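As a minimal sketch of the IR-level result (assuming a hypothetical
candidate with one external input and one value used after the sequence;
multiple outputs would have to be returned through memory or a struct
instead):

define internal i32 @outlined_sketch2(i32 %in) {
  %sum = add i32 %in, 1
  %sq = mul i32 %sum, %sum
  ret i32 %sq ; the single output is returned to the caller
}

; Each occurrence is then replaced with:
;   %sq = call i32 @outlined_sketch2(i32 %in)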

Suffix Array vs Suffix Tree+LCP:

          The two structures should compute the same result, but there is a
non-obvious benefit to the suffix array: with the suffix array approach we
identify candidates that share common occurrences, albeit with different
lengths. This is very useful for complex verification/analysis, e.g. at the
IR or pre-RA level, because it allows us to cache the work done when
calculating inputs or verifying the internal inputs of occurrences. This
won't be an issue if/when we switch to a common interface for candidate
selection, though.
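A small illustration of shared occurrences (a hypothetical instruction
string, with letters standing for congruence classes):

Instruction string:   A B C D A B C E A B C D
Candidate [A B C]     occurs at positions 0, 4, 8
Candidate [A B C D]   occurs at positions 0, 8

The occurrences of [A B C D] are a subset of those of [A B C], just extended
by one instruction, so the input calculation and internal input verification
done for the shorter candidate can be reused for the longer one.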

---- A replacement for the Machine Outliner? Not exactly ----

The IR outliner was never intended as a replacement for the machine outliner
and the two can coexist. The outliners tend to catch very different cases:
the machine outliner tends to favor very small candidate lengths. Using a
build of llvm-tblgen, the machine outliner gets ~52% of its benefit from
outlined functions of 2-3 instructions. The IR outliner tends to favor large
candidate lengths (2-20+), often composed of function calls; 52% of the
benefit for the IR outliner in the llvm-tblgen example is found in outlined
functions with final lengths up to 17. Data for example runs of both can be
found in the graph data file and is summarized in Graph[2].

 Included in the performance data are metrics showing the performance of
using both the IR outliner and the machine outliner together. The data
indicates that using both can achieve a 2% or greater reduction in both
geomean and total size.

---- Pros/Cons of IR ----

The current algorithm is implemented at the IR level, but there are
trade-offs to placing this transformation anywhere in the pipeline
(IR/pre-RA/post-RA).

-- Less Precise Cost Modeling:

Being at the IR level creates a need to estimate the size cost of any given
instruction.

- How much does this imprecision affect the benefit estimation?

- Included in the data, Graph[1] shows the difference between our estimated
function size and the actual size in the binary. We get very close and tend
to be on the conservative side.

- Estimation causes the IR outliner to be conservative, which means that we
lose out on potential benefit by overestimating cost.

-- Higher Level of Abstraction:

- The outliners are essentially string matching algorithms. Being at a
higher level of abstraction naturally gives more opportunities for
equivalence. As an example, call instructions are handled naturally at the
IR level.

- Will a pre-RA outliner be able to have the same relaxation in congruence
matching? E.g. will it be able to match tail and non-tail function calls?

- Being at the IR level means that we lose out on some instruction lowering
idioms, e.g. constant expressions, bitwise rotation ([shl, lshr, or] ->
[rot]; see the sketch below), etc.
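For instance, a rotate-left by 3 of an i32 is three IR instructions:

%hi = shl i32 %x, 3
%lo = lshr i32 %x, 29
%rot = or i32 %hi, %lo ; many targets lower this idiom to a single rotate

The IR outliner sees and costs three instructions here, while a
machine-level outliner would see whatever the target lowered them to.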

- This is evident in the results for the LLVM Test Suite on AArch64, in
which the machine outliner outperforms the IR outliner due in part to the
large amount of global accesses in the tests.

-- Maintainability:

- The IR level in general is much more maintainable.

- We don’t have to be as conservative about certain ABI characteristics.
This allows the IR outliner to work without any extra work (special options)
from the user. For example, the machine outliner requires ‘noredzone’ but
the IR outliner does not.

-- Pipeline Flexibility:

- As shown in the performance data, we can get up to 2x the benefit by
working pre-function-simplification, though working pre-simplification means
the outliner must gamble between the benefits of outlining vs.
simplification.

-- Loss of control:

- The machine level can have more control over the outlining process. We
could have optimized parameterization, alignment handling, etc.

---- Adapting the algorithm to pre-RA IR ----

The analysis portion of the IR outliner is already mostly IR-agnostic. It
works on indices into the congruency vector for instructions and their
inputs/outputs. This means that a pre-RA outliner would only have to define
the MIR-specific portions: congruence detection, cost analysis,
parameter/output optimizations, and the outlining of beneficial candidates.

-- Implementation --

https://github.com/River707/llvm/blob/outliner/lib/Transforms/IPO/CodeSizeOutliner.cpp

All feedback/comments/discussion welcome and appreciated!

Thanks,

 River Riddle