[llvm-dev] [RFC] PT.2 Add IR level interprocedural outliner for code size.

Thu Sep 21 20:02:24 PDT 2017

Hey Gerolf,

On Thu, Sep 21, 2017 at 7:10 PM, Gerolf Hoflehner <ghoflehner at apple.com>
wrote:

> In general I would love to see an outliner at the IR level also. But
> rather than a comparison vs. the machine outliner I would like to learn
> more about how the core data structures between the outliners will be
> shared.
>

The only structure that needs to be shared is a struct defining what an
outlining candidate looks like.

> In particular for matching/pruning it seems to be a reasonable approach.
>

When I get around to posting the real patches there are two utility
functions that allow for finding candidates and pruning a candidate list.
The only thing necessary to use these utilities is a vector containing the
unsigned value number of the instructions. All of the utilities are IR
agnostic.

> A few more remarks/questions are below also.
>
> Thanks
> Gerolf
>
>
> On Sep 5, 2017, at 4:16 PM, River Riddle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hey Everybody,
> A little while ago I posted an RFC(http://lists.llvm.org/pipe
> rmail/llvm-dev/2017-July/115666.html) with the proposition of adding a
> new outliner at the IR level. There was some confusion and many questions
> regarding the proposal which I’d like to address here:
>
> Note about nomenclature:
>    Candidate: A repeated sequence of instructions within a module.
>    Occurrence: One instance of a candidate sequence.
>
> -- Accompanied Graph Data --
>
> Graph data is referenced in the sections below, any reference to
> Graph[*Number*] is referencing the numbered graph in the following document:
>
> https://goo.gl/QDiVHU
>
> ---- Performance ----
>
> I have tested the IR outliner and current Machine outliner on a wide
> variety of benchmarks. The results include total % reduction of both
> geomean and total size. It also includes individual results for each test
> in each respective benchmark.
>
> The configurations tested are:
> ·       Early+Late IR outlining
> ·       Late IR outlining
> ·       Machine outlining
> ·       Early+Late+Machine outlining
> ·       Late+Machine outlining
>
> NOTE: For fairness in comparisons with the Machine Outliner, all IR
> outliner runs also include (-mno-red-zone, outlining from
> linkonce_odr/weak_odr functions).
>
> The code size benchmarking results provided are:
> * LLVM Test Suite
>
>    - X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*
>
> * Spec 2006
>
>    - X86_64, X86*, AArch64, Arm1176jzf-s*, Arm1176jzf-s-thumb*
>
> * Clang
>
>    - X86_64(Mac OS)
>
> * llvm-tblgen
>
>    - X86_64(Mac OS)
>
> * CSiBE
>
>    - AArch64
>
> * The machine outliner currently only supports X86_64 and AArch64.
>
> Full Code Size Results:
> https://goo.gl/ZBjHCG
>
> --- Algorithmic differences with the Machine Outliner ----
>
>           There was a lot of confusion on how exactly the algorithm I am
> proposing differs from what is available in the Machine Outliner. The
> similarities of the two outliners lie in the usage of a string matching
> algorithm and candidate pruning. The first step in the algorithm is to
> basically do the same common substring / pruning algorithm the post-RA MO
> uses but with a specially chosen congruence relation. I’d like to delve
> into the differences between the two:
>
> Congruence Detection:
>
> -        Machine Outliner
>   The machine outliner has the advantage of having this problem already
> taken care of by the register allocator, it simply checks to see if the two
> machine instructions are identical.
>
> -        IR Outliner
>   In the IR outliner we work on semantic equivalence, i.e. we care the
> operations being performed are equivalent and not the values. This creates
> a need to add verification that we do have exact equivalence when we need
> it, e.g. ShuffleVector’s shuffle mask, not taking the address of InlineASM,
> etc.
>
> A quick example of semantic equivalence:
> %1 = add i32 1, i32 2
> %2 = add i32 2, i32 3
>
> These two instructions are not identical because the values of the
> operands are not identical. They are, however, semantically equivalent
> because they both perform the add operation.
> This can be seen by simply removing the operand values used in the
> calculations:
> %1 = add i32 , i32
> %2 = add i32 , i32
>
> Occurrence Verification:
>
> -        Machine Outliner
>   At the post RA level you don’t need to do any kind of special
> verification for candidate occurrences because you don’t have to deal with
> the concept of inputs.
>
> -        IR Outliner
>  At the IR/preRA level we need to do complex verification to make sure
> that the occurrences within a candidate have the same internal inputs. If
> two occurrences have different internal inputs then we need some form of
> control flow to maintain correctness. By internal inputs I mean the
> operands of instructions that come from an instruction within the
> occurrence, e.g.
>
> %2 = …
> // Start outlining occurrence.
> %3 = …
> %4 = sub %3, %2 // The first operand is an internal input, the second is
> external.
>
> If there is any confusion about why we need control flow for internal
> inputs I am more than happy to provide examples and more detailed
> explanations.
>
> Aside from internal inputs we also need to verify that the functions we
> are outlining from have compatible attributes.
>
> Cost Modeling:
>
> -        Machine Outliner
> At the MIR level the cost information is extremely accurate. So cost
> modeling is composed of effectively counting the number of instructions and
> adding some frame/setup cost.
>
> -        IR Outliner
> At the IR level we are working with estimates for the costs of certain
> instructions. We try to match the IR cost to the MIR cost as closely as
> possible and in practice we can get fairly close(Graph[1]).
>   Taking this a step further we need to estimate the cost/setup of having
> x amount of parameters and y outputs, as well as the register pressure from
> both the call and the potentially outlined function.
>
> Parameterization Optimizations:
>
> -        Machine Outliner
> The Machine outliner uses exact equivalence, which does not allow for any
> form of parameterization.
>
> -        IR Outliner
>           Being at the IR level requires us to tackle parameterization,
> which then brings several optimizations to help lower the cost of
> parameterizing a sequence.
>
> * Constant Folding
>           The IR outliner will identify constant inputs and fold them.
>
> * Congruent Input Condensing
>           The outliner identifies the congruent sets of parameters for a
> function. Example:
>           void fn(int, int); ->  void fn(int);
>           fn(1, 1);           ->  fn(1);
>           fn(%1, %1);    ->  fn(%1);
> Parameters 1 and 2 were found to be the same for each callsite of the
> function, so we condensed the congruent parameters.
>
> * Input Partitioning
>           The outliner partitions candidates that have a parameter that
> can be constant folded. Example:
>           fn(1);
>           fn(1);
>           fn(%1);
> Occurrences 1 and 2 in the above candidate can have parameter 1 folded. We
> create a new candidate containing just occurrences 1 and 2 as it may be
> more profitable than the original candidate.
>
> * Constant int condensing
>           The outliner identifies constant int parameters and checks to
> see if, for each occurrence, they are an equal distance from other constant
> int parameters. If so it removes all but one of the parameters and
> represents the others as an add from the base. Example:
>
>           void fn(int a, int b);
>           fn(1, 2);
>           fn(3, 4);
>
> In the above, parameters 1 and 2 are always a distance of 1 apart. We can
> redefine our function as:
>           void fn(int a) {
>            int b = a + 1;
>            …
>           }
>
> Register Usage:
>
> -        Machine Outliner
> The MO works post RA with exact equivalence, so the most it will compute
> is if it needs to save the link register on arm64.
>
> -        IR Outliner
> The IR outliner needs to compute register usage for the new outlined
> function as well as the usage after generating a function call with x
> parameters and y outputs at each program point z.
>
> Outlining:
>
> -        Machine Outliner
>           At the MIR level we clone the outlined instructions into a new
> function, create some prologue/epilogue for the function, and then generate
> a call.
>
> -        IR Outliner
>           At the IR level we also have to handle the parameters/outputs
> of the candidate. Here we need to merge all of the metadata of outlined
> instructions/outlined functions. We also need to identify congruent sets of
> parameters between call sites and then folding the amount of parameters
> that are needed for the call.
>
> Suffix Array vs Suffix Tree+LCP:
>
>           The two structures should compute the same result, but there is
> a non obvious benefit that we get from the suffix array. With the suffix
> array approach we identify candidates that shares common occurrences albeit
> with a different length. This is very useful for complex
> verification/analysis, e.g. at the IR or pre RA level. This allows us to
> cache the work when we calculating inputs or verifying the internal inputs
> of occurrences. Although this won't be an issue if/when we switch to a
> common interface for candidate selection.
>
> ---- A replacement for the Machine Outliner? Not exactly ----
>
> The IR outliner was never intended as a replacement for the machine
> outliner and the two can coexist. The outliners tend to catch very
> different cases: the machine outliner tends to favor very small candidate
> lengths. Using a build of llvm-tblgen, the machine outliner gets ~52% of
> its benefit from outlined functions of 2-3 instructions. The IR outliner
> tends to favor large candidate lengths(2-20+), often composed of function
> calls. 52% of the benefit for the IR outliner in the llvm-tblgen example is
> found in outlined functions with final lengths up to 17. Data for example
> runs of both can be found in the graph data file and is summarized in
> Graph[2].
>
>  Included in the performance data are metrics showing the performance of
> using both the IR outliner and machine outliner. The data indicates that
> you can achieve up to, and exceed, 2% reduction of both geomean and total
> size by using both.
>
> ---- Pros/Cons of IR----
>
> The current algorithm is implemented at the IR level, but there are trade
> offs to placing this transformation anywhere in the
> pipeline(IR/preRa/postRA).
>
> -- Less Precise Cost Modeling:
> Being at the IR level creates a need to estimate the size cost of any
> given instruction.
> - How much does this imprecision affect the benefit estimation?
> - Included in the data : Graph[1]: is the difference between our estimated
> function size and the actual size in the binary. It shows that we get very
> close and tend to be on the conservative side.
> - Estimation causes the IR outliner to be conservative. Which means that
> we are losing out on potential benefit by overestimating cost.
>
> -- Higher Level of Abstraction:
> - The outliners are essentially string matching algorithms. Being at a
> higher level of abstraction naturally gives more opportunities for
> equivalence. As an example, call instructions are handled naturally at the
> IR level.
> - Will a preRA outliner be able to have the same relaxation in congruence
> matching? E.g will it be able to match tail and non tail function calls?
> - Being at the IR level means that we lose out on some instruction
> lowering idioms, e.g. constant expressions, bitwise rotation([shl, lshr,
> or] -> [rot]), etc.
> - This is evident in the results for test suite for aarch64, in which the
> machine outliner outperforms the IR outliner due in part to the large
> amount of global accesses in the tests.
>
> -- Maintainability:
> - The IR level in general is much more maintainable.
>
> Why so? How did you measure that? What is your measure for “maintainable?"
>

The IR level, currently, has much better documentation, is more mature, and
AFAIK most developers are familiar with it. It's definitely subjective to
some degree.

>
> - We don’t have to be as conservative about certain ABI characteristics.
> This allows for the IR outliner to work without the need for any extra
> work(special options) from the users. For example, the machine outliner
> requires ‘noredzone’ but the IR outliner does not.
>
> -- Pipeline Flexibility:
> - As shown in the performance data below, we can get up to 2x performance
> by working pre function simplification. Though working pre simplification
> means the outliner must gamble between the benefits of outlining vs
> simplification.
>
> -- Loss of control:
> - The machine level can have more control over the outlining process. We
> could have optimized parameterization, alignment handling, etc.
>
> ---- Adapting the algorithm to pre-RA IR ----
> The analysis portion of the IR outliner is already IR agnostic for the
> most part. It works on indices into the congruency vector for instructions
> and their inputs/outputs. This would mean that a preRA outliner would only
> have to define the MIR specific portions: Congruency detection, cost
> analysis, parameter/output optimizations, and the outlining of beneficial
> candidates.
>
> That should apply the other way around too: take the MO outliner and
> adapt. No?
>

This is one of the things that the comparison helps to illustrate. If you
take the utilities that I mentioned above and port the current MO outliner
to use them, the resultant file is < 300 lines. It works on exact
equivalence, so all of the interesting parts of the algorithm, besides
legality detection and frame cost, are already taken care of by register
allocation. The only remaining parts are the value numbering and outlining,
but those are IR specific and have to be rewritten anyways.

 The IR implementation started at roughly this point, but exact equivalence
at the IR level doesn't really amount to much. This outliner adds the logic
for verification/analysis/optimization for when we don't have exact
equivalence. It's better to start from here because these are the real
problems that are going to have to be solved if an outliner is to exist
anywhere other than post RA.

Thanks,
 River Riddle

>
> -- Implementation --
> https://github.com/River707/llvm/blob/outliner/lib/Transform
> s/IPO/CodeSizeOutliner.cpp
>
> All feedback/comments/discussion welcome and appreciated!
>
> Thanks,
>  River Riddle
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170921/efaea5bc/attachment-0001.html>