[llvm-dev] [RFC] Interprocedural MIR-level outlining pass

Sat Aug 27 13:38:47 PDT 2016

Thanks for a well written and informative writeup.  This was a pleasure 
to read and laid out the issues well for consideration.

On a technical level, I could see this having a secondary use for 
developers of the compiler itself.  Seeing the same pattern of assembly 
many times across a large binary is often a hint of a missed 
optimization.  Using this type of technique to generate candidates for 
further investigation seems interesting.

Philip

On 08/26/2016 02:26 PM, Jessica Paquette via llvm-dev wrote:
> Hi everyone,
>
> Since I haven't said anything on the mailing list before, a quick
> introduction. I'm an intern at Apple, and over the summer I implemented a
> prototype for an outlining pass for code size in LLVM. Now I'm seeking to
> eventually upstream it. I have the preliminary code on GitHub right now,
> but it's still very prototypical (see the code section).
>
> ================================
> Motivation
> ================================
> The goal of the internship was to create an interprocedural pass that
> would reduce code size as much as possible, perhaps at the cost of some
> performance. This would be useful to, say, embedded programmers who only
> have a few kilobytes to work with and a substantial amount of code to fit
> in that space.
>
>
> ================================
> Approach and Initial Results
> ================================
> To do this, we chose to implement an outliner. Outliners find sequences of
> instructions which would be better off as a function call, by some measure
> of "better". In this case, the measure of "better" is "makes code
> smaller".
>
>
> ================================
> Results
> ================================
> These results are from a fairly recent 64-bit Intel processor, using a
> version of Clang equipped with the outliner prototype versus an equivalent
> version of Clang without the outliner.
>
> CODE SIZE REDUCTION
> For tests >=4 Kb in non-outlined size, the outliner currently provides an
> average of 12.94% code size reduction on the LLVM test suite in comparison
> to a default Clang, up to 51.4% code size reduction. In comparison to a
> Clang with -Oz, the outliner provides an average of a 1.29% code size
> reduction, up to a 37% code size reduction. I believe that the -Oz numbers
> can be further improved by better tuning the outlining cost model.
>
> EXECUTION TIME IMPACT
> On average, the outliner increases execution time by 2% on the LLVM test
> suite, but has been also shown to improve exection time by up to 16%.
> These results were from a fairly recent Intel processor, so the results
> may vary. Recent Intel processors have very low latency for function
> calls, which may impact these results. Execution time improvements are
> likely dependent on the latency of function calls, instruction caching
> behaviour, and the execution frequency of the code being outlined. In
> partucular, using a processor with heavy function call latency will likely
> increase execution time overhead.
>
>
> ================================
> Implementation
> ================================
> The outliner, in its current state, is a MachineModulePass. It finds
> *identical* sequences of MIR, after register allocation, and pulls them
> out into their own functions. Thus, it's effectively assembly-level.
> Ultimately, the algorithm used is general, so it can sit anywhere, but MIR
> was very convenient for the time being.
>
> It requires two data structures.
>
> 1. A generalized suffix tree
> 2. A "terminated string"
>
> 1: The generalized suffix tree is constructed using Ukkonen's linear time
> construction algorithm [1]. They require linear space and support
> linear-time substring queries. In practice, the construction time for the
> suffix tree is the most time consuming part, but I haven't noticed a
> difference in compilation time on, say, 12 MB .ll files.
>
> 2: To support the suffix tree, we need a "terminated string." This is a
> generalized string with an unique terminator character appended to the
> end. TerminatedStrings can be built from any type.
>
> The algorithm is then roughly as follows.
>
> 1. For each MachineBasicBlock in the program, build a TerminatedString for
> that block.
> 2. Build a suffix tree for that collection of strings.
> 3. Query the suffix tree for the longest repeated substring and place that
> string in a candidate list. Repeat until none are found.
> 4. Create functions for each candidate.
> 5. Replace each candidate with a call to its respective function.
>
> Currently, the program itself isn't stored in the suffix tree, but rather
> a "proxy string" of integers. This isn't necessary at the MIR level, but
> may be for an IR level extension of the algorithm.
>
>
> ================================
> Challenges
> ================================
> 1) MEMORY CONSUMPTION
> Given a string of length n, a naive suffix tree implementation can take up
> to 40n bytes of memory. However, this number can be reduced to 20n with a
> bit of work [2]. Currently, the suffix tree stores the entire program,
> including instructions which really ought not to be outlined, such as
> returns. These instructions should not be included in the final
> implementation, but should rather act as terminators for the strings. This
> will likely curb memory consumption. Suffix trees have been used in the
> past in sliding-window-based compression schemes, which may serve as a
> source of inspiration for reducing memory overhead.[3]
>
> Nonetheless, the outliner probably shouldn't be run unless it really has
> to be run. It will likely be used mostly in embedded spaces, where the
> programs have to fit into small devices anyway. Thus, memory overhead for
> the compiler likely won't be a problem. The outliner should only be used
> in -Oz compilations, and possibly should have its own flag.
>
>
> 2) EXECUTION TIME
> Currently, the outliner isn't tuned for preventing execution time
> increases. There is an average of a 2% execution time hit on the tests in
> the LLVM test suite, with a few outliers of up to 30%. The outliers are
> tests which contain hot loops. The outliner really ought to be able to use
> profiling information and not outline from hot areas. Another suggestion
> people have given me is to add a "never outline" directive which would
> allow the user to say something along the lines of "this is a hot loop,
> please never outline from it".
>
> It's also important to note that these numbers are coming from a fairly
> recent Intel processor.
>
>
> 3) CONSTRAINTS ON INSTRUCTIONS
> The outliner currently won't pull anything out of functions which use a
> red zone. It also won't pull anything out that uses the stack, instruction
> pointer, uses constant pool indices, CFI indices, jump table indices, or
> frame indices. This removes many opportunities for outlining which would
> likely be available at a higher level (such as IR). Thus, there's a case
> for moving this up to a higher level.
>
>
> ================================
> Additional Applications
> ================================
> The suffix tree itself could be used as a tool for finding opportunities
> to refactor code. For example, it could recognize places where the user
> likely copied and pasted some code. This could be run on codebases to find
> areas where people could manually outline things at the source level.
>
> Using the terminated string class, it would also be possible to implement
> other string algorithms on code. This may open the door to new ways to
> analyze existing codebases.
>
>
> ================================
> Roadmap
> ================================
> The current outliner is *very* prototypical. The version I would want to
> upstream would be a new implementation. Here's what I'd like to address
> and accomplish.
>
> 1. Ask "what does the LLVM community want from an outliner" and use that
> to drive development of the algorithm.
> 2. Reimplement the outliner, perhaps using a less memory-intensve data
> structure like a suffix array.
> 3. Begin adding features to the algorithm, for example:
>      i.   Teaching the algorithm about hot/cold blocks of code and taking
> that into account.
>      ii.  Simple parameter passing.
>      iii. Similar function outlining-- eg, noticing that two outlining
> candidates are similar and can be merged into one function with some
> control flow.
>
>
> ================================
> Code
> ================================
> Note: This code requires MachineModulePasses
>
> * Main pass:
> https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
>
> * Suffix tree:
> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
>
> * TerminatedString and TerminatedStringList:
> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
>
> Here are a couple unit tests for the data structures.
>
> * Suffix tree unit tests:
> https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
>
> * TerminatedString unit tests:
> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
>
> * TerminatedStringList unit tests:
> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
>
>
> ================================
> References
> ================================
> [1] Ukkonen's Algorithm:
> https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> [2] Suffix Trees and Suffix Arrays:
> http://web.cs.iastate.edu/~cs548/suffix.pdf
> [3] Extended Application of Suffix Trees to Data Compression:
> http://www.larsson.dogma.net/dccpaper.pdf
>
>
> Thanks for reading,
> Jessica
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev