[llvm-dev] [RFC] Interprocedural MIR-level outlining pass

Jessica Paquette via llvm-dev llvm-dev at lists.llvm.org
Mon Aug 29 23:05:24 PDT 2016


Thanks a bunch! I'll try to clarify a bit about why I did what I did.

> In order to transform it into a *practically* useful optimization pass,
> though, the "results" section should be strengthened a bit:
> * You simply say "tests >= 4 KB", without specifying the nature of these
> tests at all. Please describe them exactly, along with the exact compiler
> options you used. If the tests are artificial, they might be good for
> debugging, but worthless as a measure of the usefulness of your pass.
>

Definitely agree!

The tests I used were the C/C++ programs in the LLVM test suite, with
externals enabled. What I did was compare clang -mno-red-zone against clang
-mno-red-zone with outlining, and clang -mno-red-zone -Oz against clang
-mno-red-zone with outlining. What would be *better* would be to run it on
some real embedded cases rather than the programs in the test suite. What I
found in the test suite is that there are a few programs which do *really*
well (say, an 80% reduction!), but it turns out that those programs are
typically entirely macros.

(Some other interesting test cases, less relevant to embedded, might be
programs with heavy template usage.)

> * You use "a fairly recent 64-bit Intel processor" to test an
> embedded-targeting optimization, which is odd. If you want to target x86
> (good choice! :-)), perhaps a Quark chip might be a better choice? You can
> measure the code size impact without using real hardware -- just add the
> "-triple i586-intel-elfiamcu" option.
>

The reason I started out using an Intel processor was just to verify that I
could even *run* outlined programs in the first place. I have been working
on extending it to ARM, which would be far more interesting for embedded
than x86-64. That being said, this *can* potentially be useful on
non-embedded computers, but the benefits would be much greater in an
embedded space.

(I also didn't know about that flag, so thanks!)

- Jessica

On Mon, Aug 29, 2016 at 12:20 PM, Andrey Bokhanko <andreybokhanko at gmail.com>
wrote:

> Hi Jessica,
>
> Let me echo others' comments -- this is a very interesting research
> project and a nice write-up, indeed!
>
> In order to transform it into a *practically* useful optimization pass,
> though, the "results" section should be strengthened a bit:
> * You simply say "tests >= 4 KB", without specifying the nature of these
> tests at all. Please describe them exactly, along with the exact compiler
> options you used. If the tests are artificial, they might be good for
> debugging, but worthless as a measure of the usefulness of your pass.
> * You use "a fairly recent 64-bit Intel processor" to test an
> embedded-targeting optimization, which is odd. If you want to target x86
> (good choice! :-)), perhaps a Quark chip might be a better choice? You can
> measure the code size impact without using real hardware -- just add the
> "-triple i586-intel-elfiamcu" option.
>
> Yours,
> Andrey
>
>
> On Sat, Aug 27, 2016 at 12:26 AM, Jessica Paquette via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi everyone,
>>
>> Since I haven't said anything on the mailing list before, a quick
>> introduction. I'm an intern at Apple, and over the summer I implemented a
>> prototype for an outlining pass for code size in LLVM. Now I'm seeking to
>> eventually upstream it. I have the preliminary code on GitHub right now,
>> but it's still very prototypical (see the code section).
>>
>> ================================
>> Motivation
>> ================================
>> The goal of the internship was to create an interprocedural pass that
>> would reduce code size as much as possible, perhaps at the cost of some
>> performance. This would be useful to, say, embedded programmers who only
>> have a few kilobytes to work with and a substantial amount of code to fit
>> in that space.
>>
>>
>> ================================
>> Approach and Initial Results
>> ================================
>> To do this, we chose to implement an outliner. Outliners find sequences of
>> instructions which would be better off as a function call, by some measure
>> of "better". In this case, the measure of "better" is "makes code
>> smaller".
>>
>>
>> ================================
>> Results
>> ================================
>> These results are from a fairly recent 64-bit Intel processor, using a
>> version of Clang equipped with the outliner prototype versus an equivalent
>> version of Clang without the outliner.
>>
>> CODE SIZE REDUCTION
>> For tests >= 4 KB in non-outlined size, the outliner currently provides an
>> average code size reduction of 12.94% on the LLVM test suite in comparison
>> to a default Clang, and up to 51.4%. In comparison to Clang with -Oz, the
>> outliner provides an average code size reduction of 1.29%, and up to 37%.
>> I believe that the -Oz numbers can be further improved by better tuning
>> the outlining cost model.
>>
>> EXECUTION TIME IMPACT
>> On average, the outliner increases execution time by 2% on the LLVM test
>> suite, but it has also been shown to improve execution time by up to 16%.
>> These results were gathered on a fairly recent Intel processor, so they
>> may vary on other hardware. Recent Intel processors have very low latency
>> for function calls, which may affect these results. Execution time
>> improvements are likely dependent on the latency of function calls,
>> instruction caching behaviour, and the execution frequency of the code
>> being outlined. In particular, using a processor with high function call
>> latency will likely increase the execution time overhead.
>>
>>
>> ================================
>> Implementation
>> ================================
>> The outliner, in its current state, is a MachineModulePass. It finds
>> *identical* sequences of MIR, after register allocation, and pulls them
>> out into their own functions. Thus, it's effectively assembly-level.
>> Ultimately, the algorithm used is general, so it can sit anywhere, but MIR
>> was very convenient for the time being.
>>
>> It requires two data structures.
>>
>> 1. A generalized suffix tree
>> 2. A "terminated string"
>>
>> 1: The generalized suffix tree is constructed using Ukkonen's linear-time
>> construction algorithm [1]. It requires linear space and supports
>> linear-time substring queries. In practice, constructing the suffix tree
>> is the most time-consuming part, but I haven't noticed a difference in
>> compilation time on, say, 12 MB .ll files.
>>
>> 2: To support the suffix tree, we need a "terminated string." This is a
>> generalized string with a unique terminator character appended to the
>> end. TerminatedStrings can be built from any type.
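>>
>> As a sketch of the idea (illustrative names only -- the real class is
>> linked in the "Code" section below), it's just a sequence plus a
>> per-string terminator that never compares equal to a real element:
>>
>> #include <cstddef>
>> #include <utility>
>> #include <vector>
>>
>> // Sketch only: a string of Ts followed by a unique terminator. Modelling
>> // the terminator as the index one past the end guarantees that no suffix
>> // of one string can run into another inside the generalized suffix tree.
>> template <typename T> class TerminatedString {
>>   std::vector<T> Elements;
>>   unsigned TerminatorID; // distinct for every string in the collection
>>
>> public:
>>   TerminatedString(std::vector<T> Elts, unsigned ID)
>>       : Elements(std::move(Elts)), TerminatorID(ID) {}
>>   size_t size() const { return Elements.size() + 1; } // +1: terminator
>>   bool isTerminator(size_t Idx) const { return Idx == Elements.size(); }
>>   unsigned terminatorID() const { return TerminatorID; }
>>   const T &operator[](size_t Idx) const { return Elements[Idx]; }
>> };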
>>
>> The algorithm is then roughly as follows.
>>
>> 1. For each MachineBasicBlock in the program, build a TerminatedString for
>> that block.
>> 2. Build a suffix tree for that collection of strings.
>> 3. Query the suffix tree for the longest repeated substring and place that
>> string in a candidate list. Repeat until none are found.
>> 4. Create functions for each candidate.
>> 5. Replace each candidate with a call to its respective function.
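>>
>> To make the flavor of steps 3-5 concrete, here is a tiny toy (emphatically
>> *not* the prototype): each character stands in for one instruction, and a
>> naive scan stands in for the suffix tree query.
>>
>> #include <iostream>
>> #include <string>
>> #include <vector>
>>
>> // Toy stand-in for step 3: the longest substring occurring at least
>> // twice. The prototype uses the suffix tree for this query; a naive
>> // scan is used here only to keep the example short and runnable.
>> static std::string longestRepeat(const std::string &S) {
>>   std::string Best;
>>   for (size_t I = 0; I < S.size(); ++I)
>>     for (size_t Len = Best.size() + 1; I + Len <= S.size(); ++Len) {
>>       std::string Cand = S.substr(I, Len);
>>       if (S.find(Cand, I + 1) == std::string::npos)
>>         break;
>>       Best = Cand;
>>     }
>>   return Best;
>> }
>>
>> int main() {
>>   // Each character stands in for one MIR instruction; the string is the
>>   // whole program (steps 1-2 would build it from MachineBasicBlocks).
>>   std::string Program = "abcxyzabcqabc";
>>   std::vector<std::string> OutlinedFunctions;
>>
>>   // Steps 3-5: repeatedly pull out the longest repeated sequence and
>>   // replace every occurrence of it with a one-character "call" marker.
>>   for (std::string Rep = longestRepeat(Program); Rep.size() > 1;
>>        Rep = longestRepeat(Program)) {
>>     char CallMarker = static_cast<char>('0' + OutlinedFunctions.size());
>>     OutlinedFunctions.push_back(Rep);
>>     for (size_t Pos = Program.find(Rep); Pos != std::string::npos;
>>          Pos = Program.find(Rep, Pos + 1))
>>       Program.replace(Pos, Rep.size(), 1, CallMarker);
>>   }
>>
>>   std::cout << "Outlined program: " << Program << "\n";
>>   for (size_t I = 0; I < OutlinedFunctions.size(); ++I)
>>     std::cout << "OUTLINED_" << I << ": " << OutlinedFunctions[I] << "\n";
>> }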
>>
>> Currently, the suffix tree doesn't store the program itself, but rather a
>> "proxy string" of integers. This isn't strictly necessary at the MIR
>> level, but may be for an IR-level extension of the algorithm.
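>>
>> As a sketch of what that mapping looks like (illustrative only -- the
>> real mapping is built from the MIR itself), every distinct instruction
>> simply gets a small integer ID, so identical instructions map to
>> identical integers:
>>
>> #include <map>
>> #include <string>
>> #include <vector>
>>
>> // Identical instructions (represented here by their textual form) get
>> // identical IDs, so the integer "proxy string" preserves exactly the
>> // equalities the suffix tree needs to see.
>> std::vector<unsigned>
>> buildProxyString(const std::vector<std::string> &Insts) {
>>   std::map<std::string, unsigned> ID;
>>   std::vector<unsigned> Proxy;
>>   for (const std::string &I : Insts) {
>>     unsigned NextID = static_cast<unsigned>(ID.size());
>>     Proxy.push_back(ID.insert({I, NextID}).first->second);
>>   }
>>   return Proxy;
>> }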
>>
>>
>> ================================
>> Challenges
>> ================================
>> 1) MEMORY CONSUMPTION
>> Given a string of length n, a naive suffix tree implementation can take up
>> to 40n bytes of memory. However, this number can be reduced to 20n with a
>> bit of work [2]. Currently, the suffix tree stores the entire program,
>> including instructions which really ought not to be outlined, such as
>> returns. These instructions should not be included in the final
>> implementation, but should rather act as terminators for the strings. This
>> will likely curb memory consumption. Suffix trees have been used in the
>> past in sliding-window-based compression schemes, which may serve as a
>> source of inspiration for reducing memory overhead [3].
>>
>> Nonetheless, the outliner probably shouldn't be run unless it really has
>> to be run. It will likely be used mostly in embedded spaces, where the
>> programs have to fit into small devices anyway. Thus, memory overhead for
>> the compiler likely won't be a problem. The outliner should only be used
>> in -Oz compilations, and possibly should have its own flag.
>>
>>
>> 2) EXECUTION TIME
>> Currently, the outliner isn't tuned for preventing execution time
>> increases. There is an average of a 2% execution time hit on the tests in
>> the LLVM test suite, with a few outliers of up to 30%. The outliers are
>> tests which contain hot loops. The outliner really ought to be able to use
>> profiling information and not outline from hot areas. Another suggestion
>> people have given me is to add a "never outline" directive which would
>> allow the user to say something along the lines of "this is a hot loop,
>> please never outline from it".
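>>
>> Purely to illustrate what such a directive might look like at the source
>> level (hypothetical -- nothing like this exists today), one possible
>> prototype spelling would reuse Clang's existing annotate attribute and
>> have the outliner skip anything carrying the marker:
>>
>> // Hypothetical: "nooutline" is a made-up annotation string, not a real
>> // directive. The outliner would simply refuse to outline from functions
>> // that carry it.
>> __attribute__((annotate("nooutline")))
>> void hotInnerLoop(int *Data, int N) {
>>   for (int I = 0; I < N; ++I)
>>     Data[I] *= 2;
>> }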
>>
>> It's also important to note that these numbers are coming from a fairly
>> recent Intel processor.
>>
>>
>> 3) CONSTRAINTS ON INSTRUCTIONS
>> The outliner currently won't pull anything out of functions which use a
>> red zone. It also won't pull out anything that uses the stack, the
>> instruction pointer, constant pool indices, CFI indices, jump table
>> indices, or frame indices. This removes many outlining opportunities
>> which would likely be available at a higher level (such as IR). Thus,
>> there's a case for moving this up to a higher level.
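>>
>> For a sense of what the per-instruction check looks like, here's a rough
>> sketch using the usual MachineOperand predicates (not the prototype's
>> exact code; the stack pointer and instruction pointer checks are
>> target-specific and omitted here):
>>
>> #include "llvm/CodeGen/MachineInstr.h"
>> #include "llvm/CodeGen/MachineOperand.h"
>>
>> using namespace llvm;
>>
>> // Reject anything whose operands refer to function-local structures that
>> // wouldn't mean the same thing inside a freshly created outlined
>> // function: frame indices, constant pool entries, jump tables, and CFI
>> // indices.
>> static bool isLegalToOutline(const MachineInstr &MI) {
>>   if (MI.isCall() || MI.isReturn() || MI.isTerminator())
>>     return false;
>>   for (const MachineOperand &MO : MI.operands())
>>     if (MO.isFI() || MO.isCPI() || MO.isJTI() || MO.isCFIIndex())
>>       return false;
>>   return true;
>> }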
>>
>>
>> ================================
>> Additional Applications
>> ================================
>> The suffix tree itself could be used as a tool for finding opportunities
>> to refactor code. For example, it could recognize places where the user
>> likely copied and pasted some code. This could be run on codebases to find
>> areas where people could manually outline things at the source level.
>>
>> Using the terminated string class, it would also be possible to implement
>> other string algorithms on code. This may open the door to new ways to
>> analyze existing codebases.
>>
>>
>> ================================
>> Roadmap
>> ================================
>> The current outliner is *very* prototypical. The version I would want to
>> upstream would be a new implementation. Here's what I'd like to address
>> and accomplish.
>>
>> 1. Ask "what does the LLVM community want from an outliner" and use that
>> to drive development of the algorithm.
>> 2. Reimplement the outliner, perhaps using a less memory-intensive data
>> structure like a suffix array.
>> 3. Begin adding features to the algorithm, for example:
>>     i.   Teaching the algorithm about hot/cold blocks of code and taking
>> that into account.
>>     ii.  Simple parameter passing.
>>     iii. Similar-function outlining -- e.g., noticing that two outlining
>> candidates are similar and can be merged into one function with some
>> control flow.
>>
>>
>> ================================
>> Code
>> ================================
>> Note: This code requires MachineModulePasses
>>
>> * Main pass:
>> https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
>>
>> * Suffix tree:
>> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
>>
>> * TerminatedString and TerminatedStringList:
>> https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
>>
>> Here are a couple unit tests for the data structures.
>>
>> * Suffix tree unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
>>
>> * TerminatedString unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
>>
>> * TerminatedStringList unit tests:
>> https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
>>
>>
>> ================================
>> References
>> ================================
>> [1] Ukkonen's Algorithm:
>> https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
>> [2] Suffix Trees and Suffix Arrays:
>> http://web.cs.iastate.edu/~cs548/suffix.pdf
>> [3] Extended Application of Suffix Trees to Data Compression:
>> http://www.larsson.dogma.net/dccpaper.pdf
>>
>>
>> Thanks for reading,
>> Jessica
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>