<div dir="ltr"><div><div><div><div>Thanks a bunch! I'll try to clarify a bit about why I did what I did.<br><br></div><div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div>In order to transform it to a *practically* useful optimization pass, though, "results" section should be strenghtened up a bit:<br></div>*
You simply say "tests >= 4 Kb", without specifying the nature of
these tests at all. Please tell exactly, along with exact compiler
options you used. If the tests are artificial, they might be good for
debugging, but worthless as a measure of usefulness of your pass<br></blockquote><br>Definitely agree!<br><br></div>The tests that I used were the C/C++ programs in the LLVM test suite, with externals enabled. What I did was run clang -mno-red-zone against clang -mno-red-zone with outlining, and clang -mno-red-zone -Oz against clang -mno-red-zone with outlining. What would be *better* would be to run it on some real embedded cases versus the programs in the test suite. What I found in the test suite is there are a few programs which do *really* well (like, say an 80% reduction!), but it turns out that those programs are typically entirely macros.<br><br></div><div>(Some other interesting test cases, less relevant to embedded, might be programs with heavy template usage.)<br></div><div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">* You use "a fairly recent 64-bit Intel processor" to test an
embedded-targeting optimization, which is odd. If you want to target x86
(good choice! :-)), perhaps a Quark chip might be a better choice? You
can measure code size impact without using a real hardware -- just add
"-triple i586-intel-elfiamcu" option.<br></blockquote><br></div><div>The reason I started out using an Intel processor was just to verify that I could even *run* outlined programs in the first place. I have been working on extending it to ARM, which would be a far more interesting for embedded than x86-64. That being said, this *can* potentially be useful on non-embedded computers but the benefits be much greater in an embedded space.<br><br></div><div>(I also didn't know about that flag, so thanks!)<br></div><div><br></div><div>- Jessica<br></div></div></div><div><div><div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Aug 29, 2016 at 12:20 PM, Andrey Bokhanko <span dir="ltr"><<a href="mailto:andreybokhanko@gmail.com" target="_blank">andreybokhanko@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div>Hi Jessica,<br><br></div>Let me echo others' comments -- this is a very interesting research project and a nice write-up, indeed!<br><br></div><div>In order to transform it to a *practically* useful optimization pass, though, "results" section should be strenghtened up a bit:<br></div><div>* You simply say "tests >= 4 Kb", without specifying the nature of these tests at all. Please tell exactly, along with exact compiler options you used. If the tests are artificial, they might be good for debugging, but worthless as a measure of usefulness of your pass.<br></div><div>* You use "a fairly recent 64-bit Intel processor" to test an embedded-targeting optimization, which is odd. If you want to target x86 (good choice! :-)), perhaps a Quark chip might be a better choice? You can measure code size impact without using a real hardware -- just add "-triple i586-intel-elfiamcu" option.<br></div><div><br></div>Yours,<br></div>Andrey<br><br></div><div class="gmail-HOEnZb"><div class="gmail-h5"><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Aug 27, 2016 at 12:26 AM, Jessica Paquette via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi everyone,<br>
<br>
Since I haven't said anything on the mailing list before, a quick<br>
introduction. I'm an intern at Apple, and over the summer I implemented a<br>
prototype for an outlining pass for code size in LLVM. Now I'm seeking to<br>
eventually upstream it. I have the preliminary code on GitHub right now,<br>
but it's still very prototypical (see the code section).<br>
<br>
==============================<wbr>==<br>
Motivation<br>
==============================<wbr>==<br>
The goal of the internship was to create an interprocedural pass that<br>
would reduce code size as much as possible, perhaps at the cost of some<br>
performance. This would be useful to, say, embedded programmers who only<br>
have a few kilobytes to work with and a substantial amount of code to fit<br>
in that space.<br>
<br>
<br>
==============================<wbr>==<br>
Approach and Initial Results<br>
==============================<wbr>==<br>
To do this, we chose to implement an outliner. Outliners find sequences of<br>
instructions which would be better off as a function call, by some measure<br>
of "better". In this case, the measure of "better" is "makes code<br>
smaller".<br>
<br>
<br>
==============================<wbr>==<br>
Results<br>
==============================<wbr>==<br>
These results are from a fairly recent 64-bit Intel processor, using a<br>
version of Clang equipped with the outliner prototype versus an equivalent<br>
version of Clang without the outliner.<br>
<br>
CODE SIZE REDUCTION<br>
For tests >=4 Kb in non-outlined size, the outliner currently provides an<br>
average of 12.94% code size reduction on the LLVM test suite in comparison<br>
to a default Clang, up to 51.4% code size reduction. In comparison to a<br>
Clang with -Oz, the outliner provides an average of a 1.29% code size<br>
reduction, up to a 37% code size reduction. I believe that the -Oz numbers<br>
can be further improved by better tuning the outlining cost model.<br>
<br>
EXECUTION TIME IMPACT<br>
On average, the outliner increases execution time by 2% on the LLVM test<br>
suite, but has been also shown to improve exection time by up to 16%.<br>
These results were from a fairly recent Intel processor, so the results<br>
may vary. Recent Intel processors have very low latency for function<br>
calls, which may impact these results. Execution time improvements are<br>
likely dependent on the latency of function calls, instruction caching<br>
behaviour, and the execution frequency of the code being outlined. In<br>
partucular, using a processor with heavy function call latency will likely<br>
increase execution time overhead.<br>
<br>
<br>
==============================<wbr>==<br>
Implementation<br>
==============================<wbr>==<br>
The outliner, in its current state, is a MachineModulePass. It finds<br>
*identical* sequences of MIR, after register allocation, and pulls them<br>
out into their own functions. Thus, it's effectively assembly-level.<br>
Ultimately, the algorithm used is general, so it can sit anywhere, but MIR<br>
was very convenient for the time being.<br>
<br>
It requires two data structures.<br>
<br>
1. A generalized suffix tree<br>
2. A "terminated string"<br>
<br>
1: The generalized suffix tree is constructed using Ukkonen's linear time<br>
construction algorithm [1]. They require linear space and support<br>
linear-time substring queries. In practice, the construction time for the<br>
suffix tree is the most time consuming part, but I haven't noticed a<br>
difference in compilation time on, say, 12 MB .ll files.<br>
<br>
2: To support the suffix tree, we need a "terminated string." This is a<br>
generalized string with an unique terminator character appended to the<br>
end. TerminatedStrings can be built from any type.<br>
<br>
The algorithm is then roughly as follows.<br>
<br>
1. For each MachineBasicBlock in the program, build a TerminatedString for<br>
that block.<br>
2. Build a suffix tree for that collection of strings.<br>
3. Query the suffix tree for the longest repeated substring and place that<br>
string in a candidate list. Repeat until none are found.<br>
4. Create functions for each candidate.<br>
5. Replace each candidate with a call to its respective function.<br>
<br>
Currently, the program itself isn't stored in the suffix tree, but rather<br>
a "proxy string" of integers. This isn't necessary at the MIR level, but<br>
may be for an IR level extension of the algorithm.<br>
<br>
<br>
==============================<wbr>==<br>
Challenges<br>
==============================<wbr>==<br>
1) MEMORY CONSUMPTION<br>
Given a string of length n, a naive suffix tree implementation can take up<br>
to 40n bytes of memory. However, this number can be reduced to 20n with a<br>
bit of work [2]. Currently, the suffix tree stores the entire program,<br>
including instructions which really ought not to be outlined, such as<br>
returns. These instructions should not be included in the final<br>
implementation, but should rather act as terminators for the strings. This<br>
will likely curb memory consumption. Suffix trees have been used in the<br>
past in sliding-window-based compression schemes, which may serve as a<br>
source of inspiration for reducing memory overhead.[3]<br>
<br>
Nonetheless, the outliner probably shouldn't be run unless it really has<br>
to be run. It will likely be used mostly in embedded spaces, where the<br>
programs have to fit into small devices anyway. Thus, memory overhead for<br>
the compiler likely won't be a problem. The outliner should only be used<br>
in -Oz compilations, and possibly should have its own flag.<br>
<br>
<br>
2) EXECUTION TIME<br>
Currently, the outliner isn't tuned for preventing execution time<br>
increases. There is an average of a 2% execution time hit on the tests in<br>
the LLVM test suite, with a few outliers of up to 30%. The outliers are<br>
tests which contain hot loops. The outliner really ought to be able to use<br>
profiling information and not outline from hot areas. Another suggestion<br>
people have given me is to add a "never outline" directive which would<br>
allow the user to say something along the lines of "this is a hot loop,<br>
please never outline from it".<br>
<br>
It's also important to note that these numbers are coming from a fairly<br>
recent Intel processor.<br>
<br>
<br>
3) CONSTRAINTS ON INSTRUCTIONS<br>
The outliner currently won't pull anything out of functions which use a<br>
red zone. It also won't pull anything out that uses the stack, instruction<br>
pointer, uses constant pool indices, CFI indices, jump table indices, or<br>
frame indices. This removes many opportunities for outlining which would<br>
likely be available at a higher level (such as IR). Thus, there's a case<br>
for moving this up to a higher level.<br>
<br>
<br>
==============================<wbr>==<br>
Additional Applications<br>
==============================<wbr>==<br>
The suffix tree itself could be used as a tool for finding opportunities<br>
to refactor code. For example, it could recognize places where the user<br>
likely copied and pasted some code. This could be run on codebases to find<br>
areas where people could manually outline things at the source level.<br>
<br>
Using the terminated string class, it would also be possible to implement<br>
other string algorithms on code. This may open the door to new ways to<br>
analyze existing codebases.<br>
<br>
<br>
==============================<wbr>==<br>
Roadmap<br>
==============================<wbr>==<br>
The current outliner is *very* prototypical. The version I would want to<br>
upstream would be a new implementation. Here's what I'd like to address<br>
and accomplish.<br>
<br>
1. Ask "what does the LLVM community want from an outliner" and use that<br>
to drive development of the algorithm.<br>
2. Reimplement the outliner, perhaps using a less memory-intensve data<br>
structure like a suffix array.<br>
3. Begin adding features to the algorithm, for example:<br>
i. Teaching the algorithm about hot/cold blocks of code and taking<br>
that into account.<br>
ii. Simple parameter passing.<br>
iii. Similar function outlining-- eg, noticing that two outlining<br>
candidates are similar and can be merged into one function with some<br>
control flow.<br>
<br>
<br>
==============================<wbr>==<br>
Code<br>
==============================<wbr>==<br>
Note: This code requires MachineModulePasses<br>
<br>
* Main pass:<br>
<a href="https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/lib/CodeGen/Machi<wbr>neOutliner.h</a><br>
<br>
* Suffix tree:<br>
<a href="https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/include/llvm/ADT/<wbr>SuffixTree.h</a><br>
<br>
* TerminatedString and TerminatedStringList:<br>
<a href="https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/include/llvm/ADT/<wbr>TerminatedString.h</a><br>
<br>
Here are a couple unit tests for the data structures.<br>
<br>
* Suffix tree unit tests:<br>
<a href="https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/unittests/ADT/<wbr>SuffixTreeTest.cpp</a><br>
<br>
* TerminatedString unit tests:<br>
<a href="https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/unittests/ADT/<wbr>TerminatedStringTest.cpp</a><br>
<br>
* TerminatedStringList unit tests:<br>
<a href="https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp" rel="noreferrer" target="_blank">https://github.com/ornata/llvm<wbr>/blob/master/unittests/ADT/<wbr>TerminatedStringListTest.cpp</a><br>
<br>
<br>
==============================<wbr>==<br>
References<br>
==============================<wbr>==<br>
[1] Ukkonen's Algorithm:<br>
<a href="https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf" rel="noreferrer" target="_blank">https://www.cs.helsinki.fi/u/u<wbr>kkonen/SuffixT1withFigs.pdf</a><br>
[2] Suffix Trees and Suffix Arrays:<br>
<a href="http://web.cs.iastate.edu/%7Ecs548/suffix.pdf" rel="noreferrer" target="_blank">http://web.cs.iastate.edu/~cs5<wbr>48/suffix.pdf</a><br>
[3] Extended Application of Suffix Trees to Data Compression:<br>
<a href="http://www.larsson.dogma.net/dccpaper.pdf" rel="noreferrer" target="_blank">http://www.larsson.dogma.net/d<wbr>ccpaper.pdf</a><br>
<br>
<br>
Thanks for reading,<br>
Jessica<br>
<br>
______________________________<wbr>_________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div></div></div></div></div></div></div>