[llvm-dev] [RFC] Interprocedural MIR-level outlining pass

Fri Aug 26 14:55:29 PDT 2016

I think the "Motivation" section explained that. I too first thought about
"why not at IR?" but the reason looks like MIR, post-RA has the most
accurate heuristics (best way to know looks like actually getting there).

Do you know if there is any experimental pass that relies on deriving
heuristics by a feedback loop after letting, ie. a duplicate
module/function/block continue past?

Regards,
Kevin

On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org
> wrote:

> Hi Jessica,
>
> This is quite interesting.
>
> Can you comment on why you started by doing this at the MI level, as
> opposed to the IR level. And at the MI level, why after RA instead of
> before RA?
>
> Perhaps the first question is another way of asking about how accurately
> we can model call-site code size at the IR level?
>
> Thanks in advance,
> Hal
>
> ----- Original Message -----
> > From: "Jessica Paquette via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev at lists.llvm.org
> > Sent: Friday, August 26, 2016 4:26:09 PM
> > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass
> >
> > Hi everyone,
> >
> > Since I haven't said anything on the mailing list before, a quick
> > introduction. I'm an intern at Apple, and over the summer I
> > implemented a
> > prototype for an outlining pass for code size in LLVM. Now I'm
> > seeking to
> > eventually upstream it. I have the preliminary code on GitHub right
> > now,
> > but it's still very prototypical (see the code section).
> >
> > ================================
> > Motivation
> > ================================
> > The goal of the internship was to create an interprocedural pass that
> > would reduce code size as much as possible, perhaps at the cost of
> > some
> > performance. This would be useful to, say, embedded programmers who
> > only
> > have a few kilobytes to work with and a substantial amount of code to
> > fit
> > in that space.
> >
> >
> > ================================
> > Approach and Initial Results
> > ================================
> > To do this, we chose to implement an outliner. Outliners find
> > sequences of
> > instructions which would be better off as a function call, by some
> > measure
> > of "better". In this case, the measure of "better" is "makes code
> > smaller".
> >
> >
> > ================================
> > Results
> > ================================
> > These results are from a fairly recent 64-bit Intel processor, using
> > a
> > version of Clang equipped with the outliner prototype versus an
> > equivalent
> > version of Clang without the outliner.
> >
> > CODE SIZE REDUCTION
> > For tests >=4 Kb in non-outlined size, the outliner currently
> > provides an
> > average of 12.94% code size reduction on the LLVM test suite in
> > comparison
> > to a default Clang, up to 51.4% code size reduction. In comparison to
> > a
> > Clang with -Oz, the outliner provides an average of a 1.29% code size
> > reduction, up to a 37% code size reduction. I believe that the -Oz
> > numbers
> > can be further improved by better tuning the outlining cost model.
> >
> > EXECUTION TIME IMPACT
> > On average, the outliner increases execution time by 2% on the LLVM
> > test
> > suite, but has been also shown to improve exection time by up to 16%.
> > These results were from a fairly recent Intel processor, so the
> > results
> > may vary. Recent Intel processors have very low latency for function
> > calls, which may impact these results. Execution time improvements
> > are
> > likely dependent on the latency of function calls, instruction
> > caching
> > behaviour, and the execution frequency of the code being outlined. In
> > partucular, using a processor with heavy function call latency will
> > likely
> > increase execution time overhead.
> >
> >
> > ================================
> > Implementation
> > ================================
> > The outliner, in its current state, is a MachineModulePass. It finds
> > *identical* sequences of MIR, after register allocation, and pulls
> > them
> > out into their own functions. Thus, it's effectively assembly-level.
> > Ultimately, the algorithm used is general, so it can sit anywhere,
> > but MIR
> > was very convenient for the time being.
> >
> > It requires two data structures.
> >
> > 1. A generalized suffix tree
> > 2. A "terminated string"
> >
> > 1: The generalized suffix tree is constructed using Ukkonen's linear
> > time
> > construction algorithm [1]. They require linear space and support
> > linear-time substring queries. In practice, the construction time for
> > the
> > suffix tree is the most time consuming part, but I haven't noticed a
> > difference in compilation time on, say, 12 MB .ll files.
> >
> > 2: To support the suffix tree, we need a "terminated string." This is
> > a
> > generalized string with an unique terminator character appended to
> > the
> > end. TerminatedStrings can be built from any type.
> >
> > The algorithm is then roughly as follows.
> >
> > 1. For each MachineBasicBlock in the program, build a
> > TerminatedString for
> > that block.
> > 2. Build a suffix tree for that collection of strings.
> > 3. Query the suffix tree for the longest repeated substring and place
> > that
> > string in a candidate list. Repeat until none are found.
> > 4. Create functions for each candidate.
> > 5. Replace each candidate with a call to its respective function.
> >
> > Currently, the program itself isn't stored in the suffix tree, but
> > rather
> > a "proxy string" of integers. This isn't necessary at the MIR level,
> > but
> > may be for an IR level extension of the algorithm.
> >
> >
> > ================================
> > Challenges
> > ================================
> > 1) MEMORY CONSUMPTION
> > Given a string of length n, a naive suffix tree implementation can
> > take up
> > to 40n bytes of memory. However, this number can be reduced to 20n
> > with a
> > bit of work [2]. Currently, the suffix tree stores the entire
> > program,
> > including instructions which really ought not to be outlined, such as
> > returns. These instructions should not be included in the final
> > implementation, but should rather act as terminators for the strings.
> > This
> > will likely curb memory consumption. Suffix trees have been used in
> > the
> > past in sliding-window-based compression schemes, which may serve as
> > a
> > source of inspiration for reducing memory overhead.[3]
> >
> > Nonetheless, the outliner probably shouldn't be run unless it really
> > has
> > to be run. It will likely be used mostly in embedded spaces, where
> > the
> > programs have to fit into small devices anyway. Thus, memory overhead
> > for
> > the compiler likely won't be a problem. The outliner should only be
> > used
> > in -Oz compilations, and possibly should have its own flag.
> >
> >
> > 2) EXECUTION TIME
> > Currently, the outliner isn't tuned for preventing execution time
> > increases. There is an average of a 2% execution time hit on the
> > tests in
> > the LLVM test suite, with a few outliers of up to 30%. The outliers
> > are
> > tests which contain hot loops. The outliner really ought to be able
> > to use
> > profiling information and not outline from hot areas. Another
> > suggestion
> > people have given me is to add a "never outline" directive which
> > would
> > allow the user to say something along the lines of "this is a hot
> > loop,
> > please never outline from it".
> >
> > It's also important to note that these numbers are coming from a
> > fairly
> > recent Intel processor.
> >
> >
> > 3) CONSTRAINTS ON INSTRUCTIONS
> > The outliner currently won't pull anything out of functions which use
> > a
> > red zone. It also won't pull anything out that uses the stack,
> > instruction
> > pointer, uses constant pool indices, CFI indices, jump table indices,
> > or
> > frame indices. This removes many opportunities for outlining which
> > would
> > likely be available at a higher level (such as IR). Thus, there's a
> > case
> > for moving this up to a higher level.
> >
> >
> > ================================
> > Additional Applications
> > ================================
> > The suffix tree itself could be used as a tool for finding
> > opportunities
> > to refactor code. For example, it could recognize places where the
> > user
> > likely copied and pasted some code. This could be run on codebases to
> > find
> > areas where people could manually outline things at the source level.
> >
> > Using the terminated string class, it would also be possible to
> > implement
> > other string algorithms on code. This may open the door to new ways
> > to
> > analyze existing codebases.
> >
> >
> > ================================
> > Roadmap
> > ================================
> > The current outliner is *very* prototypical. The version I would want
> > to
> > upstream would be a new implementation. Here's what I'd like to
> > address
> > and accomplish.
> >
> > 1. Ask "what does the LLVM community want from an outliner" and use
> > that
> > to drive development of the algorithm.
> > 2. Reimplement the outliner, perhaps using a less memory-intensve
> > data
> > structure like a suffix array.
> > 3. Begin adding features to the algorithm, for example:
> >     i.   Teaching the algorithm about hot/cold blocks of code and
> >     taking
> > that into account.
> >     ii.  Simple parameter passing.
> >     iii. Similar function outlining-- eg, noticing that two outlining
> > candidates are similar and can be merged into one function with some
> > control flow.
> >
> >
> > ================================
> > Code
> > ================================
> > Note: This code requires MachineModulePasses
> >
> > * Main pass:
> > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
> >
> > * Suffix tree:
> > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
> >
> > * TerminatedString and TerminatedStringList:
> > https://github.com/ornata/llvm/blob/master/include/llvm/
> ADT/TerminatedString.h
> >
> > Here are a couple unit tests for the data structures.
> >
> > * Suffix tree unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/
> ADT/SuffixTreeTest.cpp
> >
> > * TerminatedString unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/
> ADT/TerminatedStringTest.cpp
> >
> > * TerminatedStringList unit tests:
> > https://github.com/ornata/llvm/blob/master/unittests/
> ADT/TerminatedStringListTest.cpp
> >
> >
> > ================================
> > References
> > ================================
> > [1] Ukkonen's Algorithm:
> > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> > [2] Suffix Trees and Suffix Arrays:
> > http://web.cs.iastate.edu/~cs548/suffix.pdf
> > [3] Extended Application of Suffix Trees to Data Compression:
> > http://www.larsson.dogma.net/dccpaper.pdf
> >
> >
> > Thanks for reading,
> > Jessica
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160826/4bc04cb6/attachment-0001.html>