[llvm-dev] [RFC] Interprocedural MIR-level outlining pass
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 26 15:01:00 PDT 2016
----- Original Message -----
> From: "Kevin Choi" <code.kchoi at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 26, 2016 4:55:29 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
> pass
> I think the "Motivation" section explained that.
I don't think it explained it.
> I too first thought about "why not at IR?" but the reason looks like
> MIR, post-RA has the most accurate heuristics (best way to know
> looks like actually getting there).
But also, potentially, the fewest opportunities. That's why I'm curious about the motivation - the trade offs are not obvious to me.
-Hal
> Do you know if there is any experimental pass that relies on deriving
> heuristics by a feedback loop after letting, ie. a duplicate
> module/function/block continue past?
> Regards,
> Kevin
> On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org > wrote:
> > Hi Jessica,
>
> > This is quite interesting.
>
> > Can you comment on why you started by doing this at the MI level,
> > as
> > opposed to the IR level. And at the MI level, why after RA instead
> > of before RA?
>
> > Perhaps the first question is another way of asking about how
> > accurately we can model call-site code size at the IR level?
>
> > Thanks in advance,
>
> > Hal
>
> > ----- Original Message -----
>
> > > From: "Jessica Paquette via llvm-dev" < llvm-dev at lists.llvm.org >
>
> > > To: llvm-dev at lists.llvm.org
>
> > > Sent: Friday, August 26, 2016 4:26:09 PM
>
> > > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining
> > > pass
>
> > >
>
> > > Hi everyone,
>
> > >
>
> > > Since I haven't said anything on the mailing list before, a quick
>
> > > introduction. I'm an intern at Apple, and over the summer I
>
> > > implemented a
>
> > > prototype for an outlining pass for code size in LLVM. Now I'm
>
> > > seeking to
>
> > > eventually upstream it. I have the preliminary code on GitHub
> > > right
>
> > > now,
>
> > > but it's still very prototypical (see the code section).
>
> > >
>
> > > ================================
>
> > > Motivation
>
> > > ================================
>
> > > The goal of the internship was to create an interprocedural pass
> > > that
>
> > > would reduce code size as much as possible, perhaps at the cost
> > > of
>
> > > some
>
> > > performance. This would be useful to, say, embedded programmers
> > > who
>
> > > only
>
> > > have a few kilobytes to work with and a substantial amount of
> > > code
> > > to
>
> > > fit
>
> > > in that space.
>
> > >
>
> > >
>
> > > ================================
>
> > > Approach and Initial Results
>
> > > ================================
>
> > > To do this, we chose to implement an outliner. Outliners find
>
> > > sequences of
>
> > > instructions which would be better off as a function call, by
> > > some
>
> > > measure
>
> > > of "better". In this case, the measure of "better" is "makes code
>
> > > smaller".
>
> > >
>
> > >
>
> > > ================================
>
> > > Results
>
> > > ================================
>
> > > These results are from a fairly recent 64-bit Intel processor,
> > > using
>
> > > a
>
> > > version of Clang equipped with the outliner prototype versus an
>
> > > equivalent
>
> > > version of Clang without the outliner.
>
> > >
>
> > > CODE SIZE REDUCTION
>
> > > For tests >=4 Kb in non-outlined size, the outliner currently
>
> > > provides an
>
> > > average of 12.94% code size reduction on the LLVM test suite in
>
> > > comparison
>
> > > to a default Clang, up to 51.4% code size reduction. In
> > > comparison
> > > to
>
> > > a
>
> > > Clang with -Oz, the outliner provides an average of a 1.29% code
> > > size
>
> > > reduction, up to a 37% code size reduction. I believe that the
> > > -Oz
>
> > > numbers
>
> > > can be further improved by better tuning the outlining cost
> > > model.
>
> > >
>
> > > EXECUTION TIME IMPACT
>
> > > On average, the outliner increases execution time by 2% on the
> > > LLVM
>
> > > test
>
> > > suite, but has been also shown to improve exection time by up to
> > > 16%.
>
> > > These results were from a fairly recent Intel processor, so the
>
> > > results
>
> > > may vary. Recent Intel processors have very low latency for
> > > function
>
> > > calls, which may impact these results. Execution time
> > > improvements
>
> > > are
>
> > > likely dependent on the latency of function calls, instruction
>
> > > caching
>
> > > behaviour, and the execution frequency of the code being
> > > outlined.
> > > In
>
> > > partucular, using a processor with heavy function call latency
> > > will
>
> > > likely
>
> > > increase execution time overhead.
>
> > >
>
> > >
>
> > > ================================
>
> > > Implementation
>
> > > ================================
>
> > > The outliner, in its current state, is a MachineModulePass. It
> > > finds
>
> > > *identical* sequences of MIR, after register allocation, and
> > > pulls
>
> > > them
>
> > > out into their own functions. Thus, it's effectively
> > > assembly-level.
>
> > > Ultimately, the algorithm used is general, so it can sit
> > > anywhere,
>
> > > but MIR
>
> > > was very convenient for the time being.
>
> > >
>
> > > It requires two data structures.
>
> > >
>
> > > 1. A generalized suffix tree
>
> > > 2. A "terminated string"
>
> > >
>
> > > 1: The generalized suffix tree is constructed using Ukkonen's
> > > linear
>
> > > time
>
> > > construction algorithm [1]. They require linear space and support
>
> > > linear-time substring queries. In practice, the construction time
> > > for
>
> > > the
>
> > > suffix tree is the most time consuming part, but I haven't
> > > noticed
> > > a
>
> > > difference in compilation time on, say, 12 MB .ll files.
>
> > >
>
> > > 2: To support the suffix tree, we need a "terminated string."
> > > This
> > > is
>
> > > a
>
> > > generalized string with an unique terminator character appended
> > > to
>
> > > the
>
> > > end. TerminatedStrings can be built from any type.
>
> > >
>
> > > The algorithm is then roughly as follows.
>
> > >
>
> > > 1. For each MachineBasicBlock in the program, build a
>
> > > TerminatedString for
>
> > > that block.
>
> > > 2. Build a suffix tree for that collection of strings.
>
> > > 3. Query the suffix tree for the longest repeated substring and
> > > place
>
> > > that
>
> > > string in a candidate list. Repeat until none are found.
>
> > > 4. Create functions for each candidate.
>
> > > 5. Replace each candidate with a call to its respective function.
>
> > >
>
> > > Currently, the program itself isn't stored in the suffix tree,
> > > but
>
> > > rather
>
> > > a "proxy string" of integers. This isn't necessary at the MIR
> > > level,
>
> > > but
>
> > > may be for an IR level extension of the algorithm.
>
> > >
>
> > >
>
> > > ================================
>
> > > Challenges
>
> > > ================================
>
> > > 1) MEMORY CONSUMPTION
>
> > > Given a string of length n, a naive suffix tree implementation
> > > can
>
> > > take up
>
> > > to 40n bytes of memory. However, this number can be reduced to
> > > 20n
>
> > > with a
>
> > > bit of work [2]. Currently, the suffix tree stores the entire
>
> > > program,
>
> > > including instructions which really ought not to be outlined,
> > > such
> > > as
>
> > > returns. These instructions should not be included in the final
>
> > > implementation, but should rather act as terminators for the
> > > strings.
>
> > > This
>
> > > will likely curb memory consumption. Suffix trees have been used
> > > in
>
> > > the
>
> > > past in sliding-window-based compression schemes, which may serve
> > > as
>
> > > a
>
> > > source of inspiration for reducing memory overhead.[3]
>
> > >
>
> > > Nonetheless, the outliner probably shouldn't be run unless it
> > > really
>
> > > has
>
> > > to be run. It will likely be used mostly in embedded spaces,
> > > where
>
> > > the
>
> > > programs have to fit into small devices anyway. Thus, memory
> > > overhead
>
> > > for
>
> > > the compiler likely won't be a problem. The outliner should only
> > > be
>
> > > used
>
> > > in -Oz compilations, and possibly should have its own flag.
>
> > >
>
> > >
>
> > > 2) EXECUTION TIME
>
> > > Currently, the outliner isn't tuned for preventing execution time
>
> > > increases. There is an average of a 2% execution time hit on the
>
> > > tests in
>
> > > the LLVM test suite, with a few outliers of up to 30%. The
> > > outliers
>
> > > are
>
> > > tests which contain hot loops. The outliner really ought to be
> > > able
>
> > > to use
>
> > > profiling information and not outline from hot areas. Another
>
> > > suggestion
>
> > > people have given me is to add a "never outline" directive which
>
> > > would
>
> > > allow the user to say something along the lines of "this is a hot
>
> > > loop,
>
> > > please never outline from it".
>
> > >
>
> > > It's also important to note that these numbers are coming from a
>
> > > fairly
>
> > > recent Intel processor.
>
> > >
>
> > >
>
> > > 3) CONSTRAINTS ON INSTRUCTIONS
>
> > > The outliner currently won't pull anything out of functions which
> > > use
>
> > > a
>
> > > red zone. It also won't pull anything out that uses the stack,
>
> > > instruction
>
> > > pointer, uses constant pool indices, CFI indices, jump table
> > > indices,
>
> > > or
>
> > > frame indices. This removes many opportunities for outlining
> > > which
>
> > > would
>
> > > likely be available at a higher level (such as IR). Thus, there's
> > > a
>
> > > case
>
> > > for moving this up to a higher level.
>
> > >
>
> > >
>
> > > ================================
>
> > > Additional Applications
>
> > > ================================
>
> > > The suffix tree itself could be used as a tool for finding
>
> > > opportunities
>
> > > to refactor code. For example, it could recognize places where
> > > the
>
> > > user
>
> > > likely copied and pasted some code. This could be run on
> > > codebases
> > > to
>
> > > find
>
> > > areas where people could manually outline things at the source
> > > level.
>
> > >
>
> > > Using the terminated string class, it would also be possible to
>
> > > implement
>
> > > other string algorithms on code. This may open the door to new
> > > ways
>
> > > to
>
> > > analyze existing codebases.
>
> > >
>
> > >
>
> > > ================================
>
> > > Roadmap
>
> > > ================================
>
> > > The current outliner is *very* prototypical. The version I would
> > > want
>
> > > to
>
> > > upstream would be a new implementation. Here's what I'd like to
>
> > > address
>
> > > and accomplish.
>
> > >
>
> > > 1. Ask "what does the LLVM community want from an outliner" and
> > > use
>
> > > that
>
> > > to drive development of the algorithm.
>
> > > 2. Reimplement the outliner, perhaps using a less memory-intensve
>
> > > data
>
> > > structure like a suffix array.
>
> > > 3. Begin adding features to the algorithm, for example:
>
> > > i. Teaching the algorithm about hot/cold blocks of code and
>
> > > taking
>
> > > that into account.
>
> > > ii. Simple parameter passing.
>
> > > iii. Similar function outlining-- eg, noticing that two outlining
>
> > > candidates are similar and can be merged into one function with
> > > some
>
> > > control flow.
>
> > >
>
> > >
>
> > > ================================
>
> > > Code
>
> > > ================================
>
> > > Note: This code requires MachineModulePasses
>
> > >
>
> > > * Main pass:
>
> > > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
>
> > >
>
> > > * Suffix tree:
>
> > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
>
> > >
>
> > > * TerminatedString and TerminatedStringList:
>
> > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
>
> > >
>
> > > Here are a couple unit tests for the data structures.
>
> > >
>
> > > * Suffix tree unit tests:
>
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
>
> > >
>
> > > * TerminatedString unit tests:
>
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
>
> > >
>
> > > * TerminatedStringList unit tests:
>
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
>
> > >
>
> > >
>
> > > ================================
>
> > > References
>
> > > ================================
>
> > > [1] Ukkonen's Algorithm:
>
> > > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
>
> > > [2] Suffix Trees and Suffix Arrays:
>
> > > http://web.cs.iastate.edu/~cs548/suffix.pdf
>
> > > [3] Extended Application of Suffix Trees to Data Compression:
>
> > > http://www.larsson.dogma.net/dccpaper.pdf
>
> > >
>
> > >
>
> > > Thanks for reading,
>
> > > Jessica
>
> > >
>
> > > _______________________________________________
>
> > > LLVM Developers mailing list
>
> > > llvm-dev at lists.llvm.org
>
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> > >
>
> > --
>
> > Hal Finkel
>
> > Assistant Computational Scientist
>
> > Leadership Computing Facility
>
> > Argonne National Laboratory
>
> > _______________________________________________
>
> > LLVM Developers mailing list
>
> > llvm-dev at lists.llvm.org
>
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160826/0d4885d5/attachment.html>
More information about the llvm-dev
mailing list