[llvm-dev] [RFC] Interprocedural MIR-level outlining pass

Fri Aug 26 15:01:00 PDT 2016

----- Original Message -----

> From: "Kevin Choi" <code.kchoi at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 26, 2016 4:55:29 PM
> Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
> pass

> I think the "Motivation" section explained that.
I don't think it explained it. 

> I too first thought about "why not at IR?" but the reason looks like
> MIR, post-RA has the most accurate heuristics (best way to know
> looks like actually getting there).
But also, potentially, the fewest opportunities. That's why I'm curious about the motivation - the trade offs are not obvious to me. 

-Hal 

> Do you know if there is any experimental pass that relies on deriving
> heuristics by a feedback loop after letting, ie. a duplicate
> module/function/block continue past?

> Regards,

> Kevin

> On 26 August 2016 at 14:36, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org > wrote:

> > Hi Jessica,
> 

> > This is quite interesting.
> 

> > Can you comment on why you started by doing this at the MI level,
> > as
> > opposed to the IR level. And at the MI level, why after RA instead
> > of before RA?
> 

> > Perhaps the first question is another way of asking about how
> > accurately we can model call-site code size at the IR level?
> 

> > Thanks in advance,
> 
> > Hal
> 

> > ----- Original Message -----
> 

> > > From: "Jessica Paquette via llvm-dev" < llvm-dev at lists.llvm.org >
> 
> > > To: llvm-dev at lists.llvm.org
> 
> > > Sent: Friday, August 26, 2016 4:26:09 PM
> 
> > > Subject: [llvm-dev] [RFC] Interprocedural MIR-level outlining
> > > pass
> 
> > >
> 
> > > Hi everyone,
> 
> > >
> 
> > > Since I haven't said anything on the mailing list before, a quick
> 
> > > introduction. I'm an intern at Apple, and over the summer I
> 
> > > implemented a
> 
> > > prototype for an outlining pass for code size in LLVM. Now I'm
> 
> > > seeking to
> 
> > > eventually upstream it. I have the preliminary code on GitHub
> > > right
> 
> > > now,
> 
> > > but it's still very prototypical (see the code section).
> 
> > >
> 
> > > ================================
> 
> > > Motivation
> 
> > > ================================
> 
> > > The goal of the internship was to create an interprocedural pass
> > > that
> 
> > > would reduce code size as much as possible, perhaps at the cost
> > > of
> 
> > > some
> 
> > > performance. This would be useful to, say, embedded programmers
> > > who
> 
> > > only
> 
> > > have a few kilobytes to work with and a substantial amount of
> > > code
> > > to
> 
> > > fit
> 
> > > in that space.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Approach and Initial Results
> 
> > > ================================
> 
> > > To do this, we chose to implement an outliner. Outliners find
> 
> > > sequences of
> 
> > > instructions which would be better off as a function call, by
> > > some
> 
> > > measure
> 
> > > of "better". In this case, the measure of "better" is "makes code
> 
> > > smaller".
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Results
> 
> > > ================================
> 
> > > These results are from a fairly recent 64-bit Intel processor,
> > > using
> 
> > > a
> 
> > > version of Clang equipped with the outliner prototype versus an
> 
> > > equivalent
> 
> > > version of Clang without the outliner.
> 
> > >
> 
> > > CODE SIZE REDUCTION
> 
> > > For tests >=4 Kb in non-outlined size, the outliner currently
> 
> > > provides an
> 
> > > average of 12.94% code size reduction on the LLVM test suite in
> 
> > > comparison
> 
> > > to a default Clang, up to 51.4% code size reduction. In
> > > comparison
> > > to
> 
> > > a
> 
> > > Clang with -Oz, the outliner provides an average of a 1.29% code
> > > size
> 
> > > reduction, up to a 37% code size reduction. I believe that the
> > > -Oz
> 
> > > numbers
> 
> > > can be further improved by better tuning the outlining cost
> > > model.
> 
> > >
> 
> > > EXECUTION TIME IMPACT
> 
> > > On average, the outliner increases execution time by 2% on the
> > > LLVM
> 
> > > test
> 
> > > suite, but has been also shown to improve exection time by up to
> > > 16%.
> 
> > > These results were from a fairly recent Intel processor, so the
> 
> > > results
> 
> > > may vary. Recent Intel processors have very low latency for
> > > function
> 
> > > calls, which may impact these results. Execution time
> > > improvements
> 
> > > are
> 
> > > likely dependent on the latency of function calls, instruction
> 
> > > caching
> 
> > > behaviour, and the execution frequency of the code being
> > > outlined.
> > > In
> 
> > > partucular, using a processor with heavy function call latency
> > > will
> 
> > > likely
> 
> > > increase execution time overhead.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Implementation
> 
> > > ================================
> 
> > > The outliner, in its current state, is a MachineModulePass. It
> > > finds
> 
> > > *identical* sequences of MIR, after register allocation, and
> > > pulls
> 
> > > them
> 
> > > out into their own functions. Thus, it's effectively
> > > assembly-level.
> 
> > > Ultimately, the algorithm used is general, so it can sit
> > > anywhere,
> 
> > > but MIR
> 
> > > was very convenient for the time being.
> 
> > >
> 
> > > It requires two data structures.
> 
> > >
> 
> > > 1. A generalized suffix tree
> 
> > > 2. A "terminated string"
> 
> > >
> 
> > > 1: The generalized suffix tree is constructed using Ukkonen's
> > > linear
> 
> > > time
> 
> > > construction algorithm [1]. They require linear space and support
> 
> > > linear-time substring queries. In practice, the construction time
> > > for
> 
> > > the
> 
> > > suffix tree is the most time consuming part, but I haven't
> > > noticed
> > > a
> 
> > > difference in compilation time on, say, 12 MB .ll files.
> 
> > >
> 
> > > 2: To support the suffix tree, we need a "terminated string."
> > > This
> > > is
> 
> > > a
> 
> > > generalized string with an unique terminator character appended
> > > to
> 
> > > the
> 
> > > end. TerminatedStrings can be built from any type.
> 
> > >
> 
> > > The algorithm is then roughly as follows.
> 
> > >
> 
> > > 1. For each MachineBasicBlock in the program, build a
> 
> > > TerminatedString for
> 
> > > that block.
> 
> > > 2. Build a suffix tree for that collection of strings.
> 
> > > 3. Query the suffix tree for the longest repeated substring and
> > > place
> 
> > > that
> 
> > > string in a candidate list. Repeat until none are found.
> 
> > > 4. Create functions for each candidate.
> 
> > > 5. Replace each candidate with a call to its respective function.
> 
> > >
> 
> > > Currently, the program itself isn't stored in the suffix tree,
> > > but
> 
> > > rather
> 
> > > a "proxy string" of integers. This isn't necessary at the MIR
> > > level,
> 
> > > but
> 
> > > may be for an IR level extension of the algorithm.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Challenges
> 
> > > ================================
> 
> > > 1) MEMORY CONSUMPTION
> 
> > > Given a string of length n, a naive suffix tree implementation
> > > can
> 
> > > take up
> 
> > > to 40n bytes of memory. However, this number can be reduced to
> > > 20n
> 
> > > with a
> 
> > > bit of work [2]. Currently, the suffix tree stores the entire
> 
> > > program,
> 
> > > including instructions which really ought not to be outlined,
> > > such
> > > as
> 
> > > returns. These instructions should not be included in the final
> 
> > > implementation, but should rather act as terminators for the
> > > strings.
> 
> > > This
> 
> > > will likely curb memory consumption. Suffix trees have been used
> > > in
> 
> > > the
> 
> > > past in sliding-window-based compression schemes, which may serve
> > > as
> 
> > > a
> 
> > > source of inspiration for reducing memory overhead.[3]
> 
> > >
> 
> > > Nonetheless, the outliner probably shouldn't be run unless it
> > > really
> 
> > > has
> 
> > > to be run. It will likely be used mostly in embedded spaces,
> > > where
> 
> > > the
> 
> > > programs have to fit into small devices anyway. Thus, memory
> > > overhead
> 
> > > for
> 
> > > the compiler likely won't be a problem. The outliner should only
> > > be
> 
> > > used
> 
> > > in -Oz compilations, and possibly should have its own flag.
> 
> > >
> 
> > >
> 
> > > 2) EXECUTION TIME
> 
> > > Currently, the outliner isn't tuned for preventing execution time
> 
> > > increases. There is an average of a 2% execution time hit on the
> 
> > > tests in
> 
> > > the LLVM test suite, with a few outliers of up to 30%. The
> > > outliers
> 
> > > are
> 
> > > tests which contain hot loops. The outliner really ought to be
> > > able
> 
> > > to use
> 
> > > profiling information and not outline from hot areas. Another
> 
> > > suggestion
> 
> > > people have given me is to add a "never outline" directive which
> 
> > > would
> 
> > > allow the user to say something along the lines of "this is a hot
> 
> > > loop,
> 
> > > please never outline from it".
> 
> > >
> 
> > > It's also important to note that these numbers are coming from a
> 
> > > fairly
> 
> > > recent Intel processor.
> 
> > >
> 
> > >
> 
> > > 3) CONSTRAINTS ON INSTRUCTIONS
> 
> > > The outliner currently won't pull anything out of functions which
> > > use
> 
> > > a
> 
> > > red zone. It also won't pull anything out that uses the stack,
> 
> > > instruction
> 
> > > pointer, uses constant pool indices, CFI indices, jump table
> > > indices,
> 
> > > or
> 
> > > frame indices. This removes many opportunities for outlining
> > > which
> 
> > > would
> 
> > > likely be available at a higher level (such as IR). Thus, there's
> > > a
> 
> > > case
> 
> > > for moving this up to a higher level.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Additional Applications
> 
> > > ================================
> 
> > > The suffix tree itself could be used as a tool for finding
> 
> > > opportunities
> 
> > > to refactor code. For example, it could recognize places where
> > > the
> 
> > > user
> 
> > > likely copied and pasted some code. This could be run on
> > > codebases
> > > to
> 
> > > find
> 
> > > areas where people could manually outline things at the source
> > > level.
> 
> > >
> 
> > > Using the terminated string class, it would also be possible to
> 
> > > implement
> 
> > > other string algorithms on code. This may open the door to new
> > > ways
> 
> > > to
> 
> > > analyze existing codebases.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Roadmap
> 
> > > ================================
> 
> > > The current outliner is *very* prototypical. The version I would
> > > want
> 
> > > to
> 
> > > upstream would be a new implementation. Here's what I'd like to
> 
> > > address
> 
> > > and accomplish.
> 
> > >
> 
> > > 1. Ask "what does the LLVM community want from an outliner" and
> > > use
> 
> > > that
> 
> > > to drive development of the algorithm.
> 
> > > 2. Reimplement the outliner, perhaps using a less memory-intensve
> 
> > > data
> 
> > > structure like a suffix array.
> 
> > > 3. Begin adding features to the algorithm, for example:
> 
> > > i. Teaching the algorithm about hot/cold blocks of code and
> 
> > > taking
> 
> > > that into account.
> 
> > > ii. Simple parameter passing.
> 
> > > iii. Similar function outlining-- eg, noticing that two outlining
> 
> > > candidates are similar and can be merged into one function with
> > > some
> 
> > > control flow.
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > Code
> 
> > > ================================
> 
> > > Note: This code requires MachineModulePasses
> 
> > >
> 
> > > * Main pass:
> 
> > > https://github.com/ornata/llvm/blob/master/lib/CodeGen/MachineOutliner.h
> 
> > >
> 
> > > * Suffix tree:
> 
> > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/SuffixTree.h
> 
> > >
> 
> > > * TerminatedString and TerminatedStringList:
> 
> > > https://github.com/ornata/llvm/blob/master/include/llvm/ADT/TerminatedString.h
> 
> > >
> 
> > > Here are a couple unit tests for the data structures.
> 
> > >
> 
> > > * Suffix tree unit tests:
> 
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/SuffixTreeTest.cpp
> 
> > >
> 
> > > * TerminatedString unit tests:
> 
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringTest.cpp
> 
> > >
> 
> > > * TerminatedStringList unit tests:
> 
> > > https://github.com/ornata/llvm/blob/master/unittests/ADT/TerminatedStringListTest.cpp
> 
> > >
> 
> > >
> 
> > > ================================
> 
> > > References
> 
> > > ================================
> 
> > > [1] Ukkonen's Algorithm:
> 
> > > https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf
> 
> > > [2] Suffix Trees and Suffix Arrays:
> 
> > > http://web.cs.iastate.edu/~cs548/suffix.pdf
> 
> > > [3] Extended Application of Suffix Trees to Data Compression:
> 
> > > http://www.larsson.dogma.net/dccpaper.pdf
> 
> > >
> 
> > >
> 
> > > Thanks for reading,
> 
> > > Jessica
> 
> > >
> 
> > > _______________________________________________
> 
> > > LLVM Developers mailing list
> 
> > > llvm-dev at lists.llvm.org
> 
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> > >
> 

> > --
> 
> > Hal Finkel
> 
> > Assistant Computational Scientist
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 

> > _______________________________________________
> 
> > LLVM Developers mailing list
> 
> > llvm-dev at lists.llvm.org
> 
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160826/0d4885d5/attachment.html>