[llvm-commits] CVS: llvm/www/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt 2003-06-26-Reoptimizer2.txt

Brian Gaeke gaeke at cs.uiuc.edu
Thu Jun 26 15:38:01 PDT 2003


Changes in directory llvm/www/docs/HistoricalNotes:

2003-06-25-Reoptimizer1.txt added (r1.1)
2003-06-26-Reoptimizer2.txt added (r1.1)

---
Log message:

Here are the notes from our Reoptimizer meetings.


---
Diffs of the changes:

Index: llvm/www/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt
diff -c /dev/null llvm/www/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt:1.1
*** /dev/null	Thu Jun 26 15:37:53 2003
--- llvm/www/docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt	Thu Jun 26 15:37:42 2003
***************
*** 0 ****
--- 1,137 ----
+ Wed Jun 25 15:13:51 CDT 2003
+ 
+ First-level instrumentation
+ ---------------------------
+ 
+ We use opt to do bytecode-to-bytecode instrumentation. We look at
+ back-edges and insert a call to llvm_first_trigger(), which takes no
+ arguments and returns no value. This instrumentation is designed to
+ be easy to remove, for instance by writing a NOP over the function
+ call instruction.
+ 
+ Keep count of every call to llvm_first_trigger(), and maintain
+ counters in a map indexed by return address. If the trigger count
+ exceeds a threshold, we identify a hot loop and perform second-level
+ instrumentation on the hot loop region (the instructions between the
+ target of the back-edge and the branch that causes the back-edge).  We
+ do not move code across basic-block boundaries.
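+ 
+ A minimal sketch of the trigger-side bookkeeping described above,
+ assuming a hypothetical threshold constant and secondLevelInstrument()
+ hook (neither name comes from the Reoptimizer sources):
+ 
+ ------------------------------------------------------------------------------
+ #include <stdint.h>
+ #include <map>
+ 
+ static const unsigned TriggerThreshold = 50;       // hypothetical cutoff
+ static std::map<uint64_t, unsigned> TriggerCounts; // keyed by return address
+ 
+ // Hypothetical hook: overwrite the CALL with a NOP and start
+ // second-level instrumentation of the hot loop region.
+ static void secondLevelInstrument(uint64_t /*RetAddr*/) {}
+ 
+ // Called from the instrumented code at every hot back-edge.
+ extern "C" void llvm_first_trigger() {
+   uint64_t RetAddr = (uint64_t)__builtin_return_address(0);
+   if (++TriggerCounts[RetAddr] == TriggerThreshold)
+     secondLevelInstrument(RetAddr);
+ }
+ ------------------------------------------------------------------------------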
+ 
+ 
+ Second-level instrumentation
+ ----------------------------
+ 
+ We remove the first-level instrumentation by overwriting the CALL to
+ llvm_first_trigger() with a NOP.
+ 
+ The reoptimizer maintains a map between machine-code basic blocks and
+ LLVM BasicBlock*s.  We only keep track of paths that start at the
+ first machine-code basic block of the hot loop region.
+ 
+ How do we keep track of which edges to instrument, and which edges
+ are exits from the hot region? We use a 3-step process:
+ 
+ 1) Do a DFS from the first machine-code basic block of the hot loop
+ region and mark reachable edges.
+ 
+ 2) Do a DFS from the last machine-code basic block of the hot loop
+ region IGNORING back edges, and mark the edges which are reachable in
+ 1) and also in 2) (i.e., must be reachable from both the start BB and
+ the end BB of the hot region).
+ 
+ 3) Mark BBs which end in edges that exit the hot region; we need to
+ instrument these differently.
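+ 
+ A sketch of the edge-marking computation, reading the three steps
+ literally on a generic adjacency-list CFG (the types and the
+ "already-visited means back edge" approximation are illustrative, not
+ from the Reoptimizer):
+ 
+ ------------------------------------------------------------------------------
+ #include <set>
+ #include <utility>
+ #include <vector>
+ 
+ typedef std::pair<int, int> Edge;            // (from BB, to BB)
+ typedef std::vector<std::vector<int> > CFG;  // successor lists per BB
+ 
+ // DFS from Start, collecting every edge reached.  When SkipBackEdges
+ // is set (step 2), an edge into an already-visited block is ignored.
+ static void dfs(const CFG &G, int Start, bool SkipBackEdges,
+                 std::set<Edge> &Seen, std::vector<bool> &Visited) {
+   Visited[Start] = true;
+   for (size_t i = 0; i < G[Start].size(); ++i) {
+     int To = G[Start][i];
+     if (SkipBackEdges && Visited[To])
+       continue;
+     Seen.insert(Edge(Start, To));
+     if (!Visited[To])
+       dfs(G, To, SkipBackEdges, Seen, Visited);
+   }
+ }
+ 
+ // Edges to instrument: reachable from the region's first BB (step 1)
+ // and also found by the DFS from its last BB (step 2).
+ std::set<Edge> edgesToInstrument(const CFG &G, int FirstBB, int LastBB) {
+   std::set<Edge> FromFirst, FromLast, Both;
+   std::vector<bool> V1(G.size(), false), V2(G.size(), false);
+   dfs(G, FirstBB, false, FromFirst, V1);
+   dfs(G, LastBB, true, FromLast, V2);
+   for (std::set<Edge>::iterator I = FromFirst.begin(), E = FromFirst.end();
+        I != E; ++I)
+     if (FromLast.count(*I))
+       Both.insert(*I);
+   return Both;
+ }
+ ------------------------------------------------------------------------------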
+ 
+ Assume that there is 1 free register. On SPARC we use %g1, which LLC
+ has agreed not to use.  We shift a 1 into it at the beginning. At
+ every edge which corresponds to a conditional branch, we shift a 0
+ into the register for not-taken and a 1 for taken. This uniquely
+ numbers the paths through the hot region. We silently fail if we need
+ more than 64 bits.
+ 
+ At the end BB we call countPath and increment the counter based on %g1
+ and the return address of the countPath call.  We keep track of the
+ number of iterations and the number of paths.  We only run this
+ version 30 or 40 times.
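+ 
+ In C++ terms the path numbering and countPath() bookkeeping might look
+ like the following sketch (recordBranch() stands in for the shift
+ instructions inserted on each conditional edge; all names are
+ illustrative):
+ 
+ ------------------------------------------------------------------------------
+ #include <stdint.h>
+ #include <map>
+ #include <utility>
+ 
+ // One counter per (countPath return address, path number in %g1).
+ typedef std::pair<uint64_t, uint64_t> PathKey;
+ static std::map<PathKey, uint64_t> PathCounts;
+ static uint64_t Iterations = 0;
+ 
+ // The register starts at 1; every conditional branch shifts in a 1
+ // (taken) or a 0 (not taken).  Shifting past 64 bits silently drops
+ // the oldest branch, mirroring the "silently fail" behavior above.
+ static uint64_t recordBranch(uint64_t PathReg, bool Taken) {
+   return (PathReg << 1) | (Taken ? 1 : 0);
+ }
+ 
+ // Called at the end BB of the hot region with the path register.
+ extern "C" void countPath(uint64_t PathReg) {
+   uint64_t RetAddr = (uint64_t)__builtin_return_address(0);
+   ++PathCounts[PathKey(RetAddr, PathReg)];
+   ++Iterations;
+ }
+ ------------------------------------------------------------------------------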
+ 
+ Find the BBs that total 90% or more of execution, and aggregate them
+ together to form our trace. But we do not allow more than 5 paths; if
+ we have more than 5 we take the ones that are executed the most.  We
+ verify our assumption that we picked a hot back-edge in first-level
+ instrumentation, by making sure that the number of times we took an
+ exit edge from the hot trace is less than 10% of the number of
+ iterations.
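+ 
+ One reading of that selection rule as code (the 90% coverage target
+ and the cap of 5 paths come from the notes; everything else is
+ illustrative):
+ 
+ ------------------------------------------------------------------------------
+ #include <stdint.h>
+ #include <algorithm>
+ #include <utility>
+ #include <vector>
+ 
+ typedef std::pair<uint64_t, uint64_t> PathCount;  // (path number, count)
+ 
+ static bool hotter(const PathCount &A, const PathCount &B) {
+   return A.second > B.second;
+ }
+ 
+ // Keep the most-executed paths until they cover 90% of all observed
+ // iterations, but never keep more than 5 paths.
+ std::vector<uint64_t> selectTracePaths(std::vector<PathCount> Counts,
+                                        uint64_t Iterations) {
+   std::sort(Counts.begin(), Counts.end(), hotter);
+   std::vector<uint64_t> Picked;
+   uint64_t Covered = 0;
+   for (size_t i = 0; i != Counts.size() && Picked.size() < 5; ++i) {
+     Picked.push_back(Counts[i].first);
+     Covered += Counts[i].second;
+     if (Covered * 10 >= Iterations * 9)   // reached 90% coverage
+       break;
+   }
+   return Picked;
+ }
+ ------------------------------------------------------------------------------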
+ 
+ LLC has been taught to recognize llvm_first_trigger() calls and NOT
+ generate saves and restores of caller-saved registers around these
+ calls.
+ 
+ 
+ Phase behavior
+ --------------
+ 
+ We turn off llvm_first_trigger() calls by overwriting them with NOPs,
+ but this would hide phase behavior from us (when some functions/traces
+ stop being hot and others become hot.)
+ 
+ We have a SIGALRM timer that counts time for us. Every time we get a
+ SIGALRM we look at our priority queue of locations where we have
+ removed llvm_first_trigger() calls. Each location is inserted along
+ with a time when we will next turn instrumentation back on for that
+ call site. If the time has arrived for a particular call site, we pop
+ that off the prio. queue and turn instrumentation back on for that
+ call site.
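+ 
+ A sketch of that re-enable queue, assuming POSIX signal()/alarm() and
+ a std::priority_queue ordered by wake-up time (restoreTrigger(), which
+ would rewrite the NOP back into a CALL, is a hypothetical stub):
+ 
+ ------------------------------------------------------------------------------
+ #include <csignal>
+ #include <ctime>
+ #include <functional>
+ #include <queue>
+ #include <utility>
+ #include <vector>
+ #include <unistd.h>
+ 
+ // (time to re-enable, address of the NOP'd llvm_first_trigger() call),
+ // ordered so that the earliest deadline is on top of the queue.
+ typedef std::pair<time_t, void *> Deadline;
+ static std::priority_queue<Deadline, std::vector<Deadline>,
+                            std::greater<Deadline> > ReenableQueue;
+ 
+ // Hypothetical: rewrite the NOP at CallSite back into the CALL.
+ static void restoreTrigger(void * /*CallSite*/) {}
+ 
+ static void onAlarm(int) {
+   time_t Now = time(0);
+   while (!ReenableQueue.empty() && ReenableQueue.top().first <= Now) {
+     restoreTrigger(ReenableQueue.top().second);
+     ReenableQueue.pop();
+   }
+   alarm(1);                 // re-arm the one-second timer
+ }
+ 
+ void installPhaseTimer() {
+   signal(SIGALRM, onAlarm);
+   alarm(1);                 // first SIGALRM in one second
+ }
+ ------------------------------------------------------------------------------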
+ 
+ 
+ Generating traces
+ -----------------
+ 
+ When we finally generate an optimized trace we first copy the code
+ into the trace cache. This leaves us with 3 copies of the code: the
+ original code, the instrumented code, and the optimized trace. The
+ optimized trace does not have instrumentation. The original code and
+ the instrumented code are modified to have a branch to the trace
+ cache, where the optimized traces are kept.
+ 
+ We copy the code from the original to the instrumented version by
+ tracing the LLVM-to-machine-code basic block map and then copying
+ each machine code basic block we think is in the hot region into the
+ trace cache. Then we instrument that code. The process is similar for
+ generating the final optimized trace; we copy the same basic blocks
+ because we might need to put in fixup code for exit BBs.
+ 
+ LLVM basic blocks are not typically used in the Reoptimizer except
+ for the mapping information.
+ 
+ We are restricted to using single instructions to branch between the
+ original code, trace, and instrumented code. So we have to keep the
+ code copies in memory near the original code (they can't be so far
+ away that a single pc-relative branch cannot reach them.) Malloc()'d
+ memory or data-region space is too far away. This impacts the design
+ of the trace cache.
+ 
+ We use a dummy function that is full of a bunch of for loops which we
+ overwrite with trace-cache code. The trace manager keeps track of
+ whether or not we have enough space in the trace cache, etc.
+ 
+ The trace insertion routine takes an original start address, a vector
+ of machine instructions representing the trace, an index of branches
+ with their corresponding absolute targets, and an index of calls with
+ their corresponding absolute targets.
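+ 
+ Read literally, the interface has roughly this shape (a hypothetical
+ declaration; the routine's real name and types are not given in these
+ notes):
+ 
+ ------------------------------------------------------------------------------
+ #include <stdint.h>
+ #include <vector>
+ 
+ // One fixup record: which instruction in the trace is a branch (or a
+ // call), and the absolute address it must reach.
+ struct Fixup {
+   unsigned InstrIndex;      // index into the Trace vector
+   uint64_t AbsoluteTarget;  // absolute destination address
+ };
+ 
+ // Copy Trace into the trace cache and patch up its branches and calls.
+ void insertTrace(uint64_t OriginalStartAddr,
+                  const std::vector<uint32_t> &Trace,   // machine instrs
+                  const std::vector<Fixup> &Branches,
+                  const std::vector<Fixup> &Calls);
+ ------------------------------------------------------------------------------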
+ 
+ The trace insertion routine is responsible for inserting branches from
+ the beginning of the original code to the beginning of the optimized
+ trace. This is because at some point the trace cache may run out of
+ space and it may have to evict a trace, at which point the branch to
+ the trace would also have to be removed. It uses a round-robin
+ replacement policy; we have found that this is almost as good as LRU
+ and better than random (especially because of problems fitting the new
+ trace in.)
+ 
+ We cannot deal with discontiguous trace cache areas.  The trace cache
+ is supposed to be cache-line-aligned, but it is not page-aligned.
+ 
+ We generate instrumentation traces and optimized traces into separate
+ trace caches. We keep the instrumented code around because you don't
+ want to delete a trace when you still might have to return to it
+ (i.e., return from an llvm_first_trigger() or countPath() call.)
+ 
+ 


Index: llvm/www/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt
diff -c /dev/null llvm/www/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt:1.1
*** /dev/null	Thu Jun 26 15:37:53 2003
--- llvm/www/docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt	Thu Jun 26 15:37:42 2003
***************
*** 0 ****
--- 1,110 ----
+ Thu Jun 26 14:43:04 CDT 2003
+ 
+ Information about BinInterface
+ ------------------------------
+ 
+ BinInterface takes in a set of instructions with some particular
+ register allocation. It allows you to add, modify, or delete
+ instructions, in SSA form (kind of like LLVM's MachineInstrs), and
+ then re-allocates registers. It assumes that the transformations you
+ are doing are safe.  It does not update the mapping information or
+ the LLVM representation for the modified trace (so it would not, for
+ instance, support multiple optimization passes; passes would have to
+ be aware of the mapping information and update it manually.)
+ 
+ The way you use it is: you provide the original code to BinInterface,
+ perform your optimizations on it, and then put the result in the
+ trace cache.
+ 
+ The BinInterface tries to find live-outs for traces so that it can do
+ register allocation on just the trace, and stitch the trace back into
+ the original code. It has to preserve the live-ins and live-outs when
+ it does its register allocation.  (On exits from the trace we have
+ epilogues that copy live-outs back into the right registers, but
+ live-ins have to be in the right registers.)
+ 
+ 
+ Limitations of BinInterface
+ ---------------------------
+ 
+ It does copy insertions for PHIs, which it infers from the machine
+ code. The mapping info inserted by LLC is not sufficient to determine
+ the PHIs.
+ 
+ It does not handle integer or floating-point condition codes and it
+ does not handle floating-point register allocation.
+ 
+ It is not able to use lots of registers aggressively.
+ 
+ There is a problem with alloca: we cannot find our spill space for
+ spilling registers, normally allocated on the stack, if the trace
+ follows an alloca(). An acceptable solution might be to disable trace
+ generation on functions that have variable-sized alloca()s.
+ Variable-sized allocas in the trace would also probably screw things
+ up.
+ 
+ Because of the FP and alloca limitations, the BinInterface is
+ completely disabled right now.
+ 
+ 
+ Demo
+ ----
+ 
+ This is a demo of the Ball & Larus version that does NOT use 2-level
+ profiling.
+ 
+ 1. Compile program with llvm-gcc.
+ 2. Run opt -lowerswitch -paths -emitfuncs on the bytecode.
+    -lowerswitch changes switch statements to branches
+    -paths       runs the Ball & Larus path-profiling algorithm
+    -emitfuncs   emits the table of functions
+ 3. Run llc to generate SPARC assembly code for the result of step 2.
+ 4. Use g++ to link the (instrumented) assembly code.
+ 
+ We use a script to do all this:
+ ------------------------------------------------------------------------------
+ #!/bin/sh
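+ # $1 is the program's base name: compile $1.c, instrument the bytecode
+ # with the path profiler, compile to SPARC with llc, then link against
+ # the Reoptimizer runtime libraries (steps 1-4 above).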
+ llvm-gcc $1.c -o $1
+ opt -lowerswitch -paths -emitfuncs $1.bc > $1.run.bc
+ llc -f $1.run.bc 
+ LIBS=$HOME/llvm_sparc/lib/Debug
+ GXX=/usr/dcs/software/evaluation/bin/g++
+ $GXX -g -L $LIBS $1.run.s -o $1.run.llc \
+ $LIBS/tracecache.o \
+ $LIBS/mapinfo.o \
+ $LIBS/trigger.o \
+ $LIBS/profpaths.o \
+ $LIBS/bininterface.o \
+ $LIBS/support.o \
+ $LIBS/vmcore.o \
+ $LIBS/transformutils.o \
+ $LIBS/bcreader.o \
+ -lscalaropts -lscalaropts -lanalysis \
+ -lmalloc -lcpc -lm -ldl
+ ------------------------------------------------------------------------------
+ 
+ 5. Run the resulting binary.  You will see output from BinInterface
+ (described below) intermixed with the output from the program.
+ 
+ 
+ Output from BinInterface
+ ------------------------
+ 
+ BinInterface's debugging code prints out the following, in order:
+ 
+ 1. Initial code provided to BinInterface with original register
+ allocation.
+ 
+ 2. Section 0 is the trace prolog, consisting mainly of live-ins and
+ register saves which will be restored in epilogs.
+ 
+ 3. Section 1 is the trace itself, in the SSA form used by
+ BinInterface, along with the PHIs that are inserted.
+ PHIs are followed by the copies that implement them.
+ Each branch (i.e., out of the trace) is annotated with the
+ section number that represents the epilog it branches to.
+ 
+ 4. All the other sections starting with Section 2 are trace epilogs.
+ Every branch from the trace has to go to some epilog.
+ 
+ 5. After the last section is the register allocation output.




