[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/ElfReader.h PerfInst.cpp design.txt
Joel Stanley
jstanley at cs.uiuc.edu
Mon Mar 31 11:39:01 PST 2003
Changes in directory llvm/lib/Reoptimizer/Inst:
ElfReader.h added (r1.1)
PerfInst.cpp updated: 1.2 -> 1.3
design.txt updated: 1.3 -> 1.4
---
Log message:
ElfReader initial checkin; client code invokes iterator method over functions in ELF symtab.
---
Diffs of the changes:
Index: llvm/lib/Reoptimizer/Inst/PerfInst.cpp
diff -u llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.2 llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.3
--- llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.2 Mon Mar 17 18:49:31 2003
+++ llvm/lib/Reoptimizer/Inst/PerfInst.cpp Mon Mar 31 11:48:34 2003
@@ -1,79 +1,40 @@
////////////////
// programmer: Joel Stanley
-// date: Mon Mar 3 13:34:13 CST 2003
+// date: Fri Mar 21 12:32:01 CST 2003
// fileid: PerfInst.cpp
// purpose: Provides code for performing performance instrumentation at
// runtime. The goal of the phase2 function is to implement Phase 2
// of the performance-oriented language extensions transformation. That is,
// it is responsible for replacing loads of particular global volatiles and
// stores of particular temporaries with appropriate calls to instrumentation
-// functions.
+// functions. More detail is given throughout the implementation.
// []
-#include "llvm/Reoptimizer/TraceCache.h"
-#include "llvm/Reoptimizer/VirtualMem.h"
-#include "llvm/Reoptimizer/InstrUtils.h"
-#include "llvm/Reoptimizer/GetTraceTime.h"
-#include "llvm/Reoptimizer/Mapping/LLVMinfo.h"
-#include "llvm/Bytecode/Reader.h"
-#include "llvm/Module.h"
-#include "llvm/iTerminators.h"
-#include "llvm/Support/CFG.h"
-
+#include <stdlib.h>
#include <iostream>
#include <fstream>
+#include <vector>
+
+#include "ElfReader.h"
using std::vector;
using std::cerr;
using std::endl;
-// Not sure if the following externs are required yet.
-extern int llvm_length;
-extern const unsigned char LLVMBytecode[];
-extern void** llvmFunctionTable[];
-
-extern "C" void phase2(int methodNum)
+extern "C" void phase2()
{
- static bool initialized = false;
- static Module* pMod = 0;
- static vector<Function*> funcList;
-
- cerr << "phase2 invoked" << endl;
+ cerr << "============================== Begin Phase 2 ==============================\n";
- if(!initialized) {
- initialized = true;
-
- cerr << "llvm_length is: " << llvm_length << endl;
-
- pMod = ParseBytecodeBuffer(LLVMBytecode, llvm_length);
- assert(pMod && "Couldn't parse Module");
-
- cerr << "Parsed bytecode" << endl;
-
- // Gather pointers to functions into funcList
- for(Module::iterator i = pMod->begin(), e = pMod->end(); i != e; ++i) {
- if(!i->isExternal())
- funcList.push_back(&*i);
- }
- }
+ const char* execName = getexecname();
+ cerr << "Executable name is: " << execName << endl;
- assert(pMod && "Module must have been parsed");
- assert(funcList[methodNum] && "Have not obtained methodNum'th function in funcList");
+ ElfReader elfReader(execName);
- cerr << "Dumping list of instructions in each function..." << endl;
-
- for(vector<Function*>::iterator i = funcList.begin(), e = funcList.end(); i != e; ++i) {
- cerr << "Processing function " << (*i)->getName() << endl;
- for(Function::iterator bbi = (*i)->begin(), bbe = (*i)->end(); bbi != bbe; ++bbi) {
- for(BasicBlock::iterator ii = bbi->begin(), ie = bbi->end(); ii != ie; ++ii) {
- cerr << "Processing instruction: " << *ii << endl;
- vector<uint64_t> vec = getLLVMInstrInfo(&*ii);
- cerr << "Obtained the following vector from getInstrInfo:" << endl;
- for(unsigned k = 0; k < vec.size(); ++k)
- cerr << vec[k] << endl;
- }
- }
+ std::string funcName;
+ ElfReader::AddressRange range;
+ while(elfReader.GetNextFunction(funcName, range)) {
+ cerr << "Function name is: " << funcName << endl;
}
-}
-
+ cerr << "============================== End Phase 2 ==============================\n";
+}
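(ElfReader.h itself is not reproduced in this diff. From the client code in
phase2() above, its interface would look roughly like the following sketch;
everything beyond the two members actually used -- the constructor and
GetNextFunction -- is an assumption, not the checked-in header.)

  // Hypothetical sketch of the interface implied by phase2(); the real
  // ElfReader.h may differ in its details.
  #include <string>
  #include <stdint.h>

  class ElfReader {
  public:
    // Assumed half-open address range [start, end) of a function.
    struct AddressRange {
      uint64_t start;
      uint64_t end;
    };

    // Opens the named executable and prepares to walk its ELF symbol table.
    explicit ElfReader(const char* execFileName);

    // Advances to the next function symbol, filling in its name and address
    // range; returns false once the symbol table is exhausted.
    bool GetNextFunction(std::string& funcName, AddressRange& range);
  };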
Index: llvm/lib/Reoptimizer/Inst/design.txt
diff -u llvm/lib/Reoptimizer/Inst/design.txt:1.3 llvm/lib/Reoptimizer/Inst/design.txt:1.4
--- llvm/lib/Reoptimizer/Inst/design.txt:1.3 Thu Mar 20 08:49:06 2003
+++ llvm/lib/Reoptimizer/Inst/design.txt Mon Mar 31 11:48:34 2003
@@ -455,9 +455,10 @@
{{{ MEETING MINUTES 20 Mar 2003
Agenda:
- - Address pending issues already sent via e-mail.
- - Confidence of approach, assurance of validity w.r.t time commitment.
- - Inlining of functions and how to handle
+
+ Things not sufficiently addressed in meeting or e-mail yet:
+
+ - Inlining of functions and how to handle
- Register schedule violation; or "how do we determine what registers should
   hold values when we insert code?". Rather, should we simply adopt the policy
@@ -467,28 +468,228 @@
analysis, at what point do we consider phase 2 "too expensive" when compared
with plain old opaque function calls at instrumentation points?
- - From the e-mail(s):
-
- (a) We have to balance the benefit of a vendor-independent implementation
- vs. the opportunity to do something "more conceptually novel" with the
- metrics.
-
- (b) We can discuss instrumenting functions at function entry; of course,
- this point is moot if we do not take the binary editing approach.
-
- What is the purpose of "exit stubs" in Trigger/TraceCache? What is the
role of the branch map and call map?
- As long as the new code fits within the 64KB segment, we have the
capability to add new code right?
-Minutes:
+ {{{ Minutes:
+
+For thesis: We want to obtain a good quality (but not necessarily "production
+quality, fully-featured") tool. The design & implementation of such a tool *is*
+sufficient for the thesis, although the thesis would be significantly
+strengthened if we have "more conceptually novel" metrics in place, such as
+pipeline simulation/metrics. Talked about the feasibility of being done
+sometime in June, but potentially extending to July if need be. Work hard,
+Joel.
+
+Suggested experiment to consider the varieties of indirect branch problems that
+we'll encounter: write a large case statement (lots of code placed in each case,
+duplication is okay) and examine closely the indirect branches in the
+compiler-generated code to determine how the table lookups occur (and if the
+jump table is storing absolute or relative addresses).
+
+Vendor independence is a significant conceptual result for the thesis, and it is
+not "all about using volatile". Rather, the key point to make is w.r.t. the
+approach taken: the approach is two-phase, and transformations are done both
+before the black-box compiler ever sees the code, and after it is done operating
+on the code. Also, note that vendor independence and "uninhibiting
+optimizations" _are_ competing goals (if we didn't care about uninhibiting
+optimizations, we could place opaque calls at the instrumentation points and
+look at them post-link, and if we didn't care about vendor independence we could
+do much better than the two-phase approach by using compiler annotations,
+etc)...our approach attempts to find a good compromise between the two goals.
+
+We want to make a "simple" (i.e. works in the common case) vendor-independent
+implementation.
+
+Can we ensure that a function body remains after optimization (i.e. inlining) by
+printing the address of the function?
+
+New multi-phase approach discussed. We are not going to be using the LLVM
+mapping information, and so must rely on ELF mechanisms. We want these ELF
+mechanisms to be used at runtime, which requires mmapp-ing the executable or
+otherwise loading it from disk. We will probably have to create a reverse
+mapping of the symtable to go from address->function name (rather, function
+size) so that we can obtain information about the function being instrumented.
+The current plan is to transform each instrumented function on a demand-driven
+basis, wherein there'd be a call to the "phase2 transformation function" that
+would pass in the address of the enclosing function. [Actually, that approach
+does not work if the function gets inlined elsewhere, because modification of
+the function body's code will not result in the modification of the code at the
+inlined sites -- let's worry about this later].
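A minimal sketch of building that reverse mapping by walking the ELF symbol
table with libelf/gelf follows. Whether ElfReader actually uses libelf, and all
of the names below, are assumptions for illustration only.

  #include <libelf.h>
  #include <gelf.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <map>
  #include <string>
  #include <stdint.h>

  struct FuncInfo { uint64_t size; std::string name; };

  // Walk the symbol table and record each function symbol's start address,
  // size, and name (error handling and cleanup abbreviated).
  void buildAddressMap(const char* path, std::map<uint64_t, FuncInfo>& out) {
    elf_version(EV_CURRENT);
    int fd = open(path, O_RDONLY);
    Elf* elf = elf_begin(fd, ELF_C_READ, 0);
    for (Elf_Scn* scn = elf_nextscn(elf, 0); scn; scn = elf_nextscn(elf, scn)) {
      GElf_Shdr shdr;
      gelf_getshdr(scn, &shdr);
      if (shdr.sh_type != SHT_SYMTAB)
        continue;
      Elf_Data* data = elf_getdata(scn, 0);
      for (size_t i = 0, n = shdr.sh_size / shdr.sh_entsize; i < n; ++i) {
        GElf_Sym sym;
        gelf_getsym(data, i, &sym);
        if (GELF_ST_TYPE(sym.st_info) != STT_FUNC || sym.st_size == 0)
          continue;
        FuncInfo fi;
        fi.size = sym.st_size;
        fi.name = elf_strptr(elf, shdr.sh_link, sym.st_name);
        out[sym.st_value] = fi;      // keyed by function start address
      }
    }
    elf_end(elf);
    close(fd);
  }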
+
+Talked about the need of a padding region for what is essentially a base
+trampoline, placed at the end of the function and within the range of a
+PC-relative jump (64K distance). Within this region, code would need to be
+placed to jump (indirectly) to the copied region of code (i.e., the
+tracecache). Then, the start of the outermost interval is replaced with a
+short-jump (w/ annulling bit set) down into the padding area, which long jumps
+into the copied code region (which can be allocated on the heap now). This is
+all well and good *after* the transformation has been applied, but what about
+invoking the code that performs this transformation at runtime? Vikram's idea
+is that we can, in fact, do the exact same thing for the transformation itself,
+perhaps...more detail follows.
+
+Assume for the moment that we can locate the pad region easily, and that we can
+distinguish two subregions within it by a label or something. (This isn't true,
+but pretend that it is for now).
+
+...
+
+function entry point:
+ ...
+ ld volatile #1 [start of outermost interval] (***)
+ i2
+ ...
+ ld volatile #2 [end of outermost interval]
+ ...
+
+pad_start_1: padinst 1
+ padinst 2
+ padinst 3
+ ...
+
+pad_start_2: padinst k
+ padinst k+1
+ ...
+
+pad_end:
+ ... (stuff can be moved here by compiler)
+
+function end point: ...
+
+Now, _someone_ (who? a pre-pass of sorts?) writes over the start of the
+outermost interval, branching down to the pad_start_1 location. The pad
+instructions in the pad_start_1 region are over-written with a call to the
+"phase 2 transformation function" that performs all of the transformation on
+this particular function, *including* re-writing the branch at location (***)
+with a branch down to the pad_start_2 region, which contains code to perform the
+longjmp to the heap-allocated instrumented code.
+
+This might work; however, what is the benefit of this approach over simply
+having a _call_ (placed at the function entry point above) to the so-called "phase 2
+transformation function"?
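For concreteness, a rough sketch of the rewrite at location (***): overwriting
the load with a SPARC "ba,a" (branch always, annul bit set) down to the pad
region. This is illustration only; the real implementation would presumably go
through the existing VirtualMem facilities and flush the instruction cache
after the write.

  #include <stdint.h>
  #include <assert.h>

  // Build "ba,a <disp22>" (opcode bits 0x30800000) targeting padAddr from
  // instrAddr, and patch it over the existing instruction. Assumes the
  // displacement fits in the signed 22-bit field and that the page is writable.
  void patchBranchToPad(uint64_t instrAddr, uint64_t padAddr) {
    int64_t disp = ((int64_t)padAddr - (int64_t)instrAddr) >> 2;
    assert(disp >= -(1 << 21) && disp < (1 << 21) && "pad region out of reach");
    uint32_t inst = 0x30800000u | ((uint32_t)disp & 0x003FFFFFu);
    *reinterpret_cast<volatile uint32_t*>(instrAddr) = inst;
  }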
+
+ {{{ E-mail sent to adve on the subject
+
+The implementation approach described at the end of the e-mail is the one that
+I'm going to embark on. If you could provide me with your thoughts regarding
+the approach in general, I'd appreciate it, so I don't waste time writing code
+that we might throw away ;).
+
+Here's my assessment of the situation:
+
+Let F be a function that contains instrumentation. If F is inlined, we have the
+following concerns & proposed resolutions:
+
+ a) It is no longer identifiable by name. This has implications for locating
+instrumentation, as we discussed briefly in a previous e-mail.
+
+I *think* that this can be taken care of by (on program startup, one-time-only)
+processing the ELF symtable to construct a set of address ranges, and then
+(possibly at program startup, possibly on a demand-driven basis, depending on
+our implementation approach) locate individual load instructions (using the "magic
+heuristic") and determine their enclosing functions by looking them up in the
+address range map. Can you think of a better approach or other problems that
+this doesn't address?
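For concreteness, the lookup I have in mind is roughly the following sketch
(the map is keyed by function start address; all names are placeholders):

  #include <map>
  #include <string>
  #include <stdint.h>

  struct FuncRange { uint64_t end; std::string name; };   // end = one past last byte
  typedef std::map<uint64_t, FuncRange> RangeMap;          // start address -> range

  // Return the function whose [start, end) range contains pc, or 0 if none.
  const FuncRange* enclosingFunction(const RangeMap& ranges, uint64_t pc) {
    RangeMap::const_iterator i = ranges.upper_bound(pc);
    if (i == ranges.begin()) return 0;     // pc precedes every known function
    --i;                                   // greatest start address <= pc
    return pc < i->second.end ? &i->second : 0;
  }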
+
+ b) The body of F may now be enclosed in an inner loop, meaning that padding
+(i.e. a small for loop) placed at the end of F becomes the innermost loop in the
+calling function; this may prohibit optimizations. Likewise, if we place a
+function call (i.e., placed by phase 1 for doing the transformation at runtime)
+at the start of F (one of two approaches), an inner loop in the calling function
+now contains a (potentially opaque) function call, which may prohibit
+optimization.
+
+I think this is going to have to be caveat user: if you instrument functions
+that are considered lightweight enough to be inlined by the optimizing compiler,
+then you deal with the consequences. Is this unreasonable or too severe? We can
+minimize the optimization prohibited by a function call (to perform the runtime
+transformation) at the entry to F by not placing a function call there and
+instead using the approach you talked about earlier today (branching down to
+the end-of-function base trampoline).
+
+Current implementation sketch:
+
+[This approach focuses on doing the minimal amount of work (which is still a
+*lot*, I think) at program startup (i.e. "phase 2" using the new phase
+designations) and distributing work on demand ("phase 3"). This should reduce the
+startup cost somewhat, but that's really the only reason that I see for doing
+it]
+
+0. Pad the end of each function that contains instrumentation.
+
+At program startup:
+
+1. mmap the executable (or whatever) and construct the address-range to function
+mapping information.
+
+2. For each function, find the load-volatile instructions that define interval
+and point metrics, and the starting locations of the pad region. At the entry to
+the padded region, place a call to the "phase 3 transformation function", and
+over-write the *first* instance of a load-volatile instruction (for either a
+point or an interval) with a direct branch/annulled delay slot to the start of
+the pad region.
+
+Execution continues. For those functions that are actually called, when
+execution reaches the point where instrumentation should be invoked for the
+first time, they get redirected to the base trampoline which calls the phase 3
+transformation function.
+
+The phase 3 transformation function:
+
+Does all the tracecache-like magic, copying the original code to a region of
+memory where the code can grow, rewriting the pad region so that it will execute
+the indirect jump to the new code region.
+
+Details of the last step are left intentionally opaque, if only because I don't
+know exactly what they entail yet. :P
+ }}}
+
+ }}}
}}}
{{{ IMPLEMENTATION SKETCH
+ {{{ Current implementation sketch:
+
+[This approach focuses on doing the minimal amount of work (which is still a
+*lot*, I think) at program startup (i.e. "phase 2" using the new phase
+designations) and distributing work on demand ("phase 3"). This should reduce the
+startup cost somewhat, but that's really the only reason that I see for doing
+it]
+
+0. Pad the end of each function that contains instrumentation.
+
+At program startup:
+
+1. mmap the executable (or whatever) and construct the address-range to function
+mapping information.
+
+2. For each function, find the load-volatile instructions that define interval
+and point metrics, and the starting locations of the pad region. At the entry to
+the padded region, place a call to the "phase 3 transformation function", and
+over-write the *first* instance of a load-volatile instruction (for either a
+point or an interval) with a direct branch/annulled delay slot to the start of
+the pad region.
+
+Execution continues. For those functions that are actually called, when
+execution reaches the point where instrumentation should be invoked for the
+first time, they get redirected to the base trampoline which calls the phase 3
+transformation function.
+
+The phase 3 transformation function:
+
+Does all the tracecache-like magic, copying the original code to a region of
+memory where the code can grow, rewriting the pad region so that it will execute
+the indirect jump to the new code region.
+
+ }}}
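For step 1 of the sketch above, a minimal illustration of mapping the running
executable read-only (the path comes from getexecname() as in PerfInst.cpp;
error handling is abbreviated and the helper name is made up):

  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  // Map the executable image read-only so its ELF structures can be walked in
  // place; the mapping remains valid after the file descriptor is closed.
  const char* mapExecutable(const char* path, size_t& length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat sb;
    if (fstat(fd, &sb) < 0) { close(fd); return 0; }
    length = sb.st_size;
    void* base = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return base == MAP_FAILED ? 0 : static_cast<const char*>(base);
  }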
+ {{{ Older implementation sketches:
At a high level, in broad sweeping strokes, we're going to use the
trace cache tool as a framework for runtime manipulation of the binary
code. That is, the framework provided by the tracecache allows the
@@ -579,7 +780,7 @@
debugging information (take the address of it in a function, load that
address, look for the address, verify?).
-See TODO list below.
+ }}}
}}}
@@ -591,6 +792,73 @@
{{{ TODO
+- Read EEL paper to get a better feel for binary modification issues
+
+- Use the existing mechanisms at your disposal
+ (ELF/tracecache/BinInterface/VirtualMem/etc) to do the following.
+
+ In phase 1:
+
+ Complete the remainder of the phase-1 actions: building the GBT, handling
+ the sigfuns properly (i.e. adding a pair-of-sigfuns mechanism even for
+ point metrics), compare against by-hand example for phase 1 actions, etc.
+
+ At the end of each instrumented function, immutably pad with a large
+ enough pad region. {Propose doing this as a for loop containing immutable
+ loads}
+
+ On program startup ("phase 2" function called from main()):
+
+ [check] mmap or otherwise load the ELF representation of the program and
+ acquire an ELF descriptor (etc) that will be persistent throughout the
+ program's execution.
+
+ Collect address ranges for all functions, so that when a particular
+ load-volatile instruction is encountered, it can be determined what
+ function it ended up being in. I think that these should be the same
+ virtual addresses as seen within the context of the executing code, but
+ this should be verified.
+
+ ^^^ At this point, the application should be running and, at RUNTIME, spit
+ out (at the very least) the function boundary addresses; preferably, it
+ can spit out the BinInterface-obtained disassembly as well so that we can
+ compare it against the static disassembly.
+
+ For each function, locate the load-volatile instructions that define
+ interval and point metrics (potentially recording some information about
+ them for later use); also find the padding region at the end of the
+ function (this may be hard). Write code into the padding region to call
+ the "phase 3 transformation function", and over-write the *first*
+ load-volatile in the function that corresponds to an instrumentation point
+ (or interval start point) with a direct branch down to the padded region.
+
+ Vikram's comment on this last step:
+
+ [Finding "the first" load-volatile in the function is not easy because of
+ control-flow. Furthermore, I don't think Step 2 needs to find
+ load-volatiles for actual instrumentations at all since many functions may
+ never be executed. We should leave that to step 3.
+
+ Therefore, I would simplify as follows:
+
+ For each function, find the load-volatile instructions that define the
+ entry of the padded region. Over-write the first instruction of the
+ function with a direct branch to a trampoline in the padded region. This
+ trampoline executes the first instruction and then calls the Phase 3
+ routine to instrument the function.]
+
+ On phase 3 transformation function invocation:
+
+    Performs all of the tracecache-like magic, copying the original code to a
+ region of memory where the code can grow, rewriting the pad region so that
+ it will execute the indirect jump to the new code region, etc. The
+ majority of the actions required here are still fairly unclear. To
+ accomplish this step, we must first determine how to make the branch- and
+ call-maps that the TraceCache addTrace() routine(s) require, and how to
+ otherwise use the existing tracecache stuff to accomplish what we want.
+}}}
+
+{{{ COMPLETED TODO ITEMS
- Answer the following questions about the tracecache:
{{{
@@ -618,41 +886,7 @@
because all of the contextual information about a particular function is
obtained via the LLVM mapping information.
- - Perform the following experiement to help answer these questions:
-
- Use the tracecache/BinInterface/VirtualMem/etc mechanisms as they
- currently exist, together with te ELF library and phase 1, to do the
- following:
-
- Insert a call to our phase2 function in main; the phase2 function will
- be responsible for doing all of the binary analysis and
- transformations.
-
- For using ELF mechanisms that we need to use, determine how the
- tracecache is currently (if it is) mmap'ing the executable, and how to
- direct the ELF library to use the executable image in memory instead
- of loading it from disk.
-
- Given the name of a function that exists in the ELF object file,
- obtain its starting and ending address _in the address space of the
- running application_.
-
- ^^^ At this point, the application should be running and, at RUNTIME,
- spit out (at the very least) the function boundary addresses;
- preferably, it can spit out the BinInterface-obtained disassembly as
- well so that we can compare it against the static disassembly.
-
- Copy this address region to the cache and reroute execution,
- preferably modifying some code in the cache so that the rerouted
- execution is apparent during execution. [This step is really the key
- investigatory point: do we need to access the LLVM-bytecode CFG to do
- this? Does the copy mechanism only support a copy of a specified path
- into the cache, or will it operate on an arbitrary CFG/CFG subgraph?]
-
}}}
-
-- Read EEL paper to get a better feel for binary modification issues
-
}}}
{{{ BY-HAND EXAMPLE OF PHASE ACTIONS