[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/ElfReader.h PerfInst.cpp design.txt
Joel Stanley
jstanley at cs.uiuc.edu
Mon Mar 31 11:39:01 PST 2003
Changes in directory llvm/lib/Reoptimizer/Inst:
ElfReader.h added (r1.1)
PerfInst.cpp updated: 1.2 -> 1.3
design.txt updated: 1.3 -> 1.4
---
Log message:
ElfReader initial checkin; client code invokes iterator method over functions in ELF symtab.
---
Diffs of the changes:
Index: llvm/lib/Reoptimizer/Inst/PerfInst.cpp
diff -u llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.2 llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.3
--- llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.2 Mon Mar 17 18:49:31 2003
+++ llvm/lib/Reoptimizer/Inst/PerfInst.cpp Mon Mar 31 11:48:34 2003
@@ -1,79 +1,40 @@
////////////////
// programmer: Joel Stanley
-// date: Mon Mar 3 13:34:13 CST 2003
+// date: Fri Mar 21 12:32:01 CST 2003
// fileid: PerfInst.cpp
// purpose: Provides code for performing performance instrumentation at
// runtime. The goal of the phase2 function is to implement Phase 2
// of the performance-oriented language extensions transformation. That is,
// it is responsible for replacing loads of particular global volatiles and
// stores of particular temporaries with appropriate calls to instrumentation
-// functions.
+// functions. More detail is given throughout the implementation.
// []
-#include "llvm/Reoptimizer/TraceCache.h"
-#include "llvm/Reoptimizer/VirtualMem.h"
-#include "llvm/Reoptimizer/InstrUtils.h"
-#include "llvm/Reoptimizer/GetTraceTime.h"
-#include "llvm/Reoptimizer/Mapping/LLVMinfo.h"
-#include "llvm/Bytecode/Reader.h"
-#include "llvm/Module.h"
-#include "llvm/iTerminators.h"
-#include "llvm/Support/CFG.h"
-
+#include <stdlib.h>
#include <iostream>
#include <fstream>
+#include <vector>
+
+#include "ElfReader.h"
using std::vector;
using std::cerr;
using std::endl;
-// Not sure if the following externs are required yet.
-extern int llvm_length;
-extern const unsigned char LLVMBytecode[];
-extern void** llvmFunctionTable[];
-
-extern "C" void phase2(int methodNum)
+extern "C" void phase2()
{
- static bool initialized = false;
- static Module* pMod = 0;
- static vector<Function*> funcList;
-
- cerr << "phase2 invoked" << endl;
+ cerr << "============================== Begin Phase 2 ==============================\n";
- if(!initialized) {
- initialized = true;
-
- cerr << "llvm_length is: " << llvm_length << endl;
-
- pMod = ParseBytecodeBuffer(LLVMBytecode, llvm_length);
- assert(pMod && "Couldn't parse Module");
-
- cerr << "Parsed bytecode" << endl;
-
- // Gather pointers to functions into funcList
- for(Module::iterator i = pMod->begin(), e = pMod->end(); i != e; ++i) {
- if(!i->isExternal())
- funcList.push_back(&*i);
- }
- }
+ const char* execName = getexecname();
+ cerr << "Executable name is: " << execName << endl;
- assert(pMod && "Module must have been parsed");
- assert(funcList[methodNum] && "Have not obtained methodNum'th function in funcList");
+ ElfReader elfReader(execName);
- cerr << "Dumping list of instructions in each function..." << endl;
-
- for(vector<Function*>::iterator i = funcList.begin(), e = funcList.end(); i != e; ++i) {
- cerr << "Processing function " << (*i)->getName() << endl;
- for(Function::iterator bbi = (*i)->begin(), bbe = (*i)->end(); bbi != bbe; ++bbi) {
- for(BasicBlock::iterator ii = bbi->begin(), ie = bbi->end(); ii != ie; ++ii) {
- cerr << "Processing instruction: " << *ii << endl;
- vector<uint64_t> vec = getLLVMInstrInfo(&*ii);
- cerr << "Obtained the following vector from getInstrInfo:" << endl;
- for(unsigned k = 0; k < vec.size(); ++k)
- cerr << vec[k] << endl;
- }
- }
+ std::string funcName;
+ ElfReader::AddressRange range;
+ while(elfReader.GetNextFunction(funcName, range)) {
+ cerr << "Function name is: " << funcName << endl;
}
-}
-
+ cerr << "============================== End Phase 2 ==============================\n";
+}
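(ElfReader.h itself is not reproduced in this diff. From the client code in
phase2() above, its interface would look roughly like the following sketch;
everything beyond the two members actually used -- the constructor and
GetNextFunction -- is an assumption, not the checked-in header.)

  // Hypothetical sketch of the interface implied by phase2(); the real
  // ElfReader.h may differ in its details.
  #include <string>
  #include <stdint.h>

  class ElfReader {
  public:
    // Assumed half-open address range [start, end) of a function.
    struct AddressRange {
      uint64_t start;
      uint64_t end;
    };

    // Opens the named executable and prepares to walk its ELF symbol table.
    explicit ElfReader(const char* execFileName);

    // Advances to the next function symbol, filling in its name and address
    // range; returns false once the symbol table is exhausted.
    bool GetNextFunction(std::string& funcName, AddressRange& range);
  };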
Index: llvm/lib/Reoptimizer/Inst/design.txt
diff -u llvm/lib/Reoptimizer/Inst/design.txt:1.3 llvm/lib/Reoptimizer/Inst/design.txt:1.4
--- llvm/lib/Reoptimizer/Inst/design.txt:1.3 Thu Mar 20 08:49:06 2003
+++ llvm/lib/Reoptimizer/Inst/design.txt Mon Mar 31 11:48:34 2003
@@ -455,9 +455,10 @@
{{{ MEETING MINUTES 20 Mar 2003
Agenda:
- - Address pending issues already sent via e-mail.
- - Confidence of approach, assurance of validity w.r.t time commitment.
- - Inlining of functions and how to handle
+
+ Things not sufficiently addressed in meeting or e-mail yet:
+
+ - Inlining of functions and how to handle
- Register schedule violation; or "how do we determine what registers should
   hold values when we insert code?". Rather, should we simply adopt the policy
@@ -467,28 +468,228 @@
analysis, at what point do we consider phase 2 "too expensive" when compared
with plain old opaque function calls at instrumentation points?
- - From the e-mail(s):
-
- (a) We have to balance the benefit of a vendor-independent implementation
- vs. the opportunity to do something "more conceptually novel" with the
- metrics.
-
- (b) We can discuss instrumenting functions at function entry; of course,
- this point is moot if we do not take the binary editing approach.
-
- What is the purpose of "exit stubs" in Trigger/TraceCache? What is the
role of the branch map and call map?
- As long as the new code fits within the 64KB segment, we have the
capability to add new code right?
-Minutes:
+ {{{ Minutes:
+
+For thesis: We want to obtain a good quality (but not necessarily "production
+quality, fully-featured") tool. The design & implementation of such a tool *is*
+sufficient for the thesis, although the thesis would be significantly
+strengthened if we have "more conceptually novel" metrics in place, such as
+pipeline simulation/metrics. Talked about the feasibility of being done
+sometime in June, but potentially extending to July if need be. Work hard,
+Joel.
+
+Suggested experiment to consider the varieties of indirect branch problems that
+we'll encounter: write a large case statement (lots of code placed in each case,
+duplication is okay) and examine closely the indirect branches in the
+compiler-generated code to determine how the table lookups occur (and if the
+jump table is storing absolute or relative addresses).
+
+Vendor independence is a significant conceptual result for the thesis, and it is
+not "all about using volatile". Rather, the key point to make is w.r.t. the
+approach taken: the approach is two-phase, and transformations are done both
+before the black-box compiler ever sees the code, and after it is done operating
+on the code. Also, note that vendor independence and "uninhibiting
+optimizations" _are_ competing goals (if we didn't care about uninhibiting
+optimizations, we could place opaque calls at the instrumentation points and
+look at them post-link, and if we didn't care about vendor independence we could
+do much better than the two-phase approach by using compiler annotations,
+etc)...our approach attempts to find a good compromise between the two goals.
+
+We want to make a "simple" (i.e. works in the common case) vendor-independent
+implementation.
+
+Can we ensure that a function body remains after optimization (i.e. inlining) by
+printing the address of the function?
+
+New multi-phase approach discussed. We are not going to be using the LLVM
+mapping information, and so must rely on ELF mechanisms. We want these ELF
+mechanisms to be used at runtime, which requires mmapp-ing the executable or
+otherwise loading it from disk. We will probably have to create a reverse
+mapping of the symtable to go from address->function name (rather, function
+size) so that we can obtain information about the function being instrumented.
+The current plan is to transform each instrumented function on a demand-driven
+basis, wherein there'd be a call to the "phase2 transformation function" that
+would pass in the address of the enclosing function. [Actually, that approach
+does not work if the function gets inlined elsewhere, because modification of
+the function body's code will not result in the modification of the code at the
+inlined sites -- let's worry about this later].
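A minimal sketch of building that reverse mapping by walking the ELF symbol
table with libelf/gelf follows. Whether ElfReader actually uses libelf, and all
of the names below, are assumptions for illustration only.

  #include <libelf.h>
  #include <gelf.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <map>
  #include <string>
  #include <stdint.h>

  struct FuncInfo { uint64_t size; std::string name; };

  // Walk the symbol table and record each function symbol's start address,
  // size, and name (error handling and cleanup abbreviated).
  void buildAddressMap(const char* path, std::map<uint64_t, FuncInfo>& out) {
    elf_version(EV_CURRENT);
    int fd = open(path, O_RDONLY);
    Elf* elf = elf_begin(fd, ELF_C_READ, 0);
    for (Elf_Scn* scn = elf_nextscn(elf, 0); scn; scn = elf_nextscn(elf, scn)) {
      GElf_Shdr shdr;
      gelf_getshdr(scn, &shdr);
      if (shdr.sh_type != SHT_SYMTAB)
        continue;
      Elf_Data* data = elf_getdata(scn, 0);
      for (size_t i = 0, n = shdr.sh_size / shdr.sh_entsize; i < n; ++i) {
        GElf_Sym sym;
        gelf_getsym(data, i, &sym);
        if (GELF_ST_TYPE(sym.st_info) != STT_FUNC || sym.st_size == 0)
          continue;
        FuncInfo fi;
        fi.size = sym.st_size;
        fi.name = elf_strptr(elf, shdr.sh_link, sym.st_name);
        out[sym.st_value] = fi;      // keyed by function start address
      }
    }
    elf_end(elf);
    close(fd);
  }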
+
+Talked about the need of a padding region for what is essentially a base
+trampoline, placed at the end of the function and within the range of a
+PC-relative jump (64K distance). Within this region, code would need to be
+placed to jump (indirectly) to the copied region of code (i.e., the
+tracecache). Then, the start of the outermost interval is replaced with a
+short-jump (w/ annulling bit set) down into the padding area, which long jumps
+into the copied code region (which can be allocated on the heap now). This is
+all well and good *after* the transformation has been applied, but what about
+invoking the code that performs this transformation at runtime? Vikram's idea
+is that we can, in fact, do the exact same thing for the transformation itself,
+perhaps...more detail follows.
+
+Assume for the moment that we can locate the pad region easily, and that we can
+distinguish two subregions within it by a label or something. (This isn't true,
+but pretend that it is for now).
+
+...
+
+function entry point:
+ ...
+ ld volatile #1 [start of outermost interval] (***)
+ i2
+ ...
+ ld volatile #2 [end of outermost interval]
+ ...
+
+pad_start_1: padinst 1
+ padinst 2
+ padinst 3
+ ...
+
+pad_start_2: padinst k
+ padinst k+1
+ ...
+
+pad_end:
+ ... (stuff can be moved here by compiler)
+
+function end point: ...
+
+Now, _someone_ (who? a pre-pass of sorts?) writes over the start of the
+outermost interval, branching down to the pad_start_1 location. The pad
+instructions in the pad_start_1 region are over-written with a call to the
+"phase 2 transformation function" that performs all of the transformation on
+this particular function, *including* re-writing the branch at location (***)
+with a branch down to the pad_start_2 region, which contains code to perform the
+longjmp to the heap-allocated instrumented code.
+
+This might work; however, what is the benefit of this approach over simply
+having a _call_ (placed at the function entry point above) to the so-called "phase 2
+transformation function"?
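For concreteness, a rough sketch of the rewrite at location (***): overwriting
the load with a SPARC "ba,a" (branch always, annul bit set) down to the pad
region. This is illustration only; the real implementation would presumably go
through the existing VirtualMem facilities and flush the instruction cache
after the write.

  #include <stdint.h>
  #include <assert.h>

  // Build "ba,a <disp22>" (opcode bits 0x30800000) targeting padAddr from
  // instrAddr, and patch it over the existing instruction. Assumes the
  // displacement fits in the signed 22-bit field and that the page is writable.
  void patchBranchToPad(uint64_t instrAddr, uint64_t padAddr) {
    int64_t disp = ((int64_t)padAddr - (int64_t)instrAddr) >> 2;
    assert(disp >= -(1 << 21) && disp < (1 << 21) && "pad region out of reach");
    uint32_t inst = 0x30800000u | ((uint32_t)disp & 0x003FFFFFu);
    *reinterpret_cast<volatile uint32_t*>(instrAddr) = inst;
  }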
+
+ {{{ E-mail sent to adve on the subject
+
+The implementation approach described at the end of the e-mail is the one that
+I'm going to embark on. If you could provide me with your thoughts regarding
+the approach in general, I'd appreciate it, so I don't waste time writing code
+that we might throw away ;).
+
+Here's my assessment of the situation:
+
+Let F be a function that contains instrumentation. If F is inlined, we have the
+following concerns & proposed resolutions:
+
+ a) It is no longer identifiable by name. This has implications for locating
+instrumentation, as we discussed briefly in a previous e-mail.
+
+I *think* that this can be taken care of by (on program startup, one-time-only)
+processing the ELF symtable to construct a set of address ranges, and then
+(possibly at program startup, possibly on a demand-driven basis, depending on
+our implementation approach) locate individual load instructions (using the "magic
+heuristic") and determine their enclosing functions by looking them up in the
+address range map. Can you think of a better approach or other problems that
+this doesn't address?
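For concreteness, the lookup I have in mind is roughly the following sketch
(the map is keyed by function start address; all names are placeholders):

  #include <map>
  #include <string>
  #include <stdint.h>

  struct FuncRange { uint64_t end; std::string name; };   // end = one past last byte
  typedef std::map<uint64_t, FuncRange> RangeMap;          // start address -> range

  // Return the function whose [start, end) range contains pc, or 0 if none.
  const FuncRange* enclosingFunction(const RangeMap& ranges, uint64_t pc) {
    RangeMap::const_iterator i = ranges.upper_bound(pc);
    if (i == ranges.begin()) return 0;     // pc precedes every known function
    --i;                                   // greatest start address <= pc
    return pc < i->second.end ? &i->second : 0;
  }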
+
+ b) The body of F may now be enclosed in an inner loop, meaning that padding
+(i.e. a small for loop) placed at the end of F becomes the innermost loop in the
+calling function; this may prohibit optimizations. Likewise, if we place a
+function call (i.e., placed by phase 1 for doing the transformation at runtime)
+at the start of F (one of two approaches), an inner loop in the calling function
+now contains a (potentially opaque) function call, which may prohibit
+optimization.
+
+I think this is going to have to be caveat user: if you instrument functions
+that are considered lightweight enough to be inlined by the optimizing compiler,
+then you deal with the consequences. Is this unreasonable or too severe? We can
+minimize the optimization prohibited by a function call (to perform the runtime
+transformation) at the entry to F by not placing a function call there and
+instead using the approach you talked about earlier today (branching down to
+the end-of-function base trampoline).
+
+Current implementation sketch:
+
+[This approach focuses on doing the minimal amount of work (which is still a
+*lot*, I think) at program startup (i.e. "phase 2" using the new phase
+designations) and distributing work on demand ("phase 3"). This should reduce the
+startup cost somewhat, but that's really the only reason that I see for doing
+it]
+
+0. Pad the end of each function that contains instrumentation.
+
+At program startup:
+
+1. mmap the executable (or whatever) and construct the address-range to function
+mapping information.
+
+2. For each function, find the load-volatile instructions that define interval
+and point metrics, and the starting locations of the pad region. At the entry to
+the padded region, place a call to the "phase 3 transformation function", and
+over-write the *first* instance of a load-volatile instruction (for either a
+point or an interval) with a direct branch/annulled delay slot to the start of
+the pad region.
+
+Execution continues. For those functions that are actually called, when
+execution reaches the point where instrumentation should be invoked for the
+first time, they get redirected to the base trampoline which calls the phase 3
+transformation function.
+
+The phase 3 transformation function:
+
+Does all the tracecache-like magic, copying the original code to a region of
+memory where the code can grow, rewriting the pad region so that it will execute
+the indirect jump to the new code region.
+
+Details of the last step are left intentionally opaque, if only because I don't
+know exactly what they entail yet. :P
+ }}}
+
+ }}}
}}}
{{{ IMPLEMENTATION SKETCH
+ {{{ Current implementation sketch:
+
+[This approach focuses on doing the minimal amount of work (which is still a
+*lot*, I think) at program startup (i.e. "phase 2" using the new phase
+designations) and distributing work on demand ("phase 3"). This should reduce the
+startup cost somewhat, but that's really the only reason that I see for doing
+it]
+
+0. Pad the end of each function that contains instrumentation.
+
+At program startup:
+
+1. mmap the executable (or whatever) and construct the address-range to function
+mapping information.
+
+2. For each function, find the load-volatile instructions that define interval
+and point metrics, and the starting locations of the pad region. At the entry to
+the padded region, place a call to the "phase 3 transformation function", and
+over-write the *first* instance of a load-volatile instruction (for either a
+point or an interval) with a direct branch/annulled delay slot to the start of
+the pad region.
+
+Execution continues. For those functions that are actually called, when
+execution reaches the point where instrumentation should be invoked for the
+first time, they get redirected to the base trampoline which calls the phase 3
+transformation function.
+
+The phase 3 transformation function:
+
+Does all the tracecache-like magic, copying the original code to a region of
+memory where the code can grow, rewriting the pad region so that it will execute
+the indirect jump to the new code region.
+
+ }}}
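For step 1 of the sketch above, a minimal illustration of mapping the running
executable read-only (the path comes from getexecname() as in PerfInst.cpp;
error handling is abbreviated and the helper name is made up):

  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  // Map the executable image read-only so its ELF structures can be walked in
  // place; the mapping remains valid after the file descriptor is closed.
  const char* mapExecutable(const char* path, size_t& length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat sb;
    if (fstat(fd, &sb) < 0) { close(fd); return 0; }
    length = sb.st_size;
    void* base = mmap(0, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return base == MAP_FAILED ? 0 : static_cast<const char*>(base);
  }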
+ {{{ Older implementation sketches:
At a high level, in broad sweeping strokes, we're going to use the
trace cache tool as a framework for runtime manipulation of the binary
code. That is, the framework provided by the tracecache allows the
@@ -579,7 +780,7 @@
debugging information (take the address of it in a function, load that
address, look for the address, verify?).
-See TODO list below.
+ }}}
}}}
@@ -591,6 +792,73 @@
{{{ TODO
+- Read EEL paper to get a better feel for binary modification issues
+
+- Use the existing mechanisms at your disposal
+ (ELF/tracecache/BinInterface/VirtualMem/etc) to do the following.
+
+ In phase 1:
+
+ Complete the remainder of the phase-1 actions: building the GBT, handling
+ the sigfuns properly (i.e. adding a pair-of-sigfuns mechanism even for
+ point metrics), compare against by-hand example for phase 1 actions, etc.
+
+ At the end of each instrumented function, immutably pad with a large
+ enough pad region. {Propose doing this as a for loop containing immutable
+ loads}
+
+ On program startup ("phase 2" function called from main()):
+
+ [check] mmap or otherwise load the ELF representation of the program and
+ acquire an ELF descriptor (etc) that will be persistent throughout the
+ program's execution.
+
+ Collect address ranges for all functions, so that when a particular
+ load-volatile instruction is encountered, it can be determined what
+ function it ended up being in. I think that these should be the same
+ virtual addresses as seen within the context of the executing code, but
+ this should be verified.
+
+ ^^^ At this point, the application should be running and, at RUNTIME, spit
+ out (at the very least) the function boundary addresses; preferably, it
+ can spit out the BinInterface-obtained disassembly as well so that we can
+ compare it against the static disassembly.
+
+ For each function, locate the load-volatile instructions that define
+ interval and point metrics (potentially recording some information about
+ them for later use); also find the padding region at the end of the
+ function (this may be hard). Write code into the padding region to call
+ the "phase 3 transformation function", and over-write the *first*
+ load-volatile in the function that corresponds to an instrumentation point
+ (or interval start point) with a direct branch down to the padded region.
+
+ Vikram's comment on this last step:
+
+ [Finding "the first" load-volatile in the function is not easy because of
+ control-flow. Furthermore, I don't think Step 2 needs to find
+ load-volatiles for actual instrumentations at all since many functions may
+ never be executed. We should leave that to step 3.
+
+ Therefore, I would simplify as follows:
+
+ For each function, find the load-volatile instructions that define the
+ entry of the padded region. Over-write the first instruction of the
+ function with a direct branch to a trampoline in the padded region. This
+ trampoline executes the first instruction and then calls the Phase 3
+ routine to instrument the function.]
+
+ On phase 3 transformation function invocation:
+
+    Performs all of the tracecache-like magic, copying the original code to a
+ region of memory where the code can grow, rewriting the pad region so that
+ it will execute the indirect jump to the new code region, etc. The
+ majority of the actions required here are still fairly unclear. To
+ accomplish this step, we must first determine how to make the branch- and
+ call-maps that the TraceCache addTrace() routine(s) require, and how to
+ otherwise use the existing tracecache stuff to accomplish what we want.
+}}}
+
+{{{ COMPLETED TODO ITEMS
- Answer the following questions about the tracecache:
{{{
@@ -618,41 +886,7 @@
because all of the contextual information about a particular function is
obtained via the LLVM mapping information.
- - Perform the following experiement to help answer these questions:
-
- Use the tracecache/BinInterface/VirtualMem/etc mechanisms as they
- currently exist, together with te ELF library and phase 1, to do the
- following:
-
- Insert a call to our phase2 function in main; the phase2 function will
- be responsible for doing all of the binary analysis and
- transformations.
-
- For using ELF mechanisms that we need to use, determine how the
- tracecache is currently (if it is) mmap'ing the executable, and how to
- direct the ELF library to use the executable image in memory instead
- of loading it from disk.
-
- Given the name of a function that exists in the ELF object file,
- obtain its starting and ending address _in the address space of the
- running application_.
-
- ^^^ At this point, the application should be running and, at RUNTIME,
- spit out (at the very least) the function boundary addresses;
- preferably, it can spit out the BinInterface-obtained disassembly as
- well so that we can compare it against the static disassembly.
-
- Copy this address region to the cache and reroute execution,
- preferably modifying some code in the cache so that the rerouted
- execution is apparent during execution. [This step is really the key
- investigatory point: do we need to access the LLVM-bytecode CFG to do
- this? Does the copy mechanism only support a copy of a specified path
- into the cache, or will it operate on an arbitrary CFG/CFG subgraph?]
-
}}}
-
-- Read EEL paper to get a better feel for binary modification issues
-
}}}
{{{ BY-HAND EXAMPLE OF PHASE ACTIONS