[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/design.txt

Joel Stanley jstanley at cs.uiuc.edu
Thu Mar 20 08:42:01 PST 2003


Changes in directory llvm/lib/Reoptimizer/Inst:

design.txt updated: 1.2 -> 1.3

---
Log message:



---
Diffs of the changes:

Index: llvm/lib/Reoptimizer/Inst/design.txt
diff -u llvm/lib/Reoptimizer/Inst/design.txt:1.2 llvm/lib/Reoptimizer/Inst/design.txt:1.3
--- llvm/lib/Reoptimizer/Inst/design.txt:1.2	Mon Mar 17 18:49:31 2003
+++ llvm/lib/Reoptimizer/Inst/design.txt	Thu Mar 20 08:49:06 2003
@@ -1,4 +1,4 @@
-{{{ OVERALL GOALS OF PHASE 2
+{{{ OVERALL GOALS OF PHASE 2 AND GENERAL STUFF
 
   - identify all loads of global volatile variables and the
     corresponding stores to temporaries
@@ -17,7 +17,6 @@
 
   Optimization in the compiler is on. This means that we can't really
   rely on a particular "signature" of generated assembly code.
-
 }}}
 {{{ Problems:
 
@@ -59,9 +58,89 @@
   analysis than we want to do (and potentially, not even then!), so we don't
   overwrite address arithmetic instructions, etc. 
 
-  }}} 
+  }}}  
+  {{{ Inlined functions (not answered)
+
+  How do we best deal with functions that are inlined by the black-box
+  compiler? In particular, the naive approach of recording (in the
+  global bookkeeping table or GBT) the names of instrumented functions
+  in phase 1 so that they can be looked up via the ELF symtab in
+  phase 2 doesn't work if the instrumented functions have been inlined.  We
+  could try a 2-step approach to finding the function bodies that
+  contain instrumentation points: 1) For each function name in the
+  GBT, look it up in the ELF symtab; if present, done, otherwise try
+  step 2.  2) Scan the entire program for load-volatile instructions,
+  obtain the address of those instructions, and then find out what
+  function body address interval contains that address.  This approach
+  seems like a lot of work, but might just do it.
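+
+  A rough sketch of step 1, assuming we do the lookup with libelf on the
+  on-disk executable (the function and variable names here are made up,
+  and most error handling is elided):
+
+    #include <fcntl.h>
+    #include <string.h>
+    #include <unistd.h>
+    #include <gelf.h>
+    #include <libelf.h>
+
+    /* Step 1: look up funcName in the ELF symtab of the executable.  On
+       success, fill in the [start, end) address interval and return 0.
+       If the symbol is absent (e.g. the function was inlined away),
+       return -1 and fall back to step 2: scan for load-volatile
+       instructions and map their addresses back to function bodies. */
+    static int findFuncBounds(const char *exePath, const char *funcName,
+                              unsigned long *start, unsigned long *end) {
+      int found = -1;
+      if (elf_version(EV_CURRENT) == EV_NONE)
+        return -1;
+      int fd = open(exePath, O_RDONLY);
+      if (fd < 0)
+        return -1;
+      Elf *elf = elf_begin(fd, ELF_C_READ, NULL);
+      Elf_Scn *scn = NULL;
+      GElf_Shdr shdr;
+      while (elf && found != 0 && (scn = elf_nextscn(elf, scn)) != NULL) {
+        gelf_getshdr(scn, &shdr);
+        if (shdr.sh_type != SHT_SYMTAB)
+          continue;
+        Elf_Data *data = elf_getdata(scn, NULL);
+        size_t nsyms = shdr.sh_size / shdr.sh_entsize;
+        for (size_t i = 0; i < nsyms; ++i) {
+          GElf_Sym sym;
+          gelf_getsym(data, (int)i, &sym);
+          const char *name = elf_strptr(elf, shdr.sh_link, sym.st_name);
+          if (name && strcmp(name, funcName) == 0) {
+            *start = (unsigned long)sym.st_value;
+            *end   = (unsigned long)(sym.st_value + sym.st_size);
+            found = 0;
+            break;
+          }
+        }
+      }
+      if (elf) elf_end(elf);
+      close(fd);
+      return found;
+    }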
+
+  The previous approach, which may or may not be viable, assumes
+  that we can actually obtain the GBT contents.  Our previous
+  plan for obtaining the GBT base address was to take its address in
+  the documentation function pre-compilation, then (post-compilation)
+  look up the documentation function (by name), parse the load
+  instruction (or however the GBT address was witnessed), and obtain
+  the GBT.  However, this approach is flawed since the documentation
+  function might get removed (if it's dead) or inlined (if it's
+  called).  Perhaps the address of the GBT should just be taken in a
+  non-dead way at the entry to main itself? A write of the GBT address
+  to a volatile global (yet another one!) should ensure that the store
+  isn't removed.
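+
+  A minimal sketch of that idea; 'gbt_addr_witness' and the GBT field
+  are made-up names, not necessarily what phase 1 actually emits:
+
+    struct GBT { int numEntries; /* ...bookkeeping fields... */ } the_gbt = { 0 };
+    volatile unsigned long gbt_addr_witness;  /* yet another volatile global */
+
+    int main(int argc, char **argv) {
+      /* Non-dead use of the GBT at the entry to main: the volatile store
+         can't be deleted by the optimizer, so phase 2 can recover
+         &the_gbt either by parsing this store in the binary or by
+         reading gbt_addr_witness at runtime. */
+      gbt_addr_witness = (unsigned long)&the_gbt;
+      /* ...rest of the program... */
+      return 0;
+    }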
+
+  Another approach for finding the GBT base addr (since we're
+  operating exclusively on ELF) is to simply look it up in the ELF
+  symtab.  This will work because the compiler can't eliminate the
+  static structure contents if the struct is global, since other
+  compilation units may refer to it directly using extern.  However,
+  the linker itself may prevent it from being included in the final
+  executable if there are no references to it. Perhaps we can
+  introduce a benign use of the GBT (taking its address and storing the
+  result into a global volatile) simply to ensure that there is *a*
+  reference to the structure.  E-mail Chris about this.
+
+  {{{ Response from Chris
+
+    > If I've got a global statically-initialized struct that isn't used
+    > anywhere within its compilation unit, any compiler wouldn't be able to
+    > remove it because some other compilation unit may refer to it in an extern
+    > manner, correct?
+    
+    True, unless it's declared static.
+    
+    > However, if the above is the case, *and* no other compilation unit refers
+    > to the struct in an extern manner, a clever linker would be able to delete
+    > the structure because nothing needed to bind to the symbol.  Right?
+    
+    Yes, or simple IPO.
+    
+    > So...I need to introduce a global struct that can't be removed by the
+    > compiler or linker, without changing the semantics of the program.  Then I
+    > need to read the contents of the struct directly from the ELF executable
+    > after looking up its name in the ELF symbol table.  The way I'm currently
+    > planning on doing this is by inserting an un-removable & benign reference
+    > to the global struct in, say, main() so that a clever linker can't remove
+    > it.  Does this sound lame? :)
+    
+    That should work.  Note that normal linkers won't delete these structure
+    references, so it may not even be a problem unless you're trying to be more
+    portable...
+
+    -Chris
+
+  }}}
+
+  KIS concession: Grab the base address of the GBT directly from the ELF symtab,
+  and worry about it getting deleted if/when that actually occurs.
 
-}}} 
+  }}}
+  {{{ Violation of register schedules (issue?)
+
+  What about violation of register schedules when inserting new code?
+  Is this even an issue? 
+
+  }}}
+
+}}}
 {{{ Musings on trampolines:
 
 - Assuming that we can leave the address calculation in place for the
@@ -149,8 +228,8 @@
 instructions would be required to pack the register with the target
 address.
 
-}}}
-{{{ Trampoline-related ideas:
+
+Trampoline-related ideas:
 
 (Thanks Brian!) :)
 
@@ -371,7 +450,44 @@
 didn't use arbitrary instrumentation points, we can do *exactly* what
 MDL can do (we think) without source access.
 
-Implementation sketch:
+}}}
+
+{{{ MEETING MINUTES 20 Mar 2003
+
+Agenda:
+    - Address pending issues already sent via e-mail.
+      - Confidence of approach, assurance of validity w.r.t time commitment.
+      - Inlining of functions and how to handle it
+
+    - Register schedule violation; or "how do we determine what registers should
+    hold values when we insert code?".  Rather, should we simply adopt the policy
+    of 'always spill' or is doing otherwise an optimization that should be
+    considered later?  In particular, if we always spill, AND can't remove the
+    address arithmetic instructions (for volatile temps) without more robust
+    analysis, at what point do we consider phase 2 "too expensive" when compared
+    with plain old opaque function calls at instrumentation points?
+
+    - From the e-mail(s): 
+
+      (a) We have to balance the benefit of a vendor-independent implementation
+      vs. the opportunity to do something "more conceptually novel" with the
+      metrics.
+
+      (b) We can discuss instrumenting functions at function entry; of course,
+      this point is moot if we do not take the binary editing approach.
+
+    - What is the purpose of "exit stubs" in Trigger/TraceCache? What is the
+    role of the branch map and call map?
+
+    - As long as the new code fits within the 64KB segment, we have the
+    capability to add new code, right?
+
+Minutes:
+
+
+}}}
+
+{{{ IMPLEMENTATION SKETCH
 
 At a high level, in broad sweeping strokes, we're going to use the
 trace cache tool as a framework for runtime manipulation of the binary
@@ -469,30 +585,222 @@
 
 {{{ MILESTONES
 
-- Extract and report bookkeeping data structure contents from raw
-compiled binary.
-
-- Determine if/how the tracecache framework can be used for a CFG
-subgraph "copy" to a new area of memory; determine whether or not it's
-worth the effort or whether it should be "done from scratch".
+- Perform the "tracecache experiment" described in the TODO section.
 
 }}}
 
 {{{ TODO
 
+- Answer the following questions about the tracecache:
+    {{{ 
+
+    - To what extent does it use the LLVM bytecode and/or mapping information 
+      to map a particular path into the cache?
+
+      It appears that the code in the TraceCache object itself doesn't require
+      any of the LLVM mapping information.  However, as inputs to addTrace(), it
+      does need a "call map", a "branch map", and a vector called
+      "exitStubs". I'm not clear what the exit stubs are for yet, exactly, nor
+      the precise role of the call/branch maps (although I think they are just
+      the redirected branch destinations or some such thing).  The trigger
+      routine *does* use the LLVM mapping information to construct these maps,
+      so it may be difficult to determine how to form the maps without the
+      specific mapping information...but it might be possible.
+
+    - What kind of modifications would be needed to map an entire function body
+      into the tracecache region such that "hot paths" weren't considered and
+      path activity wasn't tracked?  What kind of dependence does this induce on
+      the LLVM mapping and/or bytecode representation?
+
+      Good news: the "hot path" and LLVM specific stuff seems confined to the
+      trigger routine.  The TraceCache class itself seems to operate on raw
+      instruction ranges, etc.  NB: No mmap'ing of the executable is performed
+      because all of the contextual information about a particular function is
+      obtained via the LLVM mapping information.
+
+    - Perform the following experiment to help answer these questions:
+
+      Use the tracecache/BinInterface/VirtualMem/etc mechanisms as they
+      currently exist, together with the ELF library and phase 1, to do the
+      following:
+
+          Insert a call to our phase2 function in main; the phase2 function will
+          be responsible for doing all of the binary analysis and
+          transformations.
+
+          For the ELF mechanisms that we need to use, determine how the
+          tracecache is currently (if it is) mmap'ing the executable, and how to
+          direct the ELF library to use the executable image in memory instead
+          of loading it from disk (see the sketch following these questions).
+
+          Given the name of a function that exists in the ELF object file,
+          obtain its starting and ending address _in the address space of the
+          running application_.
+
+	  ^^^ At this point, the application should be running and, at RUNTIME,
+	  spit out (at the very least) the function boundary addresses;
+	  preferably, it can spit out the BinInterface-obtained disassembly as
+	  well so that we can compare it against the static disassembly.
+
+          Copy this address region to the cache and reroute execution,
+          preferably modifying some code in the cache so that the rerouted
+          execution is apparent during execution.  [This step is really the key
+          investigatory point: do we need to access the LLVM-bytecode CFG to do
+          this? Does the copy mechanism only support a copy of a specified path
+          into the cache, or will it operate on an arbitrary CFG/CFG subgraph?]
+
+    }}}
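+
+    A possible starting point for the "executable image in memory" part of
+    the experiment above, assuming we use libelf (whose elf_memory() takes
+    a raw in-memory image instead of a file descriptor); names here are
+    illustrative:
+
+        #include <stddef.h>
+        #include <gelf.h>
+        #include <libelf.h>
+
+        /* 'image'/'imageSize' describe the executable bytes already mapped
+           into the running process (however the tracecache ends up doing
+           that).  The returned Elf handle can then be walked just like the
+           on-disk case to recover function start/end addresses at runtime. */
+        static Elf *openInMemoryImage(char *image, size_t imageSize) {
+          if (elf_version(EV_CURRENT) == EV_NONE)
+            return NULL;
+          return elf_memory(image, imageSize);
+        }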
+
 - Read EEL paper to get a better feel for binary modification issues
-- Do sample by hand and revisit actions of both phases
-- Extract bookkeeping data structure contents, function stats/ends,
-  etc, using low-level POSIX/ELF mechanisms.
 
 }}}
 
-{{{ PENDING QUESTIONS
+{{{ BY-HAND EXAMPLE OF PHASE ACTIONS
 
-[What about violation of register schedules when inserting new code?
-Is this an issue?]
+     {{{ High-level code (i.e. no sigfuns):
 
-}}}
+pp_interval<bounded_series, elapsedTimeStart, elapsedTimeEnd, size=20> eth;
+
+void bar() {
+  int cnt = 0;
+
+  {
+    sample eth;
+    while(cnt++ != 15) {
+      foo();
+      printf(...);
+    }
+  }
+  ...
+  printf("avg reading was %f\n", pp_avg(eth));
+}
+
+     }}}
+     {{{ Sigfun-level code (input to phase 1)
+
+void main() {
+  pp_interval("eth", elapsedTimeStart, elapsedTimeEnd, "bounded_series", "size=20");
+}
+
+[[The processing of pp_interval call in main() results in declaration:
+  double eth[20]; 
+which is used by name elsewhere...
+]]
+
+void bar() {
+  int cnt = 0;
+    
+  {
+    pp_sigfun_interval_start("eth", elapsedTimeStart);
+
+    while(cnt++ != 15) {
+      foo();
+      printf(...);
+    }
+    
+    pp_sigfun_interval_end("eth", elapsedTimeEnd);
+  }
+  ...
+  printf("avg reading was %f\n", pp_avg(eth));
+}
+
+     }}}
+     {{{ Post-phase1 code (quasi-high-level)
+
+struct GBT {
+  // fields for GBT go here...
+} the_gbt = { initializer };
+
+volatile global instSite1;     // instSite1 = start of region
+volatile global instSite1_tmp; 
+volatile global instSite2;     // instSite2 = end of region
+volatile global instSite2_tmp; 
+
+double eth[20];
+
+void bar() {
+  int cnt = 0;
+  double z;                    // <-- inserted for the ret val of end of region
+                               // inst call (call inserted by phase 2)
+
+  {
+    instSite1_tmp = instSite1; // <-- record the address of this instSite1; a
+                               // load of this address identifies this location
+                               // in the code; the code:
+                               // double y = elapsedTimeStart() is to be
+                               // inserted here by phase 2 [replacing ld]
+			       // Was: pp_sigfun_interval_start("eth", elapsedTimeStart);
+
+    while(cnt++ != 15) {
+      foo();
+      printf(...);
+    }
+
+    instSite2_tmp = instSite2; // <-- record the address of this instSite2; a 
+                               // load of this address identifies this location
+                               // in this code; the code:
+                               // z = elapsedTimeEnd(&y) is to be
+                               // inserted by phase 2 [replacing ld]
+                               // Was: pp_sigfun_interval_end("eth", elapsedTimeEnd);
+
+    pp_series_add(eth, z);     // inserted by phase 1, uses z even though it
+                               // hasn't been written to. z and eth both exist.
+  }
+  ...
+  printf("avg reading was %f\n", pp_avg(eth));
+}
+
+     }}}
+     {{{ Post-phase2 code (high level)
+
+struct GBT {
+  // fields for GBT go here...
+} the_gbt = { initializer };
+
+volatile global instSite1;     // instSite1 = start of region
+volatile global instSite1_tmp; 
+volatile global instSite2;     // instSite2 = end of region
+volatile global instSite2_tmp; 
+
+double eth[20];
+
+void bar() {
+  int cnt = 0;
+  double z; 
+
+  {
+    double y = elapsedTimeStart(); // <-- y must be alloca'd, ugh.
+
+    while(cnt++ != 15) {
+      foo();
+      printf(...);
+    }
+
+    z = elapsedTimeEnd(&y);
+    pp_series_add(eth, z);   
+  }
+  ...
+  printf("avg reading was %f\n", pp_avg(eth));
+}
+
+     }}}
 
+Aside from the obvious difficulties with phase 2 (find the load locations, etc),
+an additional difficulty exists: we must alloca the temporary for the return
+value of the first instrumentation function for a region.  Originally, I thought
+that this meant we'd need to have (enough) reserved space at the entry of the
+instrumented function to place 'n' alloca calls, one for each temporary.
+However, I believe that since our current approach allows essentially arbitrary
+code to be inserted into the tracecache region (supposedly), we no longer have
+this problem: invoke alloca immediately before the call.  The only problem with
+this is finding available registers to use, something which I don't understand
+at all.  If we must use a temporary that exists at the end of phase 1, that is
+also a possibility, but then we've got to place un-removable uses of all of
+those temporaries at the start of main (or something) so that they do not get
+eliminated. This shouldn't be a problem though.  The more general issue of
+"register schedule violation potential", however, may still be a problem:
+consider taking the address of the alloca'd temporary and passing it to the
+end-region instrumentation function, for example.
 
+}}}
 




