[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/PerfInst.cpp design.txt

Joel Stanley jstanley at cypher.cs.uiuc.edu
Mon Mar 17 18:43:01 PST 2003


Changes in directory llvm/lib/Reoptimizer/Inst:

PerfInst.cpp updated: 1.1 -> 1.2
design.txt updated: 1.1 -> 1.2

---
Log message:



---
Diffs of the changes:

Index: llvm/lib/Reoptimizer/Inst/PerfInst.cpp
diff -u llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.1 llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.2
--- llvm/lib/Reoptimizer/Inst/PerfInst.cpp:1.1	Wed Mar  5 15:23:56 2003
+++ llvm/lib/Reoptimizer/Inst/PerfInst.cpp	Mon Mar 17 18:49:31 2003
@@ -36,7 +36,7 @@
 {
     static bool initialized = false;
     static Module* pMod = 0;
-    static std::vector<Function*> funcList;
+    static vector<Function*> funcList;
 
     cerr << "phase2 invoked" << endl;
     
@@ -52,17 +52,28 @@
 
         // Gather pointers to functions into funcList
         for(Module::iterator i = pMod->begin(), e = pMod->end(); i != e; ++i) {
-            if(!i->isExternal()) {
+            if(!i->isExternal())
                 funcList.push_back(&*i);
-                cerr << "Added to funcList: " << i->getName() << endl;
-            }
         }
     }
 
     assert(pMod && "Module must have been parsed");
     assert(funcList[methodNum] && "Have not obtained methodNum'th function in funcList");
 
-    cerr << "Completed phase2." << endl;
+    cerr << "Dumping list of instructions in each function..." << endl;
+
+    for(vector<Function*>::iterator i = funcList.begin(), e = funcList.end(); i != e; ++i) {
+        cerr << "Processing function " << (*i)->getName() << endl;
+        for(Function::iterator bbi = (*i)->begin(), bbe = (*i)->end(); bbi != bbe; ++bbi) {
+            for(BasicBlock::iterator ii = bbi->begin(), ie = bbi->end(); ii != ie; ++ii) {
+                cerr << "Processing instruction: " << *ii << endl;
+                vector<uint64_t> vec = getLLVMInstrInfo(&*ii);
+                cerr << "Obtained the following vector from getInstrInfo:" << endl;
+                for(unsigned k = 0; k < vec.size(); ++k)
+                    cerr << vec[k] << endl;
+            }
+        }
+    }
 }
 
     


Index: llvm/lib/Reoptimizer/Inst/design.txt
diff -u llvm/lib/Reoptimizer/Inst/design.txt:1.1 llvm/lib/Reoptimizer/Inst/design.txt:1.2
--- llvm/lib/Reoptimizer/Inst/design.txt:1.1	Wed Mar  5 15:23:56 2003
+++ llvm/lib/Reoptimizer/Inst/design.txt	Mon Mar 17 18:49:31 2003
@@ -1,70 +1,498 @@
-The goal of phase 2 is to:
+{{{ OVERALL GOALS OF PHASE 2
 
   - identify all loads of global volatile variables and the
     corresponding stores to temporaries
 
-  - replace the load/store pair (together with address calculations?)
-    with a call to the appropriate function, as determined by the
-    static global data structures.
+  - replace the load/store pair with a call to the appropriate
+    function, as determined by the static global data structures.
+
+  - ensure that the result(s) of the metric function(s) are stored
+    properly and that the locations in which the results are stored
+    correspond properly to the metric use sites.
+
+{{{ Assumptions:
 
-Assumptions:
   We can locate all load instructions in the code, and find the
   corresponding stores.
 
   Optimization in the compiler is on. This means that we can't really
   rely on a particular "signature" of generated assembly code.
 
+}}}
+{{{ Problems:
+
+  {{{ Finding load insts (answered) 
+
+  How can we narrow down the set of load instructions to those instructions that
+  are loading the global volatile variables? In particular, all we see in the
+  load instruction is the register parameter which specifies the register that
+  contains the address being loaded from.  Since we cannot rely on a particular
+  signature of code that is used to place the address in that particular
+  register, we can't simply look at a few preceding instructions to determine
+  the address.  We know that the address must be a constant, but the optimizing
+  compiler may compute the address by simply adding some constant value to some
+  other register (the compiler is very clever and may employ any kind of
+  value-numbering techniques, etc to do this address calculationg...the peephole
+  optimizer kills us here).  Thus, we basically need a way to discover, at each
+  load instruction, whether the address stored in the load register is constant
+  and, if so, what the constant is.  Once we know this, we can lookup the
+  constant in the table and determine what call to insert in its place, etc.
+  This has to be flow sensitive, and possibly interprocedural?  NB: Nothing about
+  this question is unique to doing the modification at runtime, but is rather a
+  fundamental question specific to the approach in general.
+
+  [Answer: we use a heuristic "backwards search" mechanism to discover the
+  constant address used as the pointer operand to the (volatile) load
+  instructions. See the EEL paper for more information] 
+
+  }}}
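
The backwards search described in the answer above ultimately has to recognize the common SPARC idiom for materializing a 32-bit constant: `sethi %hi(addr), %r` followed by `or %r, %lo(addr), %r`. A minimal sketch of just that recognition step (the function name and interface are hypothetical, not part of the current implementation):

```cpp
#include <cstdint>

// Hypothetical decoder: given the two instruction words of a
// "sethi %hi(addr), %r ; or %r, %lo(addr), %r" pair, recover addr.
// Returns false if the words do not match that pattern.
bool recoverSethiOrConstant(uint32_t sethi, uint32_t orImm, uint32_t& addr) {
    // sethi: op (bits 31-30) = 00, op2 (bits 24-22) = 100
    bool isSethi = ((sethi >> 30) == 0) && (((sethi >> 22) & 0x7) == 0x4);
    // or-immediate: op = 10, op3 (bits 24-19) = 000010, i bit (13) = 1
    bool isOrImm = ((orImm >> 30) == 2) && (((orImm >> 19) & 0x3f) == 0x02)
                   && (((orImm >> 13) & 1) == 1);
    if (!isSethi || !isOrImm)
        return false;
    uint32_t hi = (sethi & 0x3fffff) << 10;   // imm22 << 10
    uint32_t lo = orImm & 0x3ff;              // %lo() is the low 10 bits
    addr = hi | lo;
    return true;
}
```

Real code would of course have to walk backwards over intervening instructions and give up when the address register is redefined along the way; this only covers the final pattern match.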
+  {{{ Removing address calculations (answered)
+
+  How can we safely erase the address calculation instructions without knowing
+  that the temporary (or final!) values won't later be used for further relative
+  computations? This essentially requires use-def information to be present, and
+  might even need to cross procedure boundaries.  Talk to Adve about this.  NB:
+  Nothing about this question is unique to doing the modification at runtime,
+  but is rather a fundamental question specific to the approach in general.
+
+  [Answer to the previous two questions: We can't determine safety without more
+  analysis than we want to do (and potentially, not even then!), so we don't
+  overwrite address arithmetic instructions, etc.]
+
+  }}} 
+
+}}} 
+{{{ Musings on trampolines:
+
+- Assuming that we can leave the address calculation in place for the
+time being, here is a description of the *actual* problem:
+
+The code segment looks like this:
+
+Snippet 1:
+
+I1
+I2
+ld (of volatile)
+I3
+I4
+...
+store (to temporary)
+branch 
+...
+
+Now, we must over-write the store instruction with a nop, which
+shouldn't pose a problem.  The primary problem arises when we want to
+over-write the load with a call to an instrumentation function.  This
+call requires a delay slot to follow, and we can't grow the code
+size.  We can use a trampoline approach as in DynInst, but this
+doesn't really solve the problem because the branch-to-tramp
+instruction which would replace the load *also* requires a delay
+slot.  We need to determine how the DynInst folks solved this
+problem.  Assuming that we don't have the special case (which will
+have to be addressed separately):
+
+load (of volatile)
+branch
+store (to temporary)
+
+[this is a problem because there *can* be no delay slot -- perhaps use
+some kind of special un-eliminable instruction to pad between the load
+and branch so that this doesn't occur; whatever]
+
+Then the problem is solved _as long as the (non-branch) instruction
+following the load is not the target of some other branch_.  That is,
+snippet 1 would become:
+
+ 	 I1          ____> tramp code
+ 	 I2         /        ....
+ 	 br   -----/       I3
+     __> nop              /
+     |	 I4  <-----------/
+     \	 ...
+      -- branch
+         ...
+
+The branch-to-tramp is unconditional, so we only have a problem when
+we get to the branch that jumps to the previous location of I3: I3
+will not get executed because the nop is in its place.  Vikram has
+suggested the use of an "annullable" delay slot wherein the branch
+instruction delay slot only executes if the branch *is*/*is not*
+taken.  We must look in the Sparc ISA to determine whether or not this
+is feasible.  If that turns out to be a dead end, we'll examine how
+DynInst *must* have solved this problem and see if we can use their
+solution.
+
+From the Sparc V9 arch spec: For unconditional branches with an
+"always" specified condition (the kind of branch we'd be using for the
+tramp), if the annul bit is 1, the instruction in the delay slot
+*never* executes.  If the annul bit is 0, the instruction in the delay
+slot is always executed.
+
+Hypothesis: Assuming we don't have the branch-imm-after-load inst
+special case described above, it is my contention that all we really
+need to do is over-write the load instruction with an unconditional
+branch with the annul bit set to 1 (to signify "never execute").  We
+say never execute because I3 must be executed *after* the call to the
+instrumentation function in order to preserve the original semantics.
+Thus, our trampoline code calls the instrumentation function and
+executes I3 in its epilogue.  Execution resumes at I4, and I3 is still
+a valid target for a later branch instruction.
+
+The new problem: This probably works, except that the base trampoline
+is probably not within range (can it be?) of a PC-relative jump from
+the load instruction we're trying to replace.  The MDL paper seems to
+indicate that it "often takes multiple instructions" for a jump. Why
+would this be? Figure this out. According to the arch spec, all branch
+instructions are PC relative.  I assume that they are referring to the
+need to use a long jump (JMPL), which means that the extra
+instructions would be required to pack the register with the target
+address.
+
+}}}
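
For reference, overwriting the load as hypothesized above means emitting a Bicc word with the annul bit set. A sketch of that encoding (the helper name is hypothetical):

```cpp
#include <cstdint>
#include <cassert>

// Encode a SPARC "ba,a <target>": branch-always with the annul bit
// set, so the delay-slot instruction (the original I3) never executes.
// byteOffset is the PC-relative distance to the trampoline, in bytes.
// Bicc format: op = 00, a = 1, cond = 1000 (always), op2 = 010, disp22.
uint32_t encodeBranchAlwaysAnnulled(int32_t byteOffset) {
    assert(byteOffset % 4 == 0 && "SPARC branch targets are word-aligned");
    uint32_t disp22 = static_cast<uint32_t>(byteOffset >> 2) & 0x3fffff;
    return (1u << 29)      // a (annul) bit
         | (0x8u << 25)    // cond = always
         | (0x2u << 22)    // op2 = Bicc
         | disp22;
}
```

Note that disp22 is a 22-bit word offset, reaching only about +/-8 MB from the branch, which is exactly the "is the base trampoline within range?" concern raised above.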
+{{{ Trampoline-related ideas:
+
+(Thanks Brian!) :)
+
+1. Call wrapper
+
+Ignoring the ugly special case mentioned above (load followed by
+branch), what about using a call instruction to invoke a piece of
+code that simply wraps the call to the instrumentation function +
+whatever instruction needed to go into the delay slot?  Would have to
+work out the details of return values from the calls, etc, but the
+gist is as follows:
+
+I1
+I2
+ld (of volatile)
+I3
+I4
+...
+store (to temporary)
+branch 
+...
+
+Would become
+
+I1
+I2
+call wrapperFunc
+nop
+I4
+...
+store (to temporary)
+branch
+...
+
+Where wrapperFunc's body is:
+  call instrumentationFunc
+  nop
+  I3
+  ret
+
+This solution is really using a wrapper call instead of a base
+trampoline code segment, and it avoids the problem of having to use a
+longjmp. 
+
+The remaining problem: branches (direct or indirect) to I3 in the
+original code.  This may kill this solution.
+
+2. Statically-generated base trampoline.
+
+We know what the instrumentation points are, because we don't have
+DynInst's need to instrument points dynamically, at runtime.  Thus,
+what if phase 1 were to (statically) generate what was effectively the
+"base tramp" code, ensuring that it was within PC-relative jump range?
+Thus our code snippet:
+
+I1
+I2
+ld (of volatile)
+I3
+I4
+...
+store (to temporary)
+branch 
+...
+
+Would become:
+
+I1
+I2
+br baseTrampN (annul bit set so that I3 is *never* executed)
+I3
+I4
+...
+store (to temporary)
+branch 
+...
+
+
+baseTrampN:
+  ...
+  call instrumentationFunc
+  I3
+  direct branch back to I4
+
 Problems:
+  I3/I4 aren't known at phase 1, are they? In our code example, I3/I4
+refer to the post-optimization instructions at the machine code level,
+not LLVM bytecode instructions, so this approach may be flawed from
+the start.
+
+Question: Is there any way to guarantee a large enough region of
+reserved space in the in-range-of-PC-relative-jump base tramp code so
+that the address of I4, etc, could be obtained...? We could
+potentially write that code in as well...perhaps using more volatile
+loads. :) That is, we generate
+
+baseTrampN:
+  call instrumentationFunc
+  load volatile
+  load volatile
+
+Where the last two load instructions are over-written with I3 and the
+branch to I4.  The only problem here is making sure that the
+(arbitrary) compiler doesn't eliminate the generated code for the
+baseTrampN, which is effectively dead when it witnesses it.
+
+3. Entire duplication of function + modification
+
+In order to avoid re-writing the code segment entirely, we could copy
+the entire function body (for each function which had instrumentation)
+to a new area in memory where we could add arbitrary instructions as
+desired, updating offsets as needed.  The original function would be
+nop'd over with the exception of the call/longjmp to the duplicated
+code.  Any direct branches into the old function would need to be
+remapped to the new location (may not be possible) -- are indirect
+branches an issue?  This is a really wasteful and somewhat gross
+approach.
+
+(NB: Requires knowing the size of function bodies)
+
+}}}
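
Idea 3 (copying an entire function body elsewhere) requires re-aiming PC-relative branches whose targets remain in the original code. A hedged sketch of that fixup for a single Bicc word, assuming we already know the instruction's old and new addresses (names are hypothetical):

```cpp
#include <cstdint>

// Recompute the disp22 field of a SPARC Bicc instruction after the
// instruction has been moved from oldPC to newPC, so it still reaches
// its original (absolute) target.
uint32_t retargetBicc(uint32_t insn, uint32_t oldPC, uint32_t newPC) {
    // Sign-extend the 22-bit word displacement out of the low bits.
    int32_t disp22 = static_cast<int32_t>(insn << 10) >> 10;
    uint32_t target = oldPC + (static_cast<uint32_t>(disp22) << 2);
    int32_t newDisp = static_cast<int32_t>(target - newPC) >> 2;
    return (insn & ~0x3fffffu) | (static_cast<uint32_t>(newDisp) & 0x3fffff);
}
```

The same computation can overflow disp22's range once the copy is far from the original, which is one concrete reason the "wasteful and somewhat gross" verdict above may understate the difficulty.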
+{{{ Considering an LLVM-centered approach:
+
+[Arguments and justifications -- most of these thoughts were sent to
+Vikram on 12 Mar 2003]
+
+I perceive the following issues to be orthogonal:
+
+1. Language-level support for metric description & queries.
+
+2. Attempting to enable more compiler optimizations due to our
+approach (i.e. not prohibiting certain optimizations due to the
+presence of opaque function calls, etc).
+
+3. Vendor compiler independence.
+
+4. Runtime support for the standard library of metrics.
+
+In our approach thus far, we've attempted to kill two birds (#2 and
+#3) with one stone by using variable volatility coupled with a
+proposed post-link modification of the binary executable.  Since I
+believe #2 and #3 to be orthogonal, let us see what happens if we
+significantly diminish the importance of #3, and consider how we may
+benefit from doing so.  In particular, I'd like to consider using only
+LLVM as the compiler infrastructure of choice, and completely ignoring
+other compilations systems which will optimize the code significantly
+more than LLVM.
+
+It is my belief that the real "meat" of our research lies in exploring
+the capabilities and limitations uncovered when performance aspects of
+the program are exposed to the programmer through the language and
+runtime system (that is, #1 and #4 above), as opposed to a traditional
+library-based approach.  In particular, what happens when we expose
+lower-level runtime execution properties as metrics to the application
+itself?  By "lower-level runtime execution properties", I mean things
+like predicted pipeline behavior, cache behavior, TLB stats, etc., but
+particularly those that would require the role of the compiler in the
+process, such as the pipeline behavior prediction and analysis.
+Again, we're to focus on the primary
+differences/advantages/disadvantages that arise from NOT using the
+traditional library approach.
+
+With this in mind, I see the prevention of inhibited compiler
+optimizations as a "bonus", or at least a peripheral issue.  If we can
+make things better than the library approach with respect to inhibited
+optimizations, great! However, I don't believe that this is really the
+"selling point", especially w.r.t. our current volatile approach,
+which was used to obtain both #2 and #3 in the list above, because its
+efficacy depends greatly on the compiler implementor's interpretation
+of the spec w.r.t. volatility (i.e. the particular implementation
+determines whether or not anything is truly better over the
+traditional approach).  We know that most compilers will do better
+with a load of a volatile in place of an opaque call, and may wish to
+leverage that, but I don't think that this is central to the validity
+of the work overall.
+
+If we tie ourselves to a particular research compiler, we can still
+obtain or heavily pursue #2 by using annotations or some other
+mechanism which allows us to denote a call site (to an instrumentation
+function) as "side-effect-free", and this can enable whatever
+optimizations are performed by the compiler.  This seems like a
+reasonable compiler capability to encourage in vendor compilers in
+general, and I don't see why we can't model it as a "beneficial
+feature" which allows us to capitalize on #2, but is not something
+which is fundamental to the work.
+
+If we do "tie" ourselves to LLVM, as we're already talking about doing
+in some sense with our use of the reoptimizer and (most likely) our
+reliance on the LLVM-to-Sparc mapping information, we may also gain
+short-term implementation feasibility by side-stepping the complex
+binary editing issues.
+
+}}}
+
+}}}
+
+{{{ MEETING MINUTES 14 Mar 2003
+
+We discussed the primary differences between our work and the MDL
+language/compiler system (by the DynInst folks).  We discussed that
+there are some fundamental differences in the capabilities of the two
+systems.  Primarily,
+  - feedback of performance data cannot occur in MDL
+  - MDL cannot instrument (or currently doesn't) arbitrary points
+  - Cannot enable/disable instrumentation w.r.t control flow
+
+Also, in terms of syntax and specification distinctions:
+  + our approach uses "data structure" primitives & value histories
+  - they've got hierarchical constraints and method-globbing
+    mechanisms
+  - "decoupling" of metric description from instrumentation point.
+
+Vikram expressed the need to clearly demonstrate the differences
+between the two approaches in any write-up that we do.  It will be
+vital to "pull apart" the issues, referring to orthogonality as much
+as possible, and decide exactly what is distinct.  Also, he pointed
+out that if our Phase1 didn't process the source directly and we
+didn't use arbitrary instrumentation points, we can do *exactly* what
+MDL can do (we think) without source access.
+
+Implementation sketch:
+
+At a high level, in broad sweeping strokes, we're going to use the
+trace cache tool as a framework for runtime manipulation of the binary
+code.  That is, the framework provided by the tracecache allows the
+selection of a path from a CFG, its subsequent mapping into an
+arbitrarily-sized piece of memory, where it can be manipulated.
+Although the tracecache takes a CFG subgraph and maps it into this new
+region of memory, it does so in a "straight-line" manner for the hot
+path, rewriting intra-region branches as needed.  We're not making any
+assumptions about a particular path being executed, and so we'll
+actually be mapping the entire CFG subgraph into the new region of
+memory, keeping the branch structure intact.
+
+We're currently operating under the assumption that the delimiting
+markers (of the scope or point at which a metric evaluation call is to
+be placed) for the region of code we wish to transform are visible to
+the runtime system (somehow).  Assuming we have a start and end point,
+[or WOLOG end points (plural) because of multiple exit points of the
+target function] then we may demarcate the region of the CFG that is
+to be transformed.  Call this CFG region R. We will be mapping the CFG
+into the new memory region.  Before transformation, execution is
+semantically equivalent w.r.t the original code in memory and the new
+(copied) code in memory.
+
+We also assume that it is possible to "pad" the start point of R such
+that longjump instructions may be placed there to reroute execution to
+the copied area of memory.  The trace-cache is supposed to handle this
+rerouting, but the facility must exist regardless of who provides it:
 
-  How can we narrow down the set of load instructions to those
-  instructions that are loading the global volatile variables? In
-  particular, all we see in the load instruction is the register
-  parameter which specifies the register that contains the address
-  being loaded from.  Since we cannot rely on a particular signature
-  of code that is used to place the address in that particular
-  register, we can't simply look at a few preceding instructions to
-  determine the address.  We know that the address must be a constant,
-  but the optimizing compiler may compute the address by simply adding
-  some constant value to some other register (the compiler is very
-  clever and may employ any kind of value-numbering techniques, etc to
-  do this address calculationg...the peephole optimizer kills us
-  here).  Thus, we basically need a way to discover, at each load
-  instruction, whether the address stored in the load register is
-  constant and, if so, what the constant is.  Once we know this, we
-  can lookup the constant in the table and determine what call to
-  insert in its place, etc.  This has to be flow senstive, and
-  possibly interprocedural?
+   Original code:
  
-  How can we safely erase the address calculation instructions without
-  knowing that the temporary (or final!) values won't later be used
-  for further relative computations? This essentially requires use-def
-  information to be present, and might even need to cross procedure
-  boundaries.  Talk to Adve about this.
-
-  NB: Neither of the above two questions are unique to doing the
-  modification at runtime, but rather problems with performing the
-  transformation "in general".
-
-  Concrete problems with the runtime framework:
-    - What is the implementation status? What works, what doesn't
-      work?
-
-MILESTONES
-
-  - Prototype implementation for Reopt-based phase 2. Must address
-  above problems to fully realize this.  However, the first step is to
-  get familiar with the system and record what the concrete problems
-  are that arise.  To this end, the first step:
-
-    - A function which can be called by the target application which
-    performs "self-examination" of the program: either dumping out
-    instructions to stderr for comparison with disassembly, or
-    something close to it.  The primary purpose of this is to explore
-    the current implementation status of the system and to assess its
-    capabilities and suitablity for the full-blown phase 2
-    implementation.  Utilizes the VirtualMem object, etc, and is
-    modelled after the trigger routine in
-    Reoptimizer/Trigger/Trigger.cpp.
+           |                                  |            
+           V			              V            
+  (start of CFG REGION R)  =====>    (start of R) ---> new code
+          /   \			             /   \          |
+         ( CFG REGION )		            ( CFG REGION )  |
+   (end of CFG REGION R)	      (end of R)      <-----/
+
+Here are the proposed steps of the transformation.  
+
+1. Given a code segment S that corresponds to CFG region R, copy S
+into new memory region S'.  Any direct branches in the region are
+going to be PC-relative, while jumps out of the region may need to be
+rewritten.  Indirect jumps will have to be heuristically handled
+("backwards address extraction", read EEL paper for more info).
+
+2. Indirect jumps into the original code are not a problem: the
+modified code simply does not get executed (until the next iteration
+if in a loop).  This is trivial for point instrumentation, but for
+interval instrumentation it is taken care of by the setting of a flag
+at the entry to the region and the check of the flag before calling
+the end-instrumentation function. [Actually, this doesn't seem to be
+an issue at all, come to think of it: the original code just has the
+extraneous loads of volatiles or whatever; it doesn't contain any
+calls. We will have to guard any uses of the metric variables, though
+(i.e. use in a stat function, etc).  This step requires more
+evaluation.]
+
+3. Code is grown within the new code region, and branch
+targets/offsets are updated appropriately, using heuristics where
+possible.  The loads of selected volatiles are replaced with calls to
+the proper instrumentation function.  Those load instructions
+themselves are selected by determining what address they are loading,
+a constant value which is (again) heuristically determined.
+
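
The load-for-call substitution in step 3 boils down to computing a CALL word to place at the load's address. Since SPARC's CALL uses a 30-bit word displacement, the entire 32-bit address space is reachable from any call site. A sketch, with hypothetical names:

```cpp
#include <cstdint>

// Compute the instruction word that replaces a selected load: a SPARC
// CALL to the instrumentation function. loadPC is the address of the
// load being overwritten. CALL format: op = 01, disp30 = PC-relative
// word offset from the call instruction itself.
uint32_t encodeCallTo(uint32_t loadPC, uint32_t instrFuncAddr) {
    int32_t byteOffset = static_cast<int32_t>(instrFuncAddr - loadPC);
    return (0x1u << 30)
         | (static_cast<uint32_t>(byteOffset >> 2) & 0x3fffffff);
}
```

This sidesteps the range problem the trampoline discussion worried about for Bicc, though the delay-slot and return-value issues discussed earlier still apply.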
+In all of this, an attempt is made to avoid use of the LLVM-to-Sparc
+instruction mapping.  This means using lower-level POSIX, ELF, etc,
+mechanisms to access everything we need (the global table of
+bookkeeping information, the names of functions to instrument, the
+starting locations of those functions and their ending locations/size)
+to start with.  We must also be able to identify the range of the code
+segment to which a particular transformation is supposed to apply to
+(since the tracecache maps a CFG into a new memory area, how the frell
+are we supposed to "not use" the LLVM mapping information when the
+tracecache construction seems to require it?), since the tracecache
+presumably needs the corresponding CFG region, etc. Look into this.
+
+Start here:
+
+The first order of business is to determine the feasibility of this
+approach for the long haul.  We must first see what information is
+needed by the tracecache module to create the new area of memory and
+map code into it.  After this is done, we will have a better idea of
+what information must be provided to it by our preliminary binary
+analysis.  After this, we need to read up on ELF/POSIX/etc mechanisms
+for reading the object file and determining things about it (the size
+of functions, for example).  Also, we must figure out how to get
+access to the global, static table of bookkeeping information without
+debugging information (take the address of it in a function, load that
+address, look for the address, verify?).
+
+See TODO list below.
+
+}}}
+
+{{{ MILESTONES
+
+- Extract and report bookkeeping data structure contents from raw
+compiled binary.
+
+- Determine if/how the tracecache framework can be used for a CFG
+subgraph "copy" to a new area of memory; determine whether or not it's
+worth the effort or whether it should be "done from scratch".
+
+}}}
+
+{{{ TODO
+
+- Read EEL paper to get a better feel for binary modification issues
+- Do sample by hand and revisit actions of both phases
+- Extract bookkeeping data structure contents, function starts/ends,
+  etc, using low-level POSIX/ELF mechanisms.
+
+}}}
+
+{{{ PENDING QUESTIONS
+
+[What about violation of register schedules when inserting new code?
+Is this an issue?]
+
+}}}
 
 
 




