[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/lib/design.txt
Joel Stanley
jstanley at cs.uiuc.edu
Tue Jun 24 10:36:00 PDT 2003
Changes in directory llvm/lib/Reoptimizer/Inst/lib:
design.txt updated: 1.15 -> 1.15.2.1
---
Log message:
---
Diffs of the changes:
Index: llvm/lib/Reoptimizer/Inst/lib/design.txt
diff -u llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15 llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15.2.1
--- llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15 Sun May 18 12:45:26 2003
+++ llvm/lib/Reoptimizer/Inst/lib/design.txt Tue Jun 24 10:35:06 2003
@@ -880,91 +880,70 @@
{{{ MILESTONES
+ - Experiments
+ - The Paper
+ - The Thesis
}}}
{{{ TODO
- - Move statically-sized spill regions so that they are internal to SparcInstManip.
- (do not need variable-sized spill region except for phase5 invocations)
+ In priority order:
- - Start table-of-stacks implementation for phase4 authorship of phase 5 slots.
- - Placed on hold temporarily because of the "alloca-finding" approach. However, see the
- following e-mail for the current state of things:
-
- {{{ E-mail regarding alloca-finding and table-of-stacks approach
-Okay, this is starting to seem intractable. I have another problem that
-I don't think can be resolved without resorting to a custom-stack
-mechanism that will incur prohibitive overhead.
-
-Everything is working for start-region instrumentation sites. For
-end-region instrumentation sites, however, there's a problem. In order
-to write the slot for end sites, I have to know (or know how to compute)
-the address of the return value of the corresponding start site. I had
-originally thought that I would just store this in the GBT, or
-"something", but I clearly didn't think through the problem well enough.
-
-There are only two ways I can think of to do this:
-
-(a) Write the effective address of the return value of the start inst
-func, so that it gets passed to the end inst func.
-
-or
-
-(b) Somehow encode the stack offset to the return value from the start
-inst, where the offset is from the %sp *at the end-region site*
-
-Both of these have problems.
-
-First, I don't think (b) can work at all, given that there may be
-allocas present in the original application that would change the %sp,
-and thus the offset value that we'd need, and we can't statically
-determine exactly which allocas are executed.
-
-For (a), the effective address isn't known until runtime. We can store
-this address in some global table where the phase 4 invocation for the
-end site can find it, but it is not sufficient to have a single scalar
-address here -- we must have a stack, due to potential recursive
-invocations. I think that this is clear; please let me know if I'm not
-making sense. :)
-
-Hence, we'd need to maintain a stack of effective addresses, which was
-pushed during the execution of phase 5 for the start site, and then read
-and popped during the execution of phase 5 for the end site. We're
-already really bloated with how many instructions we've got going on for
-all of the spills, etc, and I'm concerned about the effect that this
-stack manipulation will have on our overhead, as we talked about before.
-
-The way I see it, we only have two options if we're to make forward
-progress and not obliterate our chances of having lower overhead
-numbers (though hopefully better choices will turn up). In the interests of
-short-term forward progress, I'm going to go with #1 for now.
-
-#1 - Make another common-case assumption that there will be no allocas
-between start and end sites, on *any* control path. If this is the case,
-then we know that the stack pointer will not have been manipulated (I
-think) between the start and end sites, and so the %sp offsets to the
-requisite data will be unchanged from when the phase 5 step occurred
-for the start site.
-
-#2 - Just implement our fall-back solution that everything seems to be
-pointing to. I'm not sure exactly what other logistic nightmares might
-be entailed in this, though, because I've only a sketch of the idea.
-
-I wanted to point out, also, that the so-called "fall back" approach we
-discussed previously also involves manipulation of a stack at runtime
-(push/pop actions still have to occur at runtime), so perhaps the stack
-of effective addresses is less prohibitive than I thought, if only in
-the sense that we cannot avoid it. :(
- }}}
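+
+ As a concretization of the stack-of-effective-addresses idea from the e-mail
+ above, here is a minimal C sketch of the per-site stack that phase 5 at a
+ start-region site would push and phase 5 at the matching end-region site
+ would pop. All names and the fixed depth bound are illustrative assumptions,
+ not part of the actual runtime:
+
+#include <assert.h>
+
+#define MAX_DEPTH 256 /* assumed bound on recursive invocations */
+
+typedef struct {
+    void    *addrs[MAX_DEPTH]; /* effective addrs of start-site retvals */
+    unsigned top;
+} SiteAddrStack;
+
+/* Pushed by the phase 5 slot at the start-region site. */
+static void pushStartAddr(SiteAddrStack *s, void *ea) {
+    assert(s->top < MAX_DEPTH && "recursion deeper than assumed bound");
+    s->addrs[s->top++] = ea;
+}
+
+/* Read and popped by the phase 5 slot at the end-region site. */
+static void *popStartAddr(SiteAddrStack *s) {
+    assert(s->top > 0 && "end site crossed without a matching start");
+    return s->addrs[--s->top];
+}
+
+ The push/pop pair above is exactly the per-crossing runtime cost the e-mail
+ worries about.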
-
- - Write phase 5 stuff for end-region sites -- will assume that no allocas lie between
- the start and end sites, which is not a particularly fair assumption.
-
- - Optimizations:
- - No need to save registers (other than those clobbered) in phase 3 slot, since phase 3
- is invoked at the start of the function. Must still spill/restore shared, though.
- - No need to save registers (other than those clobbered) in general.
+ 1) New implementation for instrumenting at function-level granularity
+ 2) Apache through LLVM, experiments
+ 3) Writing, writing, writing: ICS version 2 paper, do
+ a) outline
+ b) intro
+ c) language section
+ d) compiler section
+
+ The three top-level items above are more-or-less interchangeable. However, the
+ experiments cannot be completed until the "instrumentation @ function-level
+ granularity" implementation is done, and there should be an emphasis on getting
+ Apache through LLVM in the short term because Chris is leaving for Norway in
+ early July.
+
+ However, *all* of the writing (including experimentation sections) for ICS v2 must be
+ done by the end of June, so that Adve can approve it, make the desired corrections,
+ etc., and I can get started on thesis authorship and submission. Realistically,
+ the timeframe should look something like this:
+
+ Week of 6/10 (5 days): Implementation, Apache w/ bug reports
+ Week of 6/16 (6 days): Implementation, Small example, continue Apache & do tests
+
+ -- At this point, all experiments should be more-or-less completed --
+
+ Week of 6/23 (6 days): Write, write, write. 3 days nonstop for content, 3
+ days of Adve making corrections, etc.
+
+ Monday, 6/30 is D-Day...
+
+ Schedule revision 13 Jun 2003: Apache is in stasis waiting to hear back from
+ Chris. Some effort might be expended to see if any more work can be done on compiling
+ Apache even though the build process currently fails altogether. Implementation for
+ instrumentation at function-level granularity is in stasis until I hear back from
+ Vikram. This leaves three immediate options open until either of the two is
+ resolved, at which point forward progress on either the implementation or Apache
+ should be made.
+
+ Option 1) POV-Ray through LLVM.
+ Option 2) Writing
+ Option 3) "Small example"
+
+ Option 3 really shouldn't be undertaken until we know if obtaining things like I/O
+ elapsed time (via function-level) is possible, or until I can talk with Vikram in our
+ next meeting about what the heck this nebulous example should look like. This
+ leaves only options 1 and 2 above as viable.
+
+ POV-Ray should be an easy thing to start, and would be good both as a fall-back
+ if Apache isn't possible and as a useful additional example if it is. This
+ would also give V & C some time to respond to the pending queries.
+
+ - Optimizations:
+   - No need to save registers (other than those clobbered) in phase 3 slot,
+     since phase 3 is invoked at the start of the function. Must still
+     spill/restore shared, though.
+   - No need to save registers (other than those clobbered) in general.
}}}
@@ -1460,5 +1439,226 @@
Also, Chris remarked that any novel page-management mechanisms (for the
code that is jumped to in newly-allocated pages) that I devise should perhaps be
integrated into the LLVM JIT if they are suitable.
+
+}}}
+
+{{{ Experiments
+
+We must devise experiments that demonstrate the novel aspects of the work. We
+are currently planning on using Apache and/or POV-Ray, and demonstrating how a
+"deep" performance analysis can be encoded using performance primitives. A
+"deep" performance analysis is one which essentially (using Hollingsworth's
+terminology from his PhD thesis) gets at the "why, where, and when" aspects of
+performance bottlenecks. However, instead of doing this at the "arbitrary
+program during execution" level, as the W^3 search model does, we will encode
+these performance aspects at the application level itself.
+
+Here's what Vikram suggested for a good start:
+
+ I just thought that the examples of performance issues he explores in his
+ automatic search would give you (a) some insights into what performance issues
+ make sense to consider, and (b) some ideas about how to do a systematic
+ diagnosis, albeit at the application level.
+
+ But isn't detecting the "why, where, and when" of performance bottlenecks
+ pretty closely related to the goal of performance diagnosis? We'd be looking
+ for bottlenecks too, except that we can use application domain information, we
+ can look for bottlenecks at the algorithmic level instead of the general
+ system level, and we can record it permanently in the program as a first class
+ feature of the program.
+
+ Anyway, about your second question: here's a way to do what I was suggesting:
+
+ -- think about 2-3 key performance issues with Apache (or POV-Ray) that you'd
+ want to diagnose, e.g., cache misses, TLB misses, thread overhead (estimating
+ that could be interesting), I/O delays
+
+ -- if those issues make sense with a small sort example, try to diagnose those
+ issues in the small example first; e.g., I think cache misses, TLB misses,
+ and I/O delays would all be issues if you were sorting a huge file of some
+ kind.
+
+ This is purely to give you a small, well-understood code to try out before
+ going to the big ones where it may be difficult to know, when one diagnosis
+ attempt fails, whether it failed because you misunderstood the performance
+ issues or because the guess was wrong or both.
+
+TLB misses aren't an option, because we cannot get at them with PAPI. Cache
+misses are available, so that'd certainly be a good place to start. As for I/O
+delays, I have no idea how we'd measure those. The following is the
+comprehensive list of the low-level metrics that are exposed to us via PAPI (a
+usage sketch follows the list). On our own, we can support simple things like
+elapsed time, load average, etc. I'm not altogether clear on how we'd determine
+how much time (for a particular region) was spent doing I/O-bound activities...
+
+Number of hardware counters: 2
+Name: PAPI_L1_ICM Description: Level 1 instruction cache misses
+Name: PAPI_L2_TCM Description: Level 2 cache misses
+Name: PAPI_CA_SNP Description: Requests for a snoop
+Name: PAPI_CA_INV Description: Requests for cache line invalidation
+Name: PAPI_L1_LDM Description: Level 1 load misses
+Name: PAPI_L1_STM Description: Level 1 store misses
+Name: PAPI_BR_MSP Description: Conditional branch instructions mispredicted
+Name: PAPI_TOT_IIS Description: Instructions issued
+Name: PAPI_TOT_INS Description: Instructions completed
+Name: PAPI_LD_INS Description: Load instructions
+Name: PAPI_SR_INS Description: Store instructions
+Name: PAPI_TOT_CYC Description: Total cycles
+Name: PAPI_IPS Description: Instructions per second
+Name: PAPI_L1_DCR Description: Level 1 data cache reads
+Name: PAPI_L1_DCW Description: Level 1 data cache writes
+Name: PAPI_L1_ICH Description: Level 1 instruction cache hits
+Name: PAPI_L2_ICH Description: Level 2 instruction cache hits
+Name: PAPI_L1_ICA Description: Level 1 instruction cache accesses
+Name: PAPI_L2_TCH Description: Level 2 total cache hits
+Name: PAPI_L2_TCA Description: Level 2 total cache accesses
+
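+As a concrete reference point, here is roughly what reading two of the above
+counters over a region looks like. This is a sketch assuming PAPI's high-level
+PAPI_start_counters/PAPI_stop_counters interface of this era; error handling
+is mostly elided:
+
+#include <papi.h>
+#include <stdio.h>
+
+void countRegion(void) {
+    int       events[2] = { PAPI_L2_TCM, PAPI_L1_ICM }; /* 2 hw counters */
+    long_long values[2];
+
+    if (PAPI_start_counters(events, 2) != PAPI_OK)
+        return; /* counters unavailable */
+
+    /* ... region of interest ... */
+
+    if (PAPI_stop_counters(values, 2) == PAPI_OK)
+        printf("L2 misses: %lld  L1 icache misses: %lld\n",
+               values[0], values[1]);
+}
+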
+So, in the short term, we have three outstanding problems. First, what kinds of
+metrics would we want to apply to Apache/POV-Ray/the simple example? Second, of
+those metrics, which can we actually realize with the current system? Third, we
+should create the simple program (such as sorting large amounts of data from a
+file, etc.) so that it can use these metrics in a way that "models" or
+"anticipates" the way they will be used in the bigger applications.
+
+This is an outstanding issue, and I don't really know where to go with it yet.
+
+More notes about this as of 4 Jun 2003:
+
+One way to obtain metric values that are based on the elapsed times of
+particular functions is to somehow register instrumentation for those particular
+functions, and for a particular region -- Vikram argues that we have the ability
+to do this dynamically and we don't need any markers or phase-1 actions because
+we're operating at function-level granularity.
+
+Here is a sample scenario: We have defined an interval I over some scoped region
+of code. During phase 1 and phase 2, no instrumentation is registered for this
+interval. Later on, we construct a metric that is qualified by a list of
+functions that (for example) are to have their runtimes measured and added to
+some running total. Let's call this the "measure_functions_start" and
+"measure_functions_end" metric, and have it yield a value of type double which
+is the aggregate runtime of the list of functions when they get executed within
+interval I. The metric registration function will have to have some way
+(varargs?) of denoting what the functions are: perhaps it can simply pass in an
+array of function names together with a size.
+
+Example:
+
+pp_registerIntervalInst(intervalID, measure_functions_start,
+ measure_functions_end, &retVal, sizeof(double),
+ func1, func2, func3, ...);
+
+However, what does "measure_functions_start" have to do with anything? More than
+likely, what we need to do is specify a particular metric to apply to a
+particular function, such that the value will be sampled each time that function
+gets executed. Then, since there can be multiple invocations (and hence,
+multiple samples) for this selected function within I, we will have to have some
+default or user-specified way of aggregating the data :(. This is gross. In
+other words, we should probably simplify the above to something like:
+
+pp_registerIntervalInst(intervalID, some_metric_start, some_metric_end,
+ &retVal, sizeof(double), HOW_TO_AGGREGATE, func1);
+
+where func1 is the function to be instrumented and HOW_TO_AGGREGATE is some
+value that selects one of a couple of ways of combining the data. For now,
+HOW_TO_AGGREGATE will not exist, and we will implicitly sum all return
+values...hence, if the some_metric_{start,end} function ptrs above were to
+measure elapsed time, then at the end of the interval I with interval id
+intervalID, retVal would contain the combined elapsed time spent in func1.
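+
+To make the simplified registration call concrete, here is a hypothetical
+sketch of how pp_registerIntervalInst might pull the target function out of
+its varargs. Only the name pp_registerIntervalInst and its leading arguments
+come from the text; the types and the body are invented for illustration:
+
+#include <stdarg.h>
+
+typedef double (*MetricFn)(void); /* shape of some_metric_{start,end} */
+
+void pp_registerIntervalInst(unsigned intervalID,
+                             MetricFn startFn, MetricFn endFn,
+                             void *retVal, unsigned retValSize, ...) {
+    va_list ap;
+    va_start(ap, retValSize);
+    /* func1: the single function to instrument (HOW_TO_AGGREGATE does
+       not exist yet; aggregation is implicitly summation). */
+    void (*target)(void) = va_arg(ap, void (*)(void));
+    va_end(ap);
+
+    (void)target; /* recording (intervalID, startFn, endFn, retVal,
+                     retValSize, target) in the runtime's tables is
+                     omitted in this sketch */
+}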
+
+The measurement is enabled at the start of the interval and disabled at the end:
+clearly, the start-instrumentation site will need to transform func1 (to compute
+the metric value) when crossed, and the end-instrumentation site must remove
+that instrumentation when crossed. This doesn't follow the normal model of what
+occurs at the instrumentation sites, in terms of just ripping down a list of
+functions...or does it? Perhaps one of the functions in the list is just a
+function that performs the transformation on the target function...in this
+case, the process would look something like:
+
+1. Register the transformation function (for the start point) -- call this
+xform_start -- as a regular instrumentation function. The runtime call to do
+the registration will build the appropriate data structures, which will encode
+what metric to associate with the target function, the aggregation method, the
+return value, etc. (see the record sketched after step 3).
+
+2. Register the transformation function (for the end point) -- call this
+xform_end -- as a regular instrumentation function.
+
+3. When xform_start is invoked as a regular instrumentation function, it will
+instrument the target function with the selected instrumentation.
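+
+A guess at the "appropriate data structures" of step 1, with every field name
+invented for illustration:
+
+typedef struct {
+    void   (*target)(void);      /* function to be instrumented */
+    double (*startMetric)(void); /* metric sampled at target entry */
+    double (*endMetric)(void);   /* metric sampled at target exit */
+    void    *retVal;             /* where the aggregate value lands */
+    unsigned retValSize;
+    /* aggregation method: implicit summation for now (see above) */
+} XformRecord;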
+
+This is the hardest step to conceptualize and realize. The problem is that,
+without any placeholders from phase 1, it's not clear that we can instrument the
+target function easily. At first glance, our instrumentation points are the
+start and end points of the function (entry and exit), but this is not quite
+right: they are really the entry point and *all* function exits.
+
+The important question is, what if we have all of the exit points at our
+disposal? Would that change anything?
+
+It would. The entry point together with all exit points would form a set of
+instrumentation points. At each of these instrumentation points, we could
+overwrite the instruction with a branch to a new slot that would call the
+desired instrumentation function, restore the replaced instruction, and return
+to the instrumentation point to continue execution. This would potentially
+work. One major problem that comes to mind is that for system calls (such as
+read()), the body of the function is highly likely to be out of short-jmp range
+of the tracecache-allocated slot. The only way around this would be to create a
+heap region and copy the target function into it, etc. We don't have any code
+to do this yet, so again there is no code reuse or (much) leveraging of
+existing functionality. Additionally, work will have to be done to build the
+data structure that maps the address of a function to its extents (which come
+from the ELF information).
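+
+The short-jmp range concern is mechanical to check. A sketch, assuming the
+slot must be reachable by a SPARC "ba" with its signed 22-bit word
+displacement (the helper name is invented):
+
+#include <stdint.h>
+
+/* Can a SPARC "ba" at instPoint reach slotAddr? The displacement is a
+   signed 22-bit count of 4-byte words, i.e., +/- 8MB in bytes. */
+static int inShortJmpRange(uint64_t instPoint, uint64_t slotAddr) {
+    int64_t byteOff = (int64_t)slotAddr - (int64_t)instPoint;
+    return byteOff >= -(1 << 23) && byteOff < (1 << 23) &&
+           (byteOff & 3) == 0;
+}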
+
+--
+
+The other alternative is to create a "wrapper" function (e.g., for the
+function read()):
+
+#include <unistd.h>   /* for read() */
+
+ssize_t read_wrapper(int fd, void *buf, size_t count) {
+    start_inst();                      /* instrumentation prologue */
+    ssize_t rc = read(fd, buf, count); /* the real call */
+    end_inst();                        /* instrumentation epilogue */
+    return rc;
+}
+
+But this isn't an option, because we cannot locate the calls to read() in order
+to replace them with calls to the wrapper. We could, however, do the following:
+
+3a. Copy the entire target function to a heap region, and instrument it to our
+heart's content. However, finding exit points may not be easy without a CFG,
+etc.
+
+3b. Replace the body of the real target function with a call to the modified
+duplicate of the target function, returning whatever the modified duplicate
+returns.
+
+This works, I think, but is incredibly cumbersome and, contrary to what was
+previously discussed, we do *not* yet possess all of the required mechanisms.
+
+--
+
+Notes on POV-Ray experiment 23 Jun 2003
+
+We propose using a user-defined metric, flops per displayed pixel
+(flops/dpixel), to measure the computational cost per pixel.
+
+1. Compute flops/dpixel and report using pp mechanisms.
+ 1a. flops can be obtained via PAPI, IIRC.
+
+2. Create a moving average of flops/dpixel.
+
+3. When the current flops/dpixel exceeds the moving-average flops/dpixel by
+some multiplicative threshold, a performance assertion (PA) is violated (see
+the sketch after the list below). When the PA is violated, we report other
+metrics (perhaps ranked in some manner) that had been recorded but not
+reported. Initial suggestions for the other metrics to measure _over the same
+region_ (say, the main routine for a single ray-trace) are:
+
+ 1) Moving-average L1 icache misses vs. this-ray L1 icache misses
+ 2) Moving-average L2 cache misses vs. this-ray L2 cache misses
+ 3) Moving-average load count vs. this-ray load count
+ 4) Moving-average store count vs. this-ray store count
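+
+A minimal sketch of the per-ray check in item 3 above; the exponentially
+weighted update and the ALPHA/THRESH constants are assumptions, since nothing
+here pins down how the moving average is formed:
+
+static double mavg = 0.0;          /* moving average of flops/dpixel */
+static const double ALPHA  = 0.1;  /* assumed smoothing factor */
+static const double THRESH = 2.0;  /* assumed multiplicative threshold */
+
+/* Returns nonzero if this ray violates the performance assertion. */
+static int checkRay(double flopsPerPixel) {
+    int violated = (mavg > 0.0) && (flopsPerPixel > THRESH * mavg);
+    mavg = (mavg == 0.0) ? flopsPerPixel
+                         : ALPHA * flopsPerPixel + (1.0 - ALPHA) * mavg;
+    return violated;
+}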
+
+Immediate problem: I don't think we can monitor more than 2 of these values
+concurrently due to hardware limitations on the SPARC...
}}}