[llvm-commits] CVS: llvm/lib/Reoptimizer/Inst/lib/design.txt
Joel Stanley
jstanley at cs.uiuc.edu
Tue Jun 24 10:36:00 PDT 2003
Changes in directory llvm/lib/Reoptimizer/Inst/lib:
design.txt updated: 1.15 -> 1.15.2.1
---
Log message:
---
Diffs of the changes:
Index: llvm/lib/Reoptimizer/Inst/lib/design.txt
diff -u llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15 llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15.2.1
--- llvm/lib/Reoptimizer/Inst/lib/design.txt:1.15 Sun May 18 12:45:26 2003
+++ llvm/lib/Reoptimizer/Inst/lib/design.txt Tue Jun 24 10:35:06 2003
@@ -880,91 +880,70 @@
{{{ MILESTONES
+ - Experiments
+ - The Paper
+ - The Thesis
}}}
{{{ TODO
- - Move statically-sized spill regions so that they are internal to SparcInstManip.
- (do not need variable-sized spill region except for phase5 invocations)
+ In priority order:
- - Start table-of-stacks implementation for phase4 authorship of phase 5 slots.
- - Placed on hold temporarily because of the "alloca-finding" approach. However, see the
- following e-mail for the current state of things:
-
- {{{ E-mail regarding alloca-finding and table-of-stacks approach
-Okay, this is starting to seem intractable. I have another problem that
-I don't think can be resolved without resorting to a custom-stack
-mechanism that will incur prohibitive overhead.
-
-Everything is working for start-region instrumentation sites. For
-end-region instrumentation sites, however, there's a problem. In order
-to write the slot for end sites, I have to know (or know how to compute)
-the address of the return value of the corresponding start site. I had
-originally thought that I would just store this in the GBT, or
-"something", but I clearly didn't think through the problem well enough.
-
-There are only two ways I can think of to do this:
-
-(a) Write the effective address of the return value of the start inst
-func, so that it gets passed to the end inst func.
-
-or
-
-(b) Somehow encode the stack offset to the return value from the start
-inst, where the offset is from the %sp *at the end-region site*
-
-Both of these have problems.
-
-First, I don't think (b) can work at all, given that there may be
-allocas present in the original application that would change the %sp,
-and thus the offset value that we'd need, and we can't statically
-determine exactly which allocas are executed.
-
-For (a), the effective address isn't known until runtime. We can store
-this address in some global table where the phase 4 invocation for the
-end site can find it, but it is not sufficient to have a single scalar
-address here -- we must have a stack, due to potential recursive
-invocations. I think that this is clear; please let me know if I'm not
-making sense. :)
-
-Hence, we'd need to maintain a stack of effective addresses, which was
-pushed during the execution of phase 5 for the start site, and then read
-and popped during the execution of phase 5 for the end site. We're
-already really bloated with how many instructions we've got going on for
-all of the spills, etc, and I'm concerned about the effect that this
-stack manipulation will have on our overhead, as we talked about before.
-
-The way I see it, we only have two options if we're to make forward
-progress and not obliterate our chances of having lower overhead
-numbers (though hopefully better choices will turn up). In the interests of
-short-term forward progress, I'm going to go with #1 for now.
-
-#1 - Make another common-case assumption that there will be no allocas
-between start and end sites, on *any* control path. If this is the case,
-then we know that the stack pointer will not have been manipulated (I
-think) between the start and end sites, and so the %sp offsets to the
-requisite data will be unchanged from when the phase 5 step occurred
-for the start site.
-
-#2 - Just implement our fall-back solution that everything seems to be
-pointing to. I'm not sure exactly what other logistic nightmares might
-be entailed in this, though, because I've only a sketch of the idea.
-
-I wanted to point out, also, that the so-called "fall back" approach we
-discussed previously also involves manipulation of a stack at runtime
-(push/pop actions still have to occur at runtime), so perhaps the stack
-of effective addresses is less prohibitive than I thought, if only in
-the sense that we cannot avoid it. :(
- }}}
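+
+ As a concretization of the stack-of-effective-addresses idea from the e-mail
+ above, here is a minimal C sketch of the per-site stack that phase 5 at a
+ start-region site would push and phase 5 at the matching end-region site
+ would pop. All names and the fixed depth bound are illustrative assumptions,
+ not part of the actual runtime:
+
+#include <assert.h>
+
+#define MAX_DEPTH 256 /* assumed bound on recursive invocations */
+
+typedef struct {
+    void    *addrs[MAX_DEPTH]; /* effective addrs of start-site retvals */
+    unsigned top;
+} SiteAddrStack;
+
+/* Pushed by the phase 5 slot at the start-region site. */
+static void pushStartAddr(SiteAddrStack *s, void *ea) {
+    assert(s->top < MAX_DEPTH && "recursion deeper than assumed bound");
+    s->addrs[s->top++] = ea;
+}
+
+/* Read and popped by the phase 5 slot at the end-region site. */
+static void *popStartAddr(SiteAddrStack *s) {
+    assert(s->top > 0 && "end site crossed without a matching start");
+    return s->addrs[--s->top];
+}
+
+ The push/pop pair above is exactly the per-crossing runtime cost the e-mail
+ worries about.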
-
- - Write phase 5 stuff for end-region sites -- will assume that no allocas lie between
- the start and end sites, which is not a particularly fair assumption.
-
- - Optimizations:
- - No need to save registers (other than those clobbered) in phase 3 slot, since phase 3
- is invoked at the start of the function. Must still spill/restore shared, though.
- - No need to save registers (other than those clobbered) in general.
+ 1) New implementation for instrumenting at function-level granularity
+ 2) Apache through LLVM, experiments
+ 3) Writing, writing, writing: ICS version 2 paper, do
+ a) outline
+ b) intro
+ c) language section
+ d) compiler section
+
+ The three top-level items above are more-or-less interchangeable. However, the
+ experiments cannot be completed until the "instrumentation @ function-level
+ granularity" implementation is done, and there should be an emphasis on getting
+ Apache through LLVM in the short term because Chris is leaving for Norway in
+ early July.
+
+ However, *all* of the writing (including experimentation sections) for ICS v2 must be
+ done by the end of June, so that Adve can approve it, make the desired corrections,
+ etc., and I can get started on thesis authorship and submission. Realistically,
+ the timeframe should look something like this:
+
+ Week of 6/10 (5 days): Implementation, Apache w/ bug reports
+ Week of 6/16 (6 days): Implementation, Small example, continue Apache & do tests
+
+ -- At this point, all experiments should be more-or-less completed --
+
+ Week of 6/23 (6 days): Write, write, write. 3 days nonstop for content, 3
+ days of Adve making corrections, etc.
+
+ Monday, 6/30 is D-Day...
+
+ Schedule revision 13 Jun 2003: Apache is in stasis waiting to hear back from
+ Chris. Some effort might be expended to see if any more work can be done on compiling
+ Apache even though the build process currently fails altogether. Implementation for
+ instrumentation at function-level granularity is in stasis until I hear back from
+ Vikram. This leaves three immediate options open until either of the two is
+ resolved, at which point forward progress on either the implementation or Apache
+ should be made.
+
+ Option 1) POV-Ray through LLVM.
+ Option 2) Writing
+ Option 3) "Small example"
+
+ Option 3 really shouldn't be undertaken until we know if obtaining things like I/O
+ elapsed time (via function-level) is possible, or until I can talk with Vikram in our
+ next meeting about what the heck this nebulous example should look like. This
+ leaves only options 1 and 2 above as viable.
+
+ POV-Ray should be an easy thing to start, and would be good both as a fall-back
+ if Apache isn't possible and as a useful additional example if it is. This
+ would also give V & C some time to respond to the pending queries.
+
+ - Optimizations:
+   - No need to save registers (other than those clobbered) in phase 3 slot,
+     since phase 3 is invoked at the start of the function. Must still
+     spill/restore shared, though.
+   - No need to save registers (other than those clobbered) in general.
}}}
@@ -1460,5 +1439,226 @@
Also, Chris remarked that any novel page-management mechanisms (for the
code that is jumped to in newly-allocated pages) that I devise should perhaps be
integrated into the LLVM JIT if they are suitable.
+
+}}}
+
+{{{ Experiments
+
+We must devise experiments that demonstrate the novel aspects of the work. We
+are currently planning on using Apache and/or POV-Ray, and demonstrating how a
+"deep" performance analysis can be encoded using performance primitives. A
+"deep" performance analysis is one which essentially (using Hollingsworth's
+terminology from his PhD thesis) gets at the "why, where, and when" aspects of
+performance bottlenecks. However, instead of doing this at the "arbitrary
+program during execution" level, as the W^3 search model does, we will encode
+these performance aspects at the application level itself.
+
+Here's what Vikram suggested for a good start:
+
+ I just thought that the examples of performance issues he explores in his
+ automatic search would give you (a) some insights into what performance issues
+ make sense to consider, and (b) some ideas about how to do a systematic
+ diagnosis, albeit at the application level.
+
+ But isn't detecting the "why, where, and when" of performance bottlenecks
+ pretty closely related to the goal of performance diagnosis? We'd be looking
+ for bottlenecks too, except that we can use application domain information, we
+ can look for bottlenecks at the algorithmic level instead of the general
+ system level, and we can record it permanently in the program as a first class
+ feature of the program.
+
+ Anyway, about your second question: here's a way to do what I was suggesting:
+
+ -- think about 2-3 key performance issues with Apache (or POV-Ray) that you'd
+ want to diagnose, e.g., cache misses, TLB misses, thread overhead (estimating
+ that could be interesting), I/O delays
+
+ -- if those issues make sense with a small sort example, try to diagnose those
+ issues in the small example first; e.g., I think cache misses, TLB misses,
+ and I/O delays would all be issues if you were sorting a huge file of some
+ kind.
+
+ This is purely to give you a small, well-understood code to try out before
+ going to the big ones where it may be difficult to know, when one diagnosis
+ attempt fails, whether it failed because you misunderstood the performance
+ issues or because the guess was wrong or both.
+
+TLB misses aren't an option, because we cannot get at them with PAPI. Cache
+misses are available, so that'd certainly be a good place to start. As for I/O
+delays, I have no idea how we'd measure those. The following is the
+comprehensive list of the low-level metrics that are exposed to us via PAPI (a
+usage sketch follows the list). On our own, we can support simple things like
+elapsed time, load average, etc. I'm not altogether clear on how we'd determine
+how much time (for a particular region) was spent doing I/O-bound activities...
+
+Number of hardware counters: 2
+Name: PAPI_L1_ICM Description: Level 1 instruction cache misses
+Name: PAPI_L2_TCM Description: Level 2 cache misses
+Name: PAPI_CA_SNP Description: Requests for a snoop
+Name: PAPI_CA_INV Description: Requests for cache line invalidation
+Name: PAPI_L1_LDM Description: Level 1 load misses
+Name: PAPI_L1_STM Description: Level 1 store misses
+Name: PAPI_BR_MSP Description: Conditional branch instructions mispredicted
+Name: PAPI_TOT_IIS Description: Instructions issued
+Name: PAPI_TOT_INS Description: Instructions completed
+Name: PAPI_LD_INS Description: Load instructions
+Name: PAPI_SR_INS Description: Store instructions
+Name: PAPI_TOT_CYC Description: Total cycles
+Name: PAPI_IPS Description: Instructions per second
+Name: PAPI_L1_DCR Description: Level 1 data cache reads
+Name: PAPI_L1_DCW Description: Level 1 data cache writes
+Name: PAPI_L1_ICH Description: Level 1 instruction cache hits
+Name: PAPI_L2_ICH Description: Level 2 instruction cache hits
+Name: PAPI_L1_ICA Description: Level 1 instruction cache accesses
+Name: PAPI_L2_TCH Description: Level 2 total cache hits
+Name: PAPI_L2_TCA Description: Level 2 total cache accesses
+
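+As a concrete reference point, here is roughly what reading two of the above
+counters over a region looks like. This is a sketch assuming PAPI's high-level
+PAPI_start_counters/PAPI_stop_counters interface of this era; error handling
+is mostly elided:
+
+#include <papi.h>
+#include <stdio.h>
+
+void countRegion(void) {
+    int       events[2] = { PAPI_L2_TCM, PAPI_L1_ICM }; /* 2 hw counters */
+    long_long values[2];
+
+    if (PAPI_start_counters(events, 2) != PAPI_OK)
+        return; /* counters unavailable */
+
+    /* ... region of interest ... */
+
+    if (PAPI_stop_counters(values, 2) == PAPI_OK)
+        printf("L2 misses: %lld  L1 icache misses: %lld\n",
+               values[0], values[1]);
+}
+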
+So, in the short term, we have three outstanding problems. First, what kinds of
+metrics would we want to apply to Apache/POV-Ray/the simple example? Second, of
+those metrics, which can we actually realize with the current system? Third, we
+should create the simple program (such as sorting large amounts of data from a
+file, etc.) so that it can use these metrics in a way that "models" or
+"anticipates" the way they will be used in the bigger applications.
+
+This is an outstanding issue, and I don't really know where to go with it yet.
+
+More notes about this as of 4 Jun 2003:
+
+One way to obtain metric values that are based on the elapsed times of
+particular functions is to somehow register instrumentation for those particular
+functions, and for a particular region -- Vikram argues that we have the ability
+to do this dynamically and we don't need any markers or phase-1 actions because
+we're operating at function-level granularity.
+
+Here is a sample scenario: We have defined an interval I over some scoped region
+of code. During phase 1 and phase 2, no instrumentation is registered for this
+interval. Later on, we construct a metric that is qualified by a list of
+functions that (for example) are to have their runtimes measured and added to
+some running total. Let's call this the "measure_functions_start" and
+"measure_functions_end" metric, and have it yield a value of type double which
+is the aggregate runtime of the list of functions when they get executed within
+interval I. The metric registration function will have to have some way
+(varargs?) of denoting what the functions are: perhaps it can simply pass in an
+array of function names together with a size.
+
+Example:
+
+pp_registerIntervalInst(intervalID, measure_functions_start,
+ measure_functions_end, &retVal, sizeof(double),
+ func1, func2, func3, ...);
+
+However, what does "measure_functions_start" have to do with anything? More than
+likely, what we need to do is specify a particular metric to apply to a
+particular function, such that the value will be sampled each time that function
+gets executed. Then, since there can be multiple invocations (and hence,
+multiple samples) for this selected function within I, we will have to have some
+default or user-specified way of aggregating the data :(. This is gross. In
+other words, we should probably simplify the above to something like:
+
+pp_registerIntervalInst(intervalID, some_metric_start, some_metric_end,
+ &retVal, sizeof(double), HOW_TO_AGGREGATE, func1);
+
+where func1 is the function to be instrumented and HOW_TO_AGGREGATE is some
+value that selects one of a couple of ways of combining the data. For now,
+HOW_TO_AGGREGATE will not exist, and we will implicitly sum all return
+values...hence, if the some_metric_{start,end} function ptrs above were to
+measure elapsed time, then at the end of the interval I with interval id
+intervalID, retVal would contain the combined elapsed time spent in func1.
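+
+To make the simplified registration call concrete, here is a hypothetical
+sketch of how pp_registerIntervalInst might pull the target function out of
+its varargs. Only the name pp_registerIntervalInst and its leading arguments
+come from the text; the types and the body are invented for illustration:
+
+#include <stdarg.h>
+
+typedef double (*MetricFn)(void); /* shape of some_metric_{start,end} */
+
+void pp_registerIntervalInst(unsigned intervalID,
+                             MetricFn startFn, MetricFn endFn,
+                             void *retVal, unsigned retValSize, ...) {
+    va_list ap;
+    va_start(ap, retValSize);
+    /* func1: the single function to instrument (HOW_TO_AGGREGATE does
+       not exist yet; aggregation is implicitly summation). */
+    void (*target)(void) = va_arg(ap, void (*)(void));
+    va_end(ap);
+
+    (void)target; /* recording (intervalID, startFn, endFn, retVal,
+                     retValSize, target) in the runtime's tables is
+                     omitted in this sketch */
+}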
+
+The measurement is enabled at the start of the interval and disabled at the end:
+clearly, the start-instrumentation site will need to transform func1 (to compute
+the metric value) when crossed, and the end-instrumentation site must remove
+that instrumentation when crossed. This doesn't follow the normal model of what
+occurs at the instrumentation sites, in terms of just ripping down a list of
+functions...or does it? Perhaps one of the functions in the list is just a
+function that performs the transformation on the target function...in this
+case, the process would look something like:
+
+1. Register the transformation function (for the start point) -- call this
+xform_start -- as a regular instrumentation function. The runtime call to do
+the registration will build the appropriate data structures, which will encode
+what metric to associate with the target function, the aggregation method, the
+return value, etc. (see the record sketched after step 3).
+
+2. Register the transformation function (for the end point) -- call this
+xform_end -- as a regular instrumentation function.
+
+3. When xform_start is invoked as a regular instrumentation function, it will
+instrument the target function with the selected instrumentation.
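+
+A guess at the "appropriate data structures" of step 1, with every field name
+invented for illustration:
+
+typedef struct {
+    void   (*target)(void);      /* function to be instrumented */
+    double (*startMetric)(void); /* metric sampled at target entry */
+    double (*endMetric)(void);   /* metric sampled at target exit */
+    void    *retVal;             /* where the aggregate value lands */
+    unsigned retValSize;
+    /* aggregation method: implicit summation for now (see above) */
+} XformRecord;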
+
+This is the hardest step to conceptualize and realize. The problem is that,
+without any placeholders from phase 1, it's not clear that we can instrument the
+target function easily. At first glance, our instrumentation points are the
+start and end points of the function (entry and exit), but this is not quite
+right: they are really the entry point and *all* function exits.
+
+The important question is, what if we have all of the exit points at our
+disposal? Would that change anything?
+
+It would. The entry point together with all exit points would form a set of
+instrumentation points. At each of these instrumentation points, we could
+overwrite the instruction with a branch to a new slot that would call the
+desired instrumentation function, restore the replaced instruction, and return
+to the instrumentation point to continue execution. This would potentially
+work. One major problem that comes to mind is that for system calls (such as
+read()), the body of the function is highly likely to be out of short-jmp range
+of the tracecache-allocated slot. The only way around this would be to create a
+heap region and copy the target function into it, etc. We don't have any code
+to do this yet, so again there is no code reuse or (much) leveraging of
+existing functionality. Additionally, work will have to be done to build the
+data structure that maps the address of a function to its extents (which come
+from the ELF information).
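+
+The short-jmp range concern is mechanical to check. A sketch, assuming the
+slot must be reachable by a SPARC "ba" with its signed 22-bit word
+displacement (the helper name is invented):
+
+#include <stdint.h>
+
+/* Can a SPARC "ba" at instPoint reach slotAddr? The displacement is a
+   signed 22-bit count of 4-byte words, i.e., +/- 8MB in bytes. */
+static int inShortJmpRange(uint64_t instPoint, uint64_t slotAddr) {
+    int64_t byteOff = (int64_t)slotAddr - (int64_t)instPoint;
+    return byteOff >= -(1 << 23) && byteOff < (1 << 23) &&
+           (byteOff & 3) == 0;
+}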
+
+--
+
+The other alternative is to create a "wrapper" function (e.g., for the
+function read()):
+
+#include <unistd.h>   /* for read() */
+
+ssize_t read_wrapper(int fd, void *buf, size_t count) {
+    start_inst();                      /* instrumentation prologue */
+    ssize_t rc = read(fd, buf, count); /* the real call */
+    end_inst();                        /* instrumentation epilogue */
+    return rc;
+}
+
+But this isn't an option, because we cannot locate the calls to read() in order
+to replace them with calls to the wrapper. We could, however, do the following:
+
+3a. Copy the entire target function to a heap region, and instrument it to our
+heart's content. However, finding exit points may not be easy without a CFG,
+etc.
+
+3b. Replace the body of the real target function with a call to the modified
+duplicate of the target function, returning whatever the modified duplicate
+returns.
+
+This works, I think, but is incredibly cumbersome and, contrary to what was
+previously discussed, we do *not* yet possess all of the required mechanisms.
+
+--
+
+Notes on POV-Ray experiment 23 Jun 2003
+
+We propose using a user-defined metric, flops per displayed pixel
+(flops/dpixel), to measure the computational cost per pixel.
+
+1. Compute flops/dpixel and report using pp mechanisms.
+ 1a. flops can be obtained via PAPI, IIRC.
+
+2. Create a moving average of flops/dpixel.
+
+3. When the current flops/dpixel exceeds the moving-average flops/dpixel by
+some multiplicative threshold, a performance assertion (PA) is violated (see
+the sketch after the list below). When the PA is violated, we report other
+metrics (perhaps ranked in some manner) that had been recorded but not
+reported. Initial suggestions for the other metrics to measure _over the same
+region_ (say, the main routine for a single ray-trace) are:
+
+ 1) Moving-average L1 icache misses vs. this-ray L1 icache misses
+ 2) Moving-average L2 cache misses vs. this-ray L2 cache misses
+ 3) Moving-average load count vs. this-ray load count
+ 4) Moving-average store count vs. this-ray store count
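+
+A minimal sketch of the per-ray check in item 3 above; the exponentially
+weighted update and the ALPHA/THRESH constants are assumptions, since nothing
+here pins down how the moving average is formed:
+
+static double mavg = 0.0;          /* moving average of flops/dpixel */
+static const double ALPHA  = 0.1;  /* assumed smoothing factor */
+static const double THRESH = 2.0;  /* assumed multiplicative threshold */
+
+/* Returns nonzero if this ray violates the performance assertion. */
+static int checkRay(double flopsPerPixel) {
+    int violated = (mavg > 0.0) && (flopsPerPixel > THRESH * mavg);
+    mavg = (mavg == 0.0) ? flopsPerPixel
+                         : ALPHA * flopsPerPixel + (1.0 - ALPHA) * mavg;
+    return violated;
+}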
+
+Immediate problem: I don't think we can monitor more than 2 of these values
+concurrently due to hardware limitations on the SPARC...
}}}