[RFC] Heuristic for complete loop unrolling

Michael Zolotukhin mzolotukhin at apple.com
Wed Feb 4 18:37:15 PST 2015


> On Feb 4, 2015, at 6:17 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
>> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
>> To: "Hal J. Finkel" <hfinkel at anl.gov>
>> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
>> Sent: Wednesday, February 4, 2015 8:10:02 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> On Feb 4, 2015, at 5:42 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal J. Finkel" < hfinkel at anl.gov >
>> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
>>> 
>> Sent: Wednesday, February 4, 2015 1:29:34 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> On Feb 3, 2015, at 7:28 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal J. Finkel" < hfinkel at anl.gov >
>> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
>> 
>> 
>> 
>> Sent: Tuesday, February 3, 2015 7:28:48 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> Hi Hal,
>> 
>> 
>> These are updated versions of the patches:
>> 
>> I kept visiting only binary operators for now, but added a ‘TODO’
>> there. I’ll address it in later patches, since it looks like 1) it
>> can be done incrementally if we just add instruction visitors one by
>> one, and 2) if we decide to reuse code from inlining (I prefer this
>> option), it’ll be an independent and probably big effort to refactor
>> that code first.
>> 
>> 
>> As for the grammar fixes, everything should be fixed now - thank you
>> for pointing them out! Also, I feel like the variable and option
>> names I used are not ideal either - if you have better ideas for
>> them, please let me know.
>> 
>> 
>> Is it ok to commit the first patch?
>> 
>> Yes, but a few things to address first:
>> 
>> 1. There are a couple of places where you have dyn_cast that should
>> always succeed:
>> 
>> +    NumberOfOptimizedInstructions +=
>> +        TTI.getUserCost(dyn_cast<Instruction>(&I));
>> 
>> +    LoadInst *LI = LoadDescr.first;
>> ...
>> +    NumberOfOptimizedInstructions +=
>> +        TTI.getUserCost(dyn_cast<Instruction>(LI));
>> 
>> +    NumberOfOptimizedInstructions +=
>> +        TTI.getUserCost(dyn_cast<Instruction>(I));
>> 
>> If you need a cast in these places at all (you might for the
>> iterators, but for the LoadInst * you shouldn't), use a cast<>, not
>> a dyn_cast<>.
>> Thanks, fixed! Updated patch attached.
>> 
>> 
>> 
>> 
>> 2. In EstimateNumberOfSimplifiedInsns(unsigned Iteration) we have:
>> 
>> +  while (!Worklist.empty()) {
>> +    Instruction *I = Worklist.pop_back_val();
>> +    if (!visit(I))
>> +      continue;
>> +    for (auto U : I->users()) {
>> +      Instruction *UI = dyn_cast<Instruction>(U);
>> +      if (!UI)
>> +        continue;
>> +      if (!L->contains(UI))
>> +        continue;
>> +      Worklist.push_back(UI);
>> +    }
>> +  }
>> 
>> Worklist is a SmallVector, and so I think you might potentially visit
>> the same users more than once. For example, if you have two loads
>> that can be turned into constants, and then you add the results
>> together, the add will be visited twice. One possible solution is to
>> keep a Visited set, and explicitly avoid visiting the same user
>> twice. To do this, you'll need to make sure that you visit the user
>> graph breadth first (by which I mean such that you know you've
>> visited all relevant operands of an instruction before visiting the
>> instruction itself), not depth first as you currently do.
>> That’s true, we can visit some instructions twice, but I don't think
>> we can easily avoid that. Breadth-first search wouldn’t solve this
>> problem, e.g. in the following case:
>> (1) %a = load 1
>> (2) %b = load 2
>> (3) %use_a1 = add %a, 5
>> (4) %use_a2 = sub %use_a1, 6
>> (5) %common_use = mul %use_a2, %b
>> Starting from loads (1) and (2), BFS will visit (3) and (5) first,
>> and only then (4). Thus, it will visit (5) before one of its
>> operands - (4).
>> 
>> 
>> We can try to use topological order here, but in that case we would
>> visit all users - currently we only visit those that have at least
>> one operand simplified. I’m not sure what the right call is here. I
>> think that in real cases the current algorithm should be fine,
>> though asymptotically it’s worse than using topological order. To
>> play safe, we can add a limitation, so that we’ll give up if the
>> number of performed iterations is too high.
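>> 
>> For instance, something like this (just a sketch; MaxVisitInstructions
>> would be a new, hypothetical cap, not something in the current patch):
>> 
>>   unsigned NumVisited = 0;
>>   while (!Worklist.empty()) {
>>     // Give up on the whole estimate once the walk gets too big;
>>     // returning 0 just means "no provable benefit from unrolling".
>>     if (++NumVisited > MaxVisitInstructions)
>>       return 0;
>>     Instruction *I = Worklist.pop_back_val();
>>     if (!visit(I))
>>       continue;
>>     for (auto U : I->users())
>>       if (auto *UI = dyn_cast<Instruction>(U))
>>         if (L->contains(UI))
>>           Worklist.push_back(UI);
>>   }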
>> 
>> 
>> What do you think?
>> 
>> 
>> I think there is a simpler solution: Keep a Visited set, but don't
>> use it to directly prune the search. Since we only really care about
>> the number of simplified instructions, add simplified instructions
>> to the Visited set, and if the instruction is already in the set,
>> don't increment NumberOfOptimizedInstructions. So we'll visit
>> multiple times (potentially), but we won't over-count. Do you think
>> that will work?
>> Yes, I think that would work. Here is a new patch:
>> 
>> 
> 
> +    if (SimpleV && !CountedInsns.count(&I)) {
> +      NumberOfOptimizedInstructions += TTI.getUserCost(&I);
> +      CountedInsns.insert(&I);
> +    }
> 
> 
> I think this can be:
> 
>    if (SimpleV && CountedInsns.insert(&I).second)
>      NumberOfOptimizedInstructions += TTI.getUserCost(&I);
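> 
> (insert(...).second is true only when the element was actually inserted,
> so each instruction's cost is counted at most once.)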
> 
Fixed and committed in r228265. Thank you for your comments and remarks! I’ll proceed with the second patch shortly.

Michael
> 
> Otherwise, LGTM.
> 
> Thanks!
> 
> -Hal
> 
>> 
>> Thanks,
>> Michael
>> 
>> -Hal
>> 
>> Michael
>> 
>> Otherwise, LGTM. Feel free to fix up those things and commit (or I'll
>> look at it again; that's up to you).
>> 
>> As for the second patch - I implemented a new metric as you
>> suggested (a minimal percent of removed instructions instead of a
>> ‘bonus’ for each removed instruction). However, I think we need to
>> have a high (absolute) threshold as well here: even if we optimize
>> 50% of instructions, we don’t want to unroll a loop with 10^6
>> iterations - we’ll never finish compiling the current routine
>> otherwise.
>> 
>> Agreed.
>> 
>> 
>> 
>> I don’t plan to commit the second part in its current
>> state, but am still posting it to get feedback.
>> 
>> Thanks, I think that the idea is reasonable.
>> 
>> -Hal
>> 
>> Thanks,
>> Michael
>> 
>> On Feb 2, 2015, at 9:38 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal J. Finkel" < hfinkel at anl.gov >
>> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
>> 
>> 
>> 
>> Sent: Monday, February 2, 2015 7:25:30 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> Hi Hal,
>> 
>> Please find a new version attached. I broke it into two parts: the
>> first one implements the new heuristic, and the second adjusts the
>> cost model.
>> 
>> 1. 0001-Implement-new-heuristic-for-complete-loop-unrolling.patch
>> 
>> I added a new class UnrollAnalyzer (like InlineAnalyzer), which
>> estimates the possible optimization effects of complete unrolling.
>> We now simulate inst-simplify by visiting all users of loads that
>> might become constant, and then we simulate DCE, which also might
>> perform significant clean-ups here. The counted number of optimized
>> instructions is then returned for further consideration (in this
>> patch we just add it to the threshold - that’s a conservative way of
>> treating it, since after the mentioned optimizations we should still
>> be under the threshold).
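>> 
>> In simplified pseudo-code, the simulation is roughly the following
>> (EstimateNumberOfDeadInsns is a stand-in name for the DCE half):
>> 
>>   // For each simulated iteration, count what inst-simplify would fold,
>>   for (unsigned It = 0; It != IterationsNumberForEstimate; ++It)
>>     NumberOfOptimizedInstructions += EstimateNumberOfSimplifiedInsns(It);
>>   // then add what a subsequent DCE could delete.
>>   NumberOfOptimizedInstructions += EstimateNumberOfDeadInsns();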
>> 
>> 2. 0002-Use-estimated-number-of-optimized-insns-in-unroll-th.patch
>> 
>> In this patch we use a more aggressive strategy for choosing the
>> threshold. My idea here is that we can go beyond the default
>> threshold if we can significantly optimize the unrolled body, but we
>> should still stay within reasonable limits. Let’s look at examples
>> to illustrate this better:
>> a) DefaultThreshold=150, LoopSize=50, IterationsNumber=5,
>> UnrolledLoopSize=50*5=250,
>> NumberOfPotentiallyOptimizedInstructions=90
>> In this case after unrolling we would get 250 instructions, and after
>> inst-simplify+DCE we’d go down to 160 instructions. Though that’s
>> still bigger than DefaultThreshold, I think we do want to unroll
>> here, since it would speed up this part by ~90/250=36%.
>> 
>> 
>> b) DefaultThreshold=150, LoopSize=1000, IterationsNumber=1000,
>> UnrolledLoopSize=1000*1000,
>> NumberOfPotentiallyOptimizedInstructions=500. The absolute number of
>> optimized instructions in this example is bigger than in the
>> previous one, but we don’t want to unroll here, because the
>> resulting code would be huge, and we’d only save
>> 500/(1000*1000)=0.05% of the instructions.
>> 
>> 
>> To handle both situations, I suggest unrolling only if we can
>> optimize a significant portion of the loop body, i.e. if after
>> unrolling we can remove N% of the code. We might want an absolute
>> upper limit on the final code size too (i.e. even if we optimize 50%
>> of the instructions, don’t unroll a loop with 10^8 iterations), but
>> for now I decided to add only one new parameter to avoid
>> over-complicating things.
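>> 
>> In code, the final check might look roughly like this (a sketch;
>> MinPercentOfOptimizedInstructions and AbsoluteMaxUnrolledSize are
>> illustrative names, and the absolute cap is shown even though the
>> posted patch leaves it out):
>> 
>>   uint64_t UnrolledSize = (uint64_t)LoopSize * TripCount;
>>   // Fraction of the fully unrolled body we expect to optimize away.
>>   unsigned PercentOptimized =
>>       (unsigned)(NumberOfOptimizedInstructions * 100 / UnrolledSize);
>>   bool ShouldUnroll =
>>       UnrolledSize <= Threshold ||
>>       (PercentOptimized >= MinPercentOfOptimizedInstructions &&
>>        UnrolledSize <= AbsoluteMaxUnrolledSize);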
>> 
>> 
>> In the patch I use a weight for each optimized instruction (yep, like
>> in the previous patch, but here it’s more meaningful), which is
>> essentially equivalent to the ratio between the original and removed
>> code size.
>> 
>> I'm not particularly keen on this bonus parameter, but do like the
>> percentage reduction test you've outlined. Can you please implement
>> that instead?
>> 
>> My testing is still in progress, and I’ll probably do some tuning
>> later - the current default value (10) was chosen almost at random.
>> Do these patches look better?
>> 
>> 
>> Yes, much better. A few comments on the first:
>> 
>> + // variable. Now it's time if it corresponds to a global constant global
>> + // (in which case we can eliminate the load), or not.
>> 
>> ... time to see if ...
>> 
>> +// This class is used to get an estimate of optimization effect that we could
>> 
>> optimization effect -> the optimization effects
>> 
>> +// get from complete loop unrolling. It comes from the fact that some loads
>> +// might be replaced with a concrete constant values and that could trigger a
>> +// chain of instruction simplifications.
>> 
>> a concrete -> concrete [remove 'a']
>> 
>> + bool visitInstruction(Instruction &I) { return false; };
>> + bool visitBinaryOperator(BinaryOperator &I) {
>> 
>> Visiting binary operators is good, but we should also visit ICmp,
>> FCmp, GetElementPtr, Trunc, ZExt, SExt, FPTrunc, FPExt, FPToUI,
>> FPToSI, UIToFP, SIToFP, BitCast, Select, ExtractElement,
>> InsertElement, ShuffleVector, ExtractValue, InsertValue. It might be
>> easier to just do this from the visitInstruction callback.
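>> 
>> For the generic case, a fallback might look something like this (an
>> untested sketch; it assumes a hypothetical DenseMap<Value *, Constant *>
>> member named SimplifiedValues recording earlier simplifications, and the
>> exact ConstantFoldInstOperands signature has varied across LLVM versions):
>> 
>>   bool visitInstruction(Instruction &I) {
>>     // If every operand is a constant, or has already been simplified to
>>     // one, try target-independent constant folding of the instruction.
>>     SmallVector<Constant *, 4> Ops;
>>     for (Value *Op : I.operands()) {
>>       Constant *C = dyn_cast<Constant>(Op);
>>       if (!C)
>>         C = SimplifiedValues.lookup(Op);
>>       if (!C)
>>         return false;
>>       Ops.push_back(C);
>>     }
>>     Constant *Folded =
>>         ConstantFoldInstOperands(I.getOpcode(), I.getType(), Ops, DL);
>>     if (!Folded)
>>       return false;
>>     SimplifiedValues[&I] = Folded;
>>     return true;
>>   }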
>> 
>> + unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
>> + unsigned Start = StartC.getLimitedValue();
>> + unsigned Step = StepC.getLimitedValue();
>> + unsigned Index = (Start + Step * Iteration) / ElemSize;
>> + Constant *CV = CDS->getElementAsConstant(Index);
>> 
>> I think you need to guard here against out-of-bounds accesses (we
>> should simply return nullptr, and not assert or segfault from the
>> OOB access to the CDS).
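>> 
>> I.e., something like:
>> 
>>   // Refuse to simplify an out-of-bounds index instead of asserting.
>>   if (Index >= CDS->getNumElements())
>>     return nullptr;
>>   Constant *CV = CDS->getElementAsConstant(Index);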
>> 
>> + // Visit all loads the loop L, and for those that after complete loop
>> + // unrolling would have a constant address and it will point to from a known
>> 
>> ... that, after complete loop unrolling, would ... [add commas]
>> 
>> also
>> 
>> it will point to from a known -> it will point to a known [remove
>> 'from']
>> 
>> +// This routine estimates this optimization effect and return number of
>> +// instructions, that potentially might be optimized away.
>> 
>> return number -> returns the number
>> 
>> + // iterations here. To limit ourselves here, we check only first 1000
>> + // iterations, and then scale the found number, if necessary.
>> + unsigned IterationsNumberForEstimate = std::min(1000u, TripCount);
>> 
>> Don't embed 1000 here; make this a cl::opt.
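>> 
>> E.g. (the option name here is just an illustration):
>> 
>>   static cl::opt<unsigned> MaxIterationsToEstimate(
>>       "unroll-max-iterations-to-estimate", cl::init(1000), cl::Hidden,
>>       cl::desc("Maximum number of iterations to simulate when estimating "
>>                "the benefit of complete loop unrolling"));
>> 
>>   // ...and at the use site:
>>   unsigned IterationsNumberForEstimate =
>>       std::min<unsigned>(MaxIterationsToEstimate, TripCount);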
>> 
>> With these changes, the first patch LGTM.
>> 
>> Thanks again,
>> Hal
>> 
>> Thanks,
>> Michael
>> 
>> On Jan 26, 2015, at 11:50 AM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
>> 
>> On Jan 25, 2015, at 6:06 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal Finkel" < hfinkel at anl.gov >
>> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
>> Messages and Patches for LLVM"
>> < llvm-commits at cs.uiuc.edu >
>> Sent: Sunday, January 25, 2015 1:20:18 AM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> On Jan 24, 2015, at 12:38 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal Finkel" < hfinkel at anl.gov >
>> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
>> Messages and Patches for LLVM"
>> < llvm-commits at cs.uiuc.edu >
>> Sent: Saturday, January 24, 2015 2:26:03 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> Hi Hal,
>> 
>> 
>> Thanks for the review! Please see my comments inline.
>> 
>> 
>> 
>> 
>> On Jan 24, 2015, at 6:44 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>> [moving patch review to llvm-commits]
>> 
>> +static bool CanEliminateLoadFrom(Value *V) {
>> +  if (Constant *C = dyn_cast<Constant>(V)) {
>> +    if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))
>> +      if (GV->isConstant() && GV->hasDefinitiveInitializer())
>> 
>> 
>> Why are you casting to a Constant, and then to a GlobalVariable, and
>> then checking GV->isConstant()? There seems to be unnecessary
>> redundancy here ;)
>> Indeed:)
>> 
>> +        return GV->getInitializer();
>> +  }
>> +  return false;
>> +}
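>> 
>> (Without the redundant cast chain, the whole check could collapse to
>> something like this - a sketch:)
>> 
>>   static bool CanEliminateLoadFrom(Value *V) {
>>     // A load can be folded when its address is a constant global with a
>>     // known, immutable initializer.
>>     if (auto *GV = dyn_cast<GlobalVariable>(V))
>>       return GV->isConstant() && GV->hasDefinitiveInitializer();
>>     return false;
>>   }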
>> 
>> +static unsigned ApproximateNumberOfEliminatedInstruction(const Loop *L,
>> +                                                         ScalarEvolution &SE) {
>> 
>> This function seems mis-named. For one thing, it is not counting the
>> number of instructions potentially eliminated, but the number of
>> loads. Eliminating the loads might have led to further constant
>> propagation, and really calculating the number of eliminated
>> instructions would require estimating that effect, right?
>> That’s right. But I’d rather change the function name than add such
>> calculations, since they’d look very narrowly targeted, and I worry
>> that we might start to lose some cases as well. How about
>> ‘NumberOfConstantFoldedLoads’?
>> 
>> 
>> Sounds good to me.
>> 
>>   if (TripCount && Count == TripCount) {
>> -    if (Threshold != NoThreshold && UnrolledSize > Threshold) {
>> +    if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 * ElimInsns) {
>> 
>> 
>> 20, huh? Is this a heuristic for constant propagation? It feels like
>> we should be able to do better than this.
>> Yep, that’s a parameter of the heuristic. Counting each ‘constant'
>> load as 1 is too conservative and doesn’t give much here,
>> 
>> Understood.
>> 
>> 
>> 
>> but since
>> we don’t actually count the number of eliminated instructions, we
>> need some estimate for it. This estimate is really rough, since in
>> some cases we can eliminate the entire loop body, while in others we
>> can’t eliminate anything.
>> 
>> I think that we might as well estimate it. Once we know the loads
>> that unrolling will constant fold, put them into a SmallPtrSet S.
>> Then, use a worklist-based iteration over the uses of instructions in
>> S. For each use that has operands that are all either constants or
>> in S, queue it and keep going. Make sure you walk breadth first
>> (not depth first) so that you capture things with multiple feeder
>> loads properly. As you do this, count the number of instructions
>> that will be constant folded (keeping a Visited set in the usual way
>> so you don't over-count). This will be an estimate, but should be a
>> very accurate one, and can be done without any heuristic parameters
>> (and you still only visit a linear number of instructions, so it
>> should not be too expensive).
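>> 
>> In rough code, the estimate could look like this (an untested sketch;
>> S doubles as the Visited set here):
>> 
>>   SmallPtrSet<Instruction *, 16> S; // seeded with the foldable loads
>>   SmallVector<Instruction *, 16> Worklist(S.begin(), S.end());
>>   unsigned NumFolded = S.size();
>>   for (unsigned Idx = 0; Idx != Worklist.size(); ++Idx) { // breadth first
>>     Instruction *I = Worklist[Idx];
>>     for (User *U : I->users()) {
>>       auto *UI = dyn_cast<Instruction>(U);
>>       if (!UI || S.count(UI))
>>         continue;
>>       // UI folds iff all its operands are constants or fold themselves.
>>       bool AllOpsKnown = true;
>>       for (Value *Op : UI->operands()) {
>>         auto *OpI = dyn_cast<Instruction>(Op);
>>         if (!isa<Constant>(Op) && !(OpI && S.count(OpI))) {
>>           AllOpsKnown = false;
>>           break;
>>         }
>>       }
>>       if (AllOpsKnown && S.insert(UI).second) {
>>         ++NumFolded;
>>         Worklist.push_back(UI);
>>       }
>>     }
>>   }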
>> The biggest gain comes not from expressions like buf[0]*buf[3], which
>> can be constant folded when buf[0] and buf[3] are substituted with
>> constants, but from x*buf[0], when buf[0] turns out to be 0 or 1.
>> That is, we don’t constant-fold it, but we simplify the expression
>> based on the particular value. The original convolution test is in
>> some sense equivalent to the following code:
>> const int a[] = {0, 1, 5, 1, 0};
>> int *b;
>> for (i = 0; i < 100; i++) {
>>   for (j = 0; j < 5; j++)
>>     b[i] += b[i+j]*a[j];
>> }
>> 
>> 
>> Complete unrolling will give us:
>> 
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i]*a[0];
>>   b[i] += b[i+1]*a[1];
>>   b[i] += b[i+2]*a[2];
>>   b[i] += b[i+3]*a[3];
>>   b[i] += b[i+4]*a[4];
>> }
>> 
>> 
>> After const-prop we’ll get:
>> 
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i]*0;
>>   b[i] += b[i+1]*1;
>>   b[i] += b[i+2]*5;
>>   b[i] += b[i+3]*1;
>>   b[i] += b[i+4]*0;
>> }
>> 
>> And after simplification:
>> 
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i+1];
>>   b[i] += b[i+2]*5;
>>   b[i] += b[i+3];
>> }
>> 
>> 
>> As you can see, we actually folded nothing here, but rather we
>> simplified more than half of the instructions. But it’s hard to
>> predict which optimizations will be enabled by the exact values of
>> the replaced loads
>> 
>> Agreed.
>> 
>> - i.e. of course we can check it, but it would be too
>> optimization-specific in my opinion.
>> 
>> I understand your point, but I think using one magic number is
>> swinging the pendulum too far in the other direction.
>> 
>> 
>> 
>> Thus, I decided to use some
>> number (20 in the current patch) which represents the average
>> profitability of replacing a load with a constant. I think I should
>> get rid of this magic number though, and replace it with some
>> target-specific parameter (that’ll help to address Owen’s
>> suggestions).
>> 
>> Using a target-specific cost (or costs) is a good idea, but having
>> target-specific magic numbers is going to be a mess. Different loops
>> will unroll, or not, for fairly arbitrary reasons, on different
>> targets. I understand that this is a heuristic, and you won't be
>> able to exactly predict all possible later simplifications enabled
>> by the unrolling. However, the fact that you currently need a large
>> per-instruction boost factor, like 20, means, I think, that the
>> model is too coarse.
>> 
>> Maybe it would not be unreasonable to start from the other side: If
>> we first located loads that could be constant folded, and determined
>> the values those loads would actually take, and then simplified the
>> loop instructions based on those values, we'd get a pretty good
>> estimate. This is essentially what the code in
>> lib/Analysis/IPA/InlineCost.cpp does when computing the inlining
>> cost savings, and I think we could use the same technique here.
>> Admittedly, the inline cost analysis is relatively expensive, but
>> I'd think we could impose a reasonable size cutoff to limit the
>> overall expense for the case of full unrolling -- for this, maybe
>> the boost factor of 20 is appropriate -- and would also address
>> Owen's point, as far as I can tell, because like the inline cost
>> analysis, we can use TTI to compute the target-specific cost of
>> GEPs, etc. Ideally, we could refactor the ICA a little bit, and
>> re-use the code in there directly for this purpose. This way we can
>> limit the number of places in which we compute similar kinds of
>> heuristic simplification costs.
>> Hi Hal,
>> 
>> 
>> Thanks for the comments. I think that’s a good idea, and I’ll try
>> it out. If that doesn’t lead to an over-complicated implementation,
>> I’d like it much better than having a magic number. I’ll prepare a
>> new patch soon.
>> 
>> 
>> Thanks,
>> Michael
>> 
>> -Hal
>> 
>> Thanks,
>> Michael
>> 
>> -Hal
>> 
>> I’d really appreciate feedback on how to model this in a better
>> way.
>> 
>> 
>> Thanks,
>> Michael
>> 
>> 
>> 
>> 
>> Thanks again,
>> Hal
>> 
>> ----- Original Message -----
>> 
>> 
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "LLVM Developers Mailing List ( llvmdev at cs.uiuc.edu )" <
>> llvmdev at cs.uiuc.edu >
>> Cc: "Hal J. Finkel" < hfinkel at anl.gov >, "Arnold Schwaighofer" <
>> aschwaighofer at apple.com >
>> Sent: Friday, January 23, 2015 2:05:11 PM
>> Subject: [RFC] Heuristic for complete loop unrolling
>> 
>> 
>> 
>> Hi devs,
>> 
>> Recently I came across an interesting testcase that LLVM failed to
>> optimize well. The test does some image processing, and as part of
>> it, it traverses all the pixels and computes some value based on
>> the adjacent pixels. So, the hot part looks like this:
>> 
>> for (y = 0..height) {
>>   for (x = 0..width) {
>>     val = 0
>>     for (j = 0..5) {
>>       for (i = 0..5) {
>>         val += img[x+i, y+j] * weight[i, j]
>>       }
>>     }
>>   }
>> }
>> 
>> And ‘weight' is just a constant matrix with some coefficients.
>> 
>> If we unroll the two internal loops (with trip count 5), then we can
>> replace weight[i,j] with concrete constant values. In this
>> particular case, many of the coefficients are actually 0 or 1, which
>> enables huge code simplifications later on. But currently we unroll
>> only the innermost one, because unrolling both of them would exceed
>> the threshold.
>> 
>> When deciding whether to unroll or not, we currently look only at the
>> instruction count of the loop. My proposal is to, on top of that,
>> check whether unrolling enables any later optimizations - in this
>> case, replacing a load with a constant. Similar to what we do in the
>> inlining heuristics, we can estimate how many instructions would
>> potentially be eliminated after unrolling and adjust our threshold
>> with this value.
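>> 
>> Concretely, in the attached patch the threshold check becomes
>> something like this (simplified; the surrounding function is elided):
>> 
>>   // Allow the unrolled size to exceed the threshold by a bonus
>>   // proportional to the number of loads we expect to constant-fold
>>   // (the "+ 20 * ElimInsns" term quoted earlier in the thread).
>>   if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 * ElimInsns)
>>     return false; // treat unrolling as too expensive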
>> 
>> I can imagine that it might also be useful for computations
>> involving sparse constant matrices (like the identity matrix).
>> 
>> The attached patch implements this feature, and with it we handle the
>> original testcase well.
>> 
>> 
>> 
>> 
>> 
>> Does it look good? Of course, any ideas, suggestions and other
>> feedback are welcome!
>> 
>> 
>> Thanks,
>> Michael
>> 
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory


