[RFC] Heuristic for complete loop unrolling

Hal Finkel hfinkel at anl.gov
Wed Feb 4 18:17:28 PST 2015


----- Original Message -----
> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
> To: "Hal J. Finkel" <hfinkel at anl.gov>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Wednesday, February 4, 2015 8:10:02 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> On Feb 4, 2015, at 5:42 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal J. Finkel" < hfinkel at anl.gov >
> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
> >
> Sent: Wednesday, February 4, 2015 1:29:34 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> On Feb 3, 2015, at 7:28 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal J. Finkel" < hfinkel at anl.gov >
> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
> 
> 
> 
> Sent: Tuesday, February 3, 2015 7:28:48 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> 
> These are updated versions of the patches:
> 
> I kept visiting only binary operators for now, but added a ‘TODO’
> there. I’ll address it in later patches, since it looks like 1) it
> can be done incrementally if we just add insn-visitors one by one,
> and 2) if we decide to reuse code from inlining (I prefer this
> option), then it’ll be an independent and probably big refactoring
> effort to do first.
> 
> 
> As for the grammar fixes, everything should be fixed now; thank you
> for pointing them out! Also, I feel the variable and option names
> that I used are not ideal - if you have better ideas for them,
> please let me know.
> 
> 
> Is it ok to commit the first patch?
> 
> Yes, but a few things to address first:
> 
> 1. There are a couple of places where you have a dyn_cast<> that
> should always succeed:
> 
> +    NumberOfOptimizedInstructions +=
> +        TTI.getUserCost(dyn_cast<Instruction>(&I));
>
> +    LoadInst *LI = LoadDescr.first;
> ...
> +    NumberOfOptimizedInstructions +=
> +        TTI.getUserCost(dyn_cast<Instruction>(LI));
>
> +    NumberOfOptimizedInstructions +=
> +        TTI.getUserCost(dyn_cast<Instruction>(I));
> 
> If you need a cast in these places at all (you might for the
> iterators, but for the LoadInst * you shouldn't), use a cast<>, not
> a dyn_cast<>.
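> For illustration (this exact line is mine, not from the patch):
> since LoadInst derives from Instruction, the LoadInst * case needs
> no cast at all:
>
>   NumberOfOptimizedInstructions += TTI.getUserCost(LI);
>
> and where a cast really is needed, cast<Instruction>(...) asserts on
> a mismatch instead of silently returning null like dyn_cast<>.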
> Thanks, fixed! Updated patch attached.
> 
> 
> 
> 
> 2. In EstimateNumberOfSimplifiedInsns(unsigned Iteration) we have:
> 
> +  while (!Worklist.empty()) {
> +    Instruction *I = Worklist.pop_back_val();
> +    if (!visit(I))
> +      continue;
> +    for (auto U : I->users()) {
> +      Instruction *UI = dyn_cast<Instruction>(U);
> +      if (!UI)
> +        continue;
> +      if (!L->contains(UI))
> +        continue;
> +      Worklist.push_back(UI);
> +    }
> +  }
> 
> Worklist is a SmallVector, and so I think you might potentially visit
> the same users more than once. For example, if you have two loads
> that can be turned into constants, and then you add the results
> together, the add will be visited twice. One possible solution is to
> keep a Visited set, and explicitly avoid visiting the same user
> twice. To do this, you'll need to make sure that you visit the user
> graph breadth first (by which I mean such that you know you've
> visited all relevant operands of an instruction before visiting the
> instruction itself), not depth first as you currently do.
> That’s true, we can visit some instructions twice, but I don't think
> we can easily avoid that. Breadth-first search wouldn’t solve this
> problem either, e.g. in the following case:
>
>   (1) %a = load 1
>   (2) %b = load 2
>   (3) %use_a1 = add %a, 5
>   (4) %use_a2 = sub %use_a1, 6
>   (5) %common_use = mul %use_a2, %b
>
> Starting from loads (1) and (2), BFS will visit (3) and (5) first,
> and only then (4). Thus, it will visit (5) before one of its
> operands, (4).
> 
> 
> We can try to use topological order here, but in that case we would
> visit all users - currently we only visit those that have at least
> one operand simplified. I’m not sure what the right call is here. I
> think that in real cases the current algorithm should be fine,
> though asymptotically it’s worse than using topological order. To
> play it safe, we could add a limit, so that we give up if the number
> of performed iterations gets too high.
> 
> 
> What do you think?
> 
> 
> I think there is a simpler solution: keep a Visited set, but don't
> use it to directly prune the search. Since we only really care about
> the number of simplified instructions, add simplified instructions
> to the Visited set, and if the instruction is already in the set,
> don't increment NumberOfOptimizedInstructions. So we'll visit
> multiple times (potentially), but we won't over-count. Do you think
> that will work?
> Yes, I think that would work. Here is a new patch:
> 
> 

+    if (SimpleV && !CountedInsns.count(&I)) {
+      NumberOfOptimizedInstructions += TTI.getUserCost(&I);
+      CountedInsns.insert(&I);
+    }


I think this can be (insert() returns whether the element was newly
added, so the count/insert pair collapses into a single lookup):

    if (SimpleV && CountedInsns.insert(&I).second)
      NumberOfOptimizedInstructions += TTI.getUserCost(&I);


Otherwise, LGTM.

Thanks!

 -Hal

> 
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> Michael
> 
> 
> 
> 
> 
> 
> Otherwise, LGTM. Feel free to fix up those things and commit (or I'll
> look at it again; that's up to you).
> 
> 
> 
> 
> 
> As for the second patch - I implemented a new metric as you
> suggested (a minimal percentage of removed instructions instead of a
> ‘bonus’ for each removed instruction). However, I think we need to
> have a high (absolute) threshold here as well: even if we optimize
> 50% of the instructions, we don’t want to unroll a loop with 10^6
> iterations - we’d never finish compiling the current routine
> otherwise.
> 
> Agreed.
> 
> 
> 
> I don’t plan to commit the second part in its current
> state, but I’m still posting it to get feedback.
> 
> Thanks, I think that the idea is reasonable.
> 
> -Hal
> 
> 
> 
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> On Feb 2, 2015, at 9:38 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal J. Finkel" < hfinkel at anl.gov >
> Cc: "Commit Messages and Patches for LLVM" < llvm-commits at cs.uiuc.edu
> 
> 
> 
> Sent: Monday, February 2, 2015 7:25:30 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> Please find a new version attached. I broke it into two parts: the
> first is the implementation of the new heuristic, and the second
> adjusts the cost model.
> 
> 1. 0001-Implement-new-heuristic-for-complete-loop-unrolling.patch
> 
> I added a new class UnrollAnalyzer (like InlineAnalyzer) that
> estimates the possible optimization effects of complete unrolling.
> We simulate inst-simplify by visiting all users of loads that might
> become constant, and then we simulate DCE, which also might perform
> significant clean-ups here. The counted number of optimized
> instructions is then returned for further consideration (in this
> patch we just add it to the threshold - that’s a conservative way of
> treating it, since after the mentioned optimizations we should still
> end up under the threshold).
> 
> 2. 0002-Use-estimated-number-of-optimized-insns-in-unroll-th.patch
> 
> In this patch we use a more aggressive strategy for choosing the
> threshold. My idea here is that we can go beyond the default
> threshold if we can significantly optimize the unrolled body, but we
> should still stay within reasonable limits. Two examples illustrate
> this:
> a) DefaultThreshold=150, LoopSize=50, IterationsNumber=5,
> UnrolledLoopSize=50*5=250,
> NumberOfPotentiallyOptimizedInstructions=90.
> In this case unrolling would give us 250 instructions, and
> inst-simplify+DCE would take us down to 160. Though that’s still
> bigger than DefaultThreshold, I think we do want to unroll here,
> since it would speed this part up by ~90/250=36%.
> 
> 
> b) DefaultThreshold=150, LoopSize=1000, IterationsNumber=1000,
> UnrolledLoopSize=1000*1000,
> NumberOfPotentiallyOptimizedInstructions=500.
> The absolute number of optimized instructions in this example is
> bigger than in the previous one, but we don’t want to unroll here,
> because the resulting code would be huge, and we’d only save
> 500/(1000*1000)=0.05% of the instructions.
> 
> 
> To handle both situations, I suggest unrolling only if we can
> optimize a significant portion of the loop body, i.e. if after
> unrolling we can remove N% of the code. We might want an absolute
> upper limit on the final code size too (i.e. even if we optimize 50%
> of the instructions, don’t unroll a loop with 10^8 iterations), but
> for now I decided to add only one new parameter to avoid
> over-complicating things. A rough sketch of the check is below.
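> A minimal sketch of what I mean (all names here are illustrative,
> not the patch’s actual identifiers; MinPercentOfOptimized stands in
> for the new parameter):
>
>   // Unroll past the default threshold only when the estimated
>   // simplifications remove at least MinPercentOfOptimized percent
>   // of the unrolled body.
>   uint64_t UnrolledSize = (uint64_t)LoopSize * TripCount;
>   if (UnrolledSize <= Threshold)
>     return true; // small enough to unroll unconditionally
>   return NumberOfOptimizedInstructions * 100 >=
>          UnrolledSize * MinPercentOfOptimized;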
> 
> 
> In the patch I use a weight for each optimized instruction (yep,
> like in the previous patch, but here it’s more meaningful), which is
> essentially equivalent to a ratio between the original and removed
> code sizes.
> 
> I'm not particularly keen on this bonus parameter, but I do like the
> percentage reduction test you've outlined. Can you please implement
> that instead?
> 
> 
> 
> 
> 
> My testing is still in progress and I’ll probably do some tuning
> later; the current default value (10) was chosen almost at random.
> 
> Do these patches look better?
> 
> 
> Yes, much better. A few comments on the first:
> 
> + // variable. Now it's time if it corresponds to a global constant
> global
> + // (in which case we can eliminate the load), or not.
> 
> ... time to see if ...
> 
> +// This class is used to get an estimate of optimization effect that we could
> 
> optimization effect -> the optimization effects
> 
> +// get from complete loop unrolling. It comes from the fact that some loads
> +// might be replaced with a concrete constant values and that could trigger a
> +// chain of instruction simplifications.
> 
> a concrete -> concrete [remove 'a']
> 
> + bool visitInstruction(Instruction &I) { return false; };
> + bool visitBinaryOperator(BinaryOperator &I) {
> 
> Visiting binary operators is good, but we should also visit ICmp,
> FCmp, GetElementPtr, Trunc, ZExt, SExt, FPTrunc, FPExt, FPToUI,
> FPToSI, UIToFP, SIToFP, BitCast, Select, ExtractElement,
> InsertElement, ShuffleVector, ExtractValue, InsertValue. It might be
> easier to just do this from the visitInstruction callback.
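> Purely as an illustration of that (simplifyWithConstOperands is a
> placeholder for whatever shared folding helper gets used):
>
>   bool visitInstruction(Instruction &I) {
>     switch (I.getOpcode()) {
>     case Instruction::ICmp:   case Instruction::FCmp:
>     case Instruction::Trunc:  case Instruction::ZExt:
>     case Instruction::SExt:   case Instruction::Select:
>     // ... and the rest of the list above ...
>       // Try to fold once all operands are known constants.
>       return simplifyWithConstOperands(I);
>     default:
>       return false;
>     }
>   }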
> 
> +  unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
> +  unsigned Start = StartC.getLimitedValue();
> +  unsigned Step = StepC.getLimitedValue();
> +
> +  unsigned Index = (Start + Step * Iteration) / ElemSize;
> +
> +  Constant *CV = CDS->getElementAsConstant(Index);
> 
> I think you need to guard here against out-of-bounds accesses (we
> should simply return nullptr, and not assert or segfault from the
> OOB access to the CDS).
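> Something along these lines (just a sketch; getNumElements() is the
> element count of the ConstantDataSequential):
>
>   unsigned Index = (Start + Step * Iteration) / ElemSize;
>   // Give up gracefully on an out-of-bounds index rather than letting
>   // getElementAsConstant assert.
>   if (Index >= CDS->getNumElements())
>     return nullptr;
>   Constant *CV = CDS->getElementAsConstant(Index);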
> 
> +  // Visit all loads the loop L, and for those that after complete loop
> +  // unrolling would have a constant address and it will point to from a known
> 
> ... that, after complete loop unrolling, would ... [add commas]
> 
> also
> 
> it will point to from a known -> it will point to a known [remove 'from']
> 
> +// This routine estimates this optimization effect and return number of
> +// instructions, that potentially might be optimized away.
> 
> return number -> returns the number
> 
> +  // iterations here. To limit ourselves here, we check only first 1000
> +  // iterations, and then scale the found number, if necessary.
> +  unsigned IterationsNumberForEstimate = std::min(1000u, TripCount);
> 
> Don't embed 1000 here; make this a cl::opt.
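> One possible shape for it (the option name and default are
> illustrative, not prescriptive):
>
>   static cl::opt<unsigned> UnrollMaxIterationsToAnalyze(
>       "unroll-max-iterations-to-analyze", cl::init(1000), cl::Hidden,
>       cl::desc("Max loop iterations to simulate when estimating "
>                "complete-unrolling effects"));
>   ...
>   unsigned IterationsNumberForEstimate =
>       std::min(UnrollMaxIterationsToAnalyze.getValue(), TripCount);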
> 
> With these changes, the first patch LGTM.
> 
> Thanks again,
> Hal
> 
> 
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> 
> 
> On Jan 26, 2015, at 11:50 AM, Michael Zolotukhin <
> mzolotukhin at apple.com > wrote:
> 
> 
> 
> 
> 
> 
> On Jan 25, 2015, at 6:06 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal Finkel" < hfinkel at anl.gov >
> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
> Messages and Patches for LLVM"
> < llvm-commits at cs.uiuc.edu >
> Sent: Sunday, January 25, 2015 1:20:18 AM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> On Jan 24, 2015, at 12:38 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal Finkel" < hfinkel at anl.gov >
> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
> Messages and Patches for LLVM"
> < llvm-commits at cs.uiuc.edu >
> Sent: Saturday, January 24, 2015 2:26:03 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> 
> Thanks for the review! Please see my comments inline.
> 
> 
> 
> 
> On Jan 24, 2015, at 6:44 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> [moving patch review to llvm-commits]
> 
> +static bool CanEliminateLoadFrom(Value *V) {
> +  if (Constant *C = dyn_cast<Constant>(V)) {
> +    if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))
> +      if (GV->isConstant() && GV->hasDefinitiveInitializer())
> 
> 
> Why are you casting to a Constant, and then to a GlobalVariable, and
> then checking GV->isConstant()? There seems to be unnecessary
> redundancy here ;)
> Indeed:)
> 
> 
> 
> 
> +        return GV->getInitializer();
> +  }
> +  return false;
> +}
> 
> +static unsigned ApproximateNumberOfEliminatedInstruction(const Loop *L,
> +                                                         ScalarEvolution &SE) {
> 
> This function seems mis-named. For one thing, it is not counting the
> number of instructions potentially eliminated, but the number of
> loads. Eliminating the loads might have led to further constant
> propagation, and really calculating the number of eliminated
> instructions would require estimating that effect, right?
> That’s right. But I’d rather change the function name than add such
> calculations, since they would look very narrowly targeted, and I
> worry that we might start to lose some cases as well. How about
> ‘NumberOfConstantFoldedLoads’?
> 
> 
> Sounds good to me.
> 
> 
> 
> 
> 
> 
> 
>   if (TripCount && Count == TripCount) {
> -    if (Threshold != NoThreshold && UnrolledSize > Threshold) {
> +    if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 * ElimInsns) {
> 
> 
> 20, huh? Is this a heuristic for constant propagation? It feels like
> we should be able to do better than this.
> Yep, that’s a parameter of the heuristic. Counting each ‘constant'
> load as 1 is too conservative and doesn’t give much here,
> 
> Understood.
> 
> 
> 
> but since
> we don’t actually count the number of eliminated instructions, we
> need some estimate for it. This estimate is really rough, since in
> some cases we can eliminate the entire loop body, while in others we
> can’t eliminate anything.
> 
> I think that we might as well estimate it. Once we know the loads
> that unrolling will constant fold, put them into a SmallPtrSet S.
> Then, use a worklist-based iteration of the uses of instructions in
> S. For each use that has operands that are all either constants, or
> in S, queue them and keep going. Make sure you walk breadth first
> (not depth first) so that you capture things with multiple feeder
> loads properly. As you do this, count the number of instructions
> that will be constant folded (keeping a Visited set in the usual way
> so you don't over-count). This will be an estimate, but should be a
> very accurate one, and can be done without any heuristic parameters
> (and you still only visit a linear number of instructions, so it
> should not be too expensive). A sketch of this walk follows.
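> Roughly like this (a sketch of the idea, not tested code; S starts
> out holding the constant-foldable loads):
>
>   SmallVector<Instruction *, 16> Worklist(S.begin(), S.end());
>   unsigned NumFolded = S.size();
>   while (!Worklist.empty()) {
>     Instruction *I = Worklist.pop_back_val();
>     for (User *U : I->users()) {
>       auto *UI = dyn_cast<Instruction>(U);
>       if (!UI || !L->contains(UI))
>         continue;
>       // UI folds only once every operand is a constant or is itself
>       // known to fold; until then we skip it (it gets revisited via
>       // its other feeders).
>       bool AllOpsFold = true;
>       for (Value *Op : UI->operands()) {
>         if (isa<Constant>(Op))
>           continue;
>         auto *OpI = dyn_cast<Instruction>(Op);
>         if (!OpI || !S.count(OpI))
>           AllOpsFold = false;
>       }
>       if (AllOpsFold && S.insert(UI).second) { // S doubles as Visited
>         ++NumFolded;
>         Worklist.push_back(UI);
>       }
>     }
>   }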
> The biggest gain comes not from expressions like buf[0]*buf[3],
> which can be constant folded once buf[0] and buf[3] are substituted
> with constants, but from x*buf[0] when buf[0] turns out to be 0 or
> 1. I.e. we don’t constant-fold it, but we simplify the expression
> based on the particular value. So the original convolution test is
> in some sense equivalent to the following code:
>
>   const int a[] = {0, 1, 5, 1, 0};
>   int *b;
>   for (i = 0; i < 100; i++) {
>     for (j = 0; j < 5; j++)
>       b[i] += b[i+j] * a[j];
>   }
> 
> 
> Complete unrolling will give us:
>
>   for (i = 0; i < 100; i++) {
>     b[i] += b[i]   * a[0];
>     b[i] += b[i+1] * a[1];
>     b[i] += b[i+2] * a[2];
>     b[i] += b[i+3] * a[3];
>     b[i] += b[i+4] * a[4];
>   }
> 
> 
> After const-prop we’ll get:
>
>   for (i = 0; i < 100; i++) {
>     b[i] += b[i]   * 0;
>     b[i] += b[i+1] * 1;
>     b[i] += b[i+2] * 5;
>     b[i] += b[i+3] * 1;
>     b[i] += b[i+4] * 0;
>   }
> And after simplification:
>
>   for (i = 0; i < 100; i++) {
>     b[i] += b[i+1];
>     b[i] += b[i+2] * 5;
>     b[i] += b[i+3];
>   }
> 
> 
> As you can see, we actually folded nothing here, but rather we
> simplified more than half of the instructions. But it’s hard to
> predict which optimizations will be enabled by the exact values of
> the replaced loads
> 
> Agreed.
> 
> - i.e. of course we can check for them, but that would be too
> optimization-specific in my opinion.
> 
> I understand your point, but I think using one magic number is
> swinging the pendulum too far in the other direction.
> 
> 
> 
> Thus, I decided to use some
> number (20 in the current patch) which represents the average
> profitability of replacing a load with a constant. I think I should
> get rid of this magic number though, and replace it with some
> target-specific parameter (that’ll help address Owen’s
> suggestions).
> 
> Using a target-specific cost (or costs) is a good idea, but having
> target-specific magic numbers is going to be a mess. Different loops
> will unroll, or not, for fairly arbitrary reasons, on different
> targets. This is a heuristic, I understand that, and you won't be
> able to exactly predict all possible later simplifications enabled
> by the unrolling. However, the fact that you currently need a large
> per-instruction boost factor, like 20, means, I think, that the
> model is too coarse.
> 
> Maybe it would not be unreasonable to start from the other side: If
> we first located loads that could be constant folded, and determined
> the values those loads would actually take, and then simplified the
> loop instructions based on those values, we'd get a pretty good
> estimate. This is essentially what the code in
> lib/Analysis/IPA/InlineCost.cpp does when computing the inlining
> cost savings, and I think we could use the same technique here.
> Admittedly, the inline cost analysis is relatively expensive, but
> I'd think we could impose a reasonable size cutoff to limit the
> overall expense for the case of full unrolling -- for this, maybe
> the boost factor of 20 is appropriate -- and would also address
> Owen's point, as far as I can tell, because like the inline cost
> analysis, we can use TTI to compute the target-specific cost of
> GEPs, etc. Ideally, we could refactor the ICA a little bit, and
> re-use the code in there directly for this purpose. This way we can
> limit the number of places in which we compute similar kinds of
> heuristic simplification costs.
> Hi Hal,
> 
> 
> Thanks for the comments. I think that’s a good idea, and I’ll try
> it out. If it doesn’t lead to an over-complicated implementation,
> I’d like it much better than having a magic number. I’ll prepare a
> new patch soon.
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> 
> I’d really appreciate feedback on how to model this in a better
> way.
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> Thanks again,
> Hal
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "LLVM Developers Mailing List ( llvmdev at cs.uiuc.edu )" <
> llvmdev at cs.uiuc.edu >
> Cc: "Hal J. Finkel" < hfinkel at anl.gov >, "Arnold Schwaighofer" <
> aschwaighofer at apple.com >
> Sent: Friday, January 23, 2015 2:05:11 PM
> Subject: [RFC] Heuristic for complete loop unrolling
> 
> 
> 
> Hi devs,
> 
> Recently I came across an interesting testcase that LLVM failed to
> optimize well. The test does some image processing, and as a part of
> it, it traverses all the pixels and computes some value based on
> the adjacent pixels. So, the hot part looks like this:
> 
> for (y = 0..height) {
>   for (x = 0..width) {
>     val = 0
>     for (j = 0..5) {
>       for (i = 0..5) {
>         val += img[x+i, y+j] * weight[i, j]
>       }
>     }
>   }
> }
> 
> And ‘weight' is just a constant matrix with some coefficients.
> 
> If we unroll the two internal loops (with tripcount 5), then we can
> replace weight[i,j] with concrete constant values. In this
> particular case, many of the coefficients are actually 0 or 1, which
> enables huge code simplifications later on. But currently we unroll
> only the innermost one, because unrolling both of them would exceed
> the threshold.
> 
> When deciding whether to unroll or not, we currently look only at
> the instruction count of the loop. My proposal is to, on top of
> that, check whether unrolling would enable any later optimizations -
> in this case, replacing a load with a constant. Similar to what we
> do in the inlining heuristics, we can estimate how many instructions
> would potentially be eliminated after unrolling and adjust our
> threshold by this value, as sketched below.
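> In the attached patch the adjusted check boils down to something
> like this (ElimInsns is the estimated number of constant-foldable
> loads; the weight 20 is the heuristic constant debated above in
> this thread):
>
>   if (Threshold != NoThreshold &&
>       UnrolledSize > Threshold + 20 * ElimInsns)
>     return false; // too big even after crediting the foldable loads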
> 
> I can imagine that it might also be useful for computations
> involving sparse constant matrices (like the identity matrix).
> 
> The attached patch implements this feature, and with it we handle the
> original testcase well.
> 
> 
> 
> 
> 
> Does it look good? Of course, any ideas, suggestions and other
> feedback are welcome!
> 
> 
> Thanks,
> Michael
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



