[RFC] Heuristic for complete loop unrolling

Mon Feb 2 21:38:37 PST 2015

----- Original Message -----
> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
> To: "Hal J. Finkel" <hfinkel at anl.gov>
> Cc: "Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Monday, February 2, 2015 7:25:30 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> Please find a new version attached. I broke it into two parts: the
> first one is the implementation of new heuristic, and the second is
> for adjusting cost model.
> 
> 1. 0001-Implement-new-heuristic-for-complete-loop-unrolling.patch
> 
> I added a new class UnrollAnalyzer (like InlineAnalyzer), that
> estimates possible optimization effects of complete-unrolling. Now
> we simulate inst-simplify by visiting all users of loads that might
> become constant, and then we simulate DCE, which also might perform
> significant clean-ups here. The counted number of optimized
> instruction then returned for further consideration (in this patch
> we just add it to threshold - that’s a conservative way of treating
> it, since after the mentioned optimizations we still should be under
> the threshold).
> 
> 2. 0002-Use-estimated-number-of-optimized-insns-in-unroll-th.patch
> 
> In this patch we use more aggressive strategy in choosing threshold.
> My idea here is that we can go beyond the default threshold if we
> can significantly optimize the unrolled body, but we still should be
> in reasonable limits. Let’s look at examples to illustrate it
> better:
> a) DefaultThreshold=150, LoopSize=50, IterationsNumber=5,
> UnrolledLoopSize=50*5=250,
> NumberOfPotentiallyOptimizedInstructions=90
> In this case after unroll we would get 250 instructions, and after
> inst-simplify+DCE, we’ll go down to 160 instructions. Though it’s
> still bigger than DefaultThreshold, I think we do want to unroll
> here, since it would speed up this part by ~90/250=36%.
> 
> 
> b) DefaultThreshold=150, LoopSize=1000, IterationsNumber=1000,
> UnrolledLoopSize=1000*1000,
> NumberOfPotentiallyOptimizedInstructions=500. The absolute number of
> optimized instructions in this example is bigger than in the
> previous one, but we don’t want to unroll here, because the
> resultant code would be huge, and we’d only save
> 500/(1000*1000)=0.05% instructions.
> 
> 
> To handle both situations, I suggest only unroll if we can optimize
> significant portion of the loop body, e.g. if after unrolling we can
> remove N% of code. We might want to have an absolute upper limit for
> final code size too (i.e. even if we optimize 50% of instructions,
> don’t unroll loop with 10^8 iterations), but for now I decided to
> add only one new parameter to avoid over-complicating things.
> 
> 
> In the patch I use weight for optimized instruction (yep, like in the
> previous patch, but here it’s more meaningful), which is essentially
> equivalent to a ratio between original and removed code size.

I'm not particularly keen on this bonus parameter, but do like the percentage reduction test you've outlined. Can you please implement that instead?

> 
> 
> My testing is still in progress and probably I’ll do some tuning
> later, the current default value (10) was taken almost at random.
> 
> Do these patches look better?
> 

Yes, much better. A few comments on the first:

+      // variable. Now it's time if it corresponds to a global constant global
+      // (in which case we can eliminate the load), or not.

... time to see if ...

+// This class is used to get an estimate of optimization effect that we could

optimization effect -> the optimization effects

+// get from complete loop unrolling. It comes from the fact that some loads
+// might be replaced with a concrete constant values and that could trigger a
+// chain of instruction simplifications.

a concrete -> concrete [remove 'a']

+  bool visitInstruction(Instruction &I) { return false; };
+  bool visitBinaryOperator(BinaryOperator &I) {

Visiting binary operators is good, but we should also visit ICmp, FCmp, GetElementPtr, Trunc, ZExt, SExt, FPTrunc, FPExt, FPToUI, FPToSI, UIToFP, SIToFP, BitCast, Select, ExtractElement, InsertElement, ShuffleVector, ExtractValue, InsertValue. It might be easier to just do this from the visitInstruction callback. 

+    unsigned ElemSize = CDS->getElementType()->getPrimitiveSizeInBits() / 8U;
+    unsigned Start = StartC.getLimitedValue();
+    unsigned Step = StepC.getLimitedValue();
+
+    unsigned Index = (Start + Step * Iteration) / ElemSize;
+
+    Constant *CV = CDS->getElementAsConstant(Index);

I think you need to guard here against out-of-bounds accesses (we should simply return nullptr and not assert or segfault from the OOB access to the CDS.

+  // Visit all loads the loop L, and for those that after complete loop
+  // unrolling would have a constant address and it will point to from a known

... that, after complete loop unrolling, would ... [add commas]

also

it will point to from a known -> it will point to a known [remove 'from']

+// This routine estimates this optimization effect and return number of
+// instructions, that potentially might be optimized away.

return number -> returns the number

+  // iterations here. To limit ourselves here, we check only first 1000
+  // iterations, and then scale the found number, if necessary.
+  unsigned IterationsNumberForEstimate = std::min(1000u, TripCount);

Don't embed 1000 here; make this a cl::opt.

With these changes, the first patch LGTM.

Thanks again,
Hal

> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> 
> 
> On Jan 26, 2015, at 11:50 AM, Michael Zolotukhin <
> mzolotukhin at apple.com > wrote:
> 
> 
> 
> 
> 
> 
> On Jan 25, 2015, at 6:06 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal Finkel" < hfinkel at anl.gov >
> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
> Messages and Patches for LLVM"
> < llvm-commits at cs.uiuc.edu >
> Sent: Sunday, January 25, 2015 1:20:18 AM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> On Jan 24, 2015, at 12:38 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal Finkel" < hfinkel at anl.gov >
> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
> Messages and Patches for LLVM"
> < llvm-commits at cs.uiuc.edu >
> Sent: Saturday, January 24, 2015 2:26:03 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> 
> Thanks for the review! Please see my comments inline.
> 
> 
> 
> 
> On Jan 24, 2015, at 6:44 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> [moving patch review to llvm-commits]
> 
> +static bool CanEliminateLoadFrom(Value *V) {
> 
> + if (Constant *C = dyn_cast<Constant>(V)) {
> 
> + if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))
> 
> + if (GV->isConstant() && GV->hasDefinitiveInitializer())
> 
> 
> Why are you casting to a Constant, and then to a GlobalVariable, and
> then checking GV->isConstant()? There seems to be unnecessary
> redundancy here ;)
> Indeed:)
> 
> 
> 
> 
> + return GV->getInitializer();
> 
> + }
> 
> + return false;
> 
> +}
> 
> +static unsigned ApproximateNumberOfEliminatedInstruction(const Loop
> *L,
> 
> + ScalarEvolution &SE) {
> 
> This function seems mis-named. For one thing, it is not counting the
> number of instructions potentially eliminated, but the number of
> loads. Eliminating the loads might have lead to further constant
> propagation, and really calculating the number of eliminated
> instructions would require estimating that effect, right?
> That’s right. But I’d rather change the function name, than add such
> calculations, since it’ll look very-very narrow targeted, and I
> worry that we might start to lose some cases as well. How about
> ‘NumberOfConstantFoldedLoads’?
> 
> 
> Sounds good to me.
> 
> 
> 
> 
> 
> 
> 
> if (TripCount && Count == TripCount) {
> 
> - if (Threshold != NoThreshold && UnrolledSize > Threshold) {
> 
> + if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 *
> ElimInsns) {
> 
> 
> 20, huh? Is this a heuristic for constant propagation? It feels like
> we should be able to do better than this.
> Yep, that’s a parameter of the heuristic. Counting each ‘constant'
> load as 1 is too conservative and doesn’t give much here,
> 
> Understood.
> 
> 
> 
> but since
> we don’t actually count number of eliminated instructions we need
> some estimate for it. This estimate is really rough, since in some
> cases we can eliminate the entire loop body, while in the others we
> can’t eliminate anything.
> 
> I think that we might as well estimate it. Once we know the loads
> that unrolling will constant fold, put them into a SmallPtrSet S.
> Then, use a worklist-based iteration of the uses of instructions in
> S. For each use that has operands that are all either constants, or
> in S, queue them and keep going. Make sure you walk breadth first
> (not depth first) so that you capture things will multiple feeder
> loads properly. As you do this, count the number of instructions
> that will be constant folded (keeping a Visited set in the usual way
> so you don't over-count). This will be an estimate, but should be a
> very accurate one, and can be done without any heuristic parameters
> (and you still only visit a linear number of instructions, so should
> not be too expensive).
> The biggest gain comes not from expressions buf[0]*buf[3], which can
> be constant folded when buf[0] and buf[3] are substituted with
> constants, but from x*buf[0], when buf[0] turns out to be 0, or 1.
> I.e. we don’t constant-fold it, but we simplify the expression based
> on the particular value. I.e. the original convolution test is in
> some sense equivalent to the following code:
> const int a[] = [0, 1, 5, 1, 0];
> int *b;
> for(i = 0; i < 100; i ++) {
> for(j = 0; j < 5; j++)
> b[i] += b[i+j]*a[j];
> }
> 
> 
> Complete unrolling will give us:
> 
> for(i = 0; i < 100; i ++) {
> b[i] += b[i]*a[0];
> 
> b[i] += b[i+1]*a[1];
> 
> b[i] += b[i+2]*a[2];
> 
> b[i] += b[i+3]*a[3];
> 
> b[i] += b[i+4]*a[4];
> }
> 
> 
> After const-prop we’ll get:
> 
> for(i = 0; i < 100; i ++) {
> b[i] += b[i]*0;
> 
> b[i] += b[i+1]*1;
> 
> b[i] += b[i+2]*5;
> 
> b[i] += b[i+3]*1;
> 
> b[i] += b[i+4]*0;
> }
> And after simplification:
> 
> for(i = 0; i < 100; i ++) {
> b[i] += b[i+1];
> 
> 
> b[i] += b[i+2]*5;
> 
> b[i] += b[i+3];
> 
> }
> 
> 
> As you can see, we actually folded nothing here, but rather we
> simplified more than a half of instructions. But it’s hard to
> predict, which optimizations will be enabled by exact values
> replaced loads
> 
> Agreed.
> 
> - i.e. of course we can check it, but it would be too
> 
> 
> optimization-specific in my opinion.
> 
> I understand your point, but I think using one magic number is
> swinging the pendulum too far in the other direction.
> 
> 
> 
> Thus, I decided to use some
> number (20 in the current patch) which represents some average
> profitability of replacing a load with constant. I think I should
> get rid of this magic number though, and replace it with some
> target-specific parameter (that’ll help to address Owen’s
> suggestions).
> 
> Using a target-specific cost (or costs) is a good idea, but having
> target-specific magic numbers is going to be a mess. Different loops
> will unroll, or not, for fairly arbitrary reasons, on different
> targets. This is a heuristic, that I understand, and you won't be
> able to exactly predict all possible later simplifications enabled
> by the unrolling. However, the fact that you currently need a large
> per-instruction boost factor, like 20, I think, means that the model
> is too coarse.
> 
> Maybe it would not be unreasonable to start from the other side: If
> we first located loads that could be constant folded, and determined
> the values those loads would actually take, and then simplified the
> loop instructions based on those values, we'd get a pretty good
> estimate. This is essentially what the code in
> lib/Analysis/IPA/InlineCost.cpp does when computing the inlining
> cost savings, and I think we could use the same technique here.
> Admittedly, the inline cost analysis is relatively expensive, but
> I'd think we could impose a reasonable size cutoff to limit the
> overall expense for the case of full unrolling -- for this, maybe
> the boost factor of 20 is appropriate -- and would also address
> Owen's point, as far as I can tell, because like the inline cost
> analysis, we can use TTI to compute the target-specific cost of
> GEPs, etc. Ideally, we could refactor the ICA a little bit, and
> re-use the code in there directly for this purpose. This way we can
> limit the number of places in which we compute similar kinds of
> heuristic simplification costs.
> Hi Hal,
> 
> 
> Thanks for the comments. I think that’s a good idea, and I’ll try
> that out. If that doesn’t lead to over-complicated implementation,
> I’d like it much better than having a magic number. I’ll prepare a
> new patch soon.
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> 
> I’d really appreciate a feedback on how to model this in a better
> way.
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> Thanks again,
> Hal
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "LLVM Developers Mailing List ( llvmdev at cs.uiuc.edu )" <
> llvmdev at cs.uiuc.edu >
> Cc: "Hal J. Finkel" < hfinkel at anl.gov >, "Arnold Schwaighofer" <
> aschwaighofer at apple.com >
> Sent: Friday, January 23, 2015 2:05:11 PM
> Subject: [RFC] Heuristic for complete loop unrolling
> 
> 
> 
> Hi devs,
> 
> Recently I came across an interesting testcase that LLVM failed to
> optimize well. The test does some image processing, and as a part of
> it, it traverses all the pixels and computes some value basing on
> the adjacent pixels. So, the hot part looks like this:
> 
> for(y = 0..height) {
> for (x = 0..width) {
> val = 0
> for (j = 0..5) {
> for (i = 0..5) {
> val += img[x+i,y+j] * weight[i,j]
> }
> }
> }
> }
> 
> And ‘weight' is just a constant matrix with some coefficients.
> 
> If we unroll the two internal loops (with tripcount 5), then we can
> replace weight[i,j] with concrete constant values. In this
> particular case, many of the coefficients are actually 0 or 1, which
> enables huge code simplifications later on. But currently we unroll
> only the innermost one, because unrolling both of them will exceed
> the threshold.
> 
> When deciding whether to unroll or not, we currently look only at the
> instruction count of the loop. My proposal is to, on top of that,
> check if we can enable any later optimizations by unrolling - in
> this case by replacing a load with a constant. Similar to what we do
> in inlining heuristics, we can estimate how many instructions would
> be potentially eliminated after unrolling and adjust our threshold
> with this value.
> 
> I can imagine that it might be also useful for computations,
> involving sparse constant matrixes (like identity matrix).
> 
> The attached patch implements this feature, and with it we handle the
> original testcase well.
> 
> 
> 
> 
> 
> Does it look good? Of course, any ideas, suggestions and other
> feedback are welcome!
> 
> 
> Thanks,
> Michael
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory