[RFC] Heuristic for complete loop unrolling

Hal Finkel hfinkel at anl.gov
Sun Jan 25 06:06:39 PST 2015


----- Original Message -----
> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Arnold Schwaighofer" <aschwaighofer at apple.com>, "Commit Messages and Patches for LLVM"
> <llvm-commits at cs.uiuc.edu>
> Sent: Sunday, January 25, 2015 1:20:18 AM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> On Jan 24, 2015, at 12:38 PM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "Hal Finkel" < hfinkel at anl.gov >
> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
> Messages and Patches for LLVM"
> < llvm-commits at cs.uiuc.edu >
> Sent: Saturday, January 24, 2015 2:26:03 PM
> Subject: Re: [RFC] Heuristic for complete loop unrolling
> 
> 
> Hi Hal,
> 
> 
> Thanks for the review! Please see my comments inline.
> 
> 
> 
> 
> On Jan 24, 2015, at 6:44 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> [moving patch review to llvm-commits]
> 
> +static bool CanEliminateLoadFrom(Value *V) {
> +  if (Constant *C = dyn_cast<Constant>(V)) {
> +    if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))
> +      if (GV->isConstant() && GV->hasDefinitiveInitializer())
> 
> Why are you casting to a Constant, and then to a GlobalVariable, and
> then checking GV->isConstant()? There seems to be unnecessary
> redundancy here ;)
> Indeed:)
> 
> 
> 
> 
> +        return GV->getInitializer();
> +  }
> +  return false;
> +}
> 
> +static unsigned ApproximateNumberOfEliminatedInstruction(const Loop *L,
> +                                                         ScalarEvolution &SE) {
> 
> This function seems mis-named. For one thing, it is not counting the
> number of instructions potentially eliminated, but the number of
> loads. Eliminating the loads might lead to further constant
> propagation, and really calculating the number of eliminated
> instructions would require estimating that effect, right?
> That’s right. But I’d rather change the function name, than add such
> calculations, since it’ll look very-very narrow targeted, and I
> worry that we might start to lose some cases as well. How about
> ‘NumberOfConstantFoldedLoads’?
> 
> 
> Sounds good to me.
> 
> 
> if (TripCount && Count == TripCount) {
> -   if (Threshold != NoThreshold && UnrolledSize > Threshold) {
> +   if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 * ElimInsns) {
> 
> 
> 20, huh? Is this a heuristic for constant propagation? It feels like
> we should be able to do better than this.
> Yep, that’s a parameter of the heuristic. Counting each ‘constant'
> load as 1 is too conservative and doesn’t give much here,
> 
> Understood.
> 
> 
> 
> but since
> we don’t actually count number of eliminated instructions we need
> some estimate for it. This estimate is really rough, since in some
> cases we can eliminate the entire loop body, while in the others we
> can’t eliminate anything.
> 
> I think that we might as well estimate it. Once we know the loads
> that unrolling will constant fold, put them into a SmallPtrSet S.
> Then, use a worklist-based iteration of the uses of instructions in
> S. For each use that has operands that are all either constants, or
> in S, queue them and keep going. Make sure you walk breadth first
> (not depth first) so that you capture things with multiple feeder
> loads properly. As you do this, count the number of instructions
> that will be constant folded (keeping a Visited set in the usual way
> so you don't over-count). This will be an estimate, but should be a
> very accurate one, and can be done without any heuristic parameters
> (and you still only visit a linear number of instructions, so should
> not be too expensive).
> The biggest gain comes not from expressions like buf[0]*buf[3], which
> can be constant folded when buf[0] and buf[3] are substituted with
> constants, but from x*buf[0], when buf[0] turns out to be 0 or 1.
> I.e. we don’t constant-fold it, but we simplify the expression based
> on the particular value. I.e. the original convolution test is in
> some sense equivalent to the following code:
> 
> const int a[] = {0, 1, 5, 1, 0};
> int *b;
> for (i = 0; i < 100; i++) {
>   for (j = 0; j < 5; j++)
>     b[i] += b[i+j]*a[j];
> }
> 
> 
> Complete unrolling will give us:
> 
> for (i = 0; i < 100; i++) {
>   b[i] += b[i]*a[0];
>   b[i] += b[i+1]*a[1];
>   b[i] += b[i+2]*a[2];
>   b[i] += b[i+3]*a[3];
>   b[i] += b[i+4]*a[4];
> }
> 
> 
> After const-prop we’ll get:
> 
> for (i = 0; i < 100; i++) {
>   b[i] += b[i]*0;
>   b[i] += b[i+1]*1;
>   b[i] += b[i+2]*5;
>   b[i] += b[i+3]*1;
>   b[i] += b[i+4]*0;
> }
> And after simplification:
> 
> for (i = 0; i < 100; i++) {
>   b[i] += b[i+1];
>   b[i] += b[i+2]*5;
>   b[i] += b[i+3];
> }
> 
> 
> As you can see, we actually folded nothing here, but rather we
> simplified more than half of the instructions. But it’s hard to
> predict which optimizations will be enabled by the exact values of
> the replaced loads

Agreed.

> - i.e., of course we can check it, but it would be too
> optimization-specific in my opinion.

I understand your point, but I think using one magic number is swinging the pendulum too far in the other direction.

> Thus, I decided to use some
> number (20 in the current patch) which represents some average
> profitability of replacing a load with constant. I think I should
> get rid of this magic number though, and replace it with some
> target-specific parameter (that’ll help to address Owen’s
> suggestions).

Using a target-specific cost (or costs) is a good idea, but having target-specific magic numbers is going to be a mess: different loops will unroll, or not, for fairly arbitrary reasons on different targets. This is a heuristic, I understand that, and you won't be able to exactly predict all possible later simplifications enabled by the unrolling. However, the fact that you currently need a large per-instruction boost factor, like 20, means, I think, that the model is too coarse.

Maybe it would not be unreasonable to start from the other side: If we first located loads that could be constant folded, and determined the values those loads would actually take, and then simplified the loop instructions based on those values, we'd get a pretty good estimate. This is essentially what the code in lib/Analysis/IPA/InlineCost.cpp does when computing the inlining cost savings, and I think we could use the same technique here. Admittedly, the inline cost analysis is relatively expensive, but I'd think we could impose a reasonable size cutoff to limit the overall expense for the case of full unrolling -- for this, maybe the boost factor of 20 is appropriate -- and would also address Owen's point, as far as I can tell, because like the inline cost analysis, we can use TTI to compute the target-specific cost of GEPs, etc. Ideally, we could refactor the ICA a little bit, and re-use the code in there directly for this purpose. This way we can limit the number of places in which we compute similar kinds of heuristic simplification costs.

 -Hal

> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> 
> -Hal
> 
> 
> 
> 
> 
> I’d really appreciate feedback on how to model this in a better
> way.
> 
> 
> Thanks,
> Michael
> 
> 
> 
> 
> Thanks again,
> Hal
> 
> ----- Original Message -----
> 
> 
> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
> To: "LLVM Developers Mailing List ( llvmdev at cs.uiuc.edu )" <
> llvmdev at cs.uiuc.edu >
> Cc: "Hal J. Finkel" < hfinkel at anl.gov >, "Arnold Schwaighofer" <
> aschwaighofer at apple.com >
> Sent: Friday, January 23, 2015 2:05:11 PM
> Subject: [RFC] Heuristic for complete loop unrolling
> 
> 
> 
> Hi devs,
> 
> Recently I came across an interesting testcase that LLVM failed to
> optimize well. The test does some image processing, and as a part of
> it, it traverses all the pixels and computes some value based on
> the adjacent pixels. So, the hot part looks like this:
> 
> for (y = 0..height) {
>   for (x = 0..width) {
>     val = 0
>     for (j = 0..5) {
>       for (i = 0..5) {
>         val += img[x+i,y+j] * weight[i,j]
>       }
>     }
>   }
> }
> 
> And ‘weight' is just a constant matrix with some coefficients.
> 
> If we unroll the two internal loops (with tripcount 5), then we can
> replace weight[i,j] with concrete constant values. In this
> particular case, many of the coefficients are actually 0 or 1, which
> enables huge code simplifications later on. But currently we unroll
> only the innermost one, because unrolling both of them would exceed
> the threshold.
> 
> When deciding whether to unroll or not, we currently look only at the
> instruction count of the loop. My proposal is to, on top of that,
> check if we can enable any later optimizations by unrolling - in
> this case by replacing a load with a constant. Similar to what we do
> in inlining heuristics, we can estimate how many instructions would
> be potentially eliminated after unrolling and adjust our threshold
> with this value.
> 
> I can imagine that it might also be useful for computations
> involving sparse constant matrices (like the identity matrix).
> 
> The attached patch implements this feature, and with it we handle the
> original testcase well.
> 
> 
> Does it look good? Of course, any ideas, suggestions and other
> feedback are welcome!
> 
> 
> Thanks,
> Michael
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



