[PATCH] Reimplement heuristic for estimating complete-unroll optimization effects.

Michael Zolotukhin mzolotukhin at apple.com
Thu Apr 9 18:15:17 PDT 2015


================
Comment at: lib/Transforms/Scalar/LoopUnrollPass.cpp:690-703
@@ +689,16 @@
+
+  unsigned PercentOfOptimizedInstructions =
+      NumberOfOptimizedInstructions * 100 /
+      UnrolledSize; // The previous check guards us from div by 0
+  if (UnrolledSize <= AbsoluteThreshold &&
+      PercentOfOptimizedInstructions >= PercentOfOptimizedForCompleteUnroll) {
+    DEBUG(dbgs() << "  Can fully unroll, because unrolling will help removing "
+                 << PercentOfOptimizedInstructions
+                 << "% instructions (threshold: "
+                 << PercentOfOptimizedForCompleteUnroll << "%)\n");
+    DEBUG(dbgs() << "  Unrolled size (" << UnrolledSize
+                 << ") is less than the threshold (" << AbsoluteThreshold
+                 << ").\n");
+    return true;
+  }
+
----------------
chandlerc wrote:
> Can you explain some of your motivations for having the double threshold and percentage query? It seems really awkward to implement, and so I'm curious what the need is here. If we could get around with just having a flat threshold, it'd make me happy. =]
The idea is the following: currently we have a threshold for unrolling small loops (~200 instructions). What I want to add is a possibility to go beyond this threshold, but only if doing so gives a performance benefit. E.g. if the unrolled loop would be 500 instructions but 30% faster than the original loop, then we want to unroll it. But we do not want to unroll this loop if it would become only 5% faster (in terms of the cost of executed instructions). On the other hand, we don't want to unroll loops with huge trip counts, even if the resultant code seems to be faster. I.e. if unrolling would help eliminate 50% of the instructions, but the trip count is 10^9, we definitely don't want to unroll it.

And several examples to illustrate the idea:
a)
```
int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}
```
If we completely unroll the loop, we'll get something like:
```
t = b[0]*c[0];
a[0] = t * d[0];
t = b[1]*c[1];
a[1] = t * d[1];
...
t = b[499]*c[499];
a[499] = t * d[499];
```
which would be simplified to:
```
a[0] = 0; // b[0] == 0
a[1] = 0; // b[1] == 0
...
a[498] = 0;  // b[498] == 0
a[499] = c[499]*d[499]; //b[499] == 1
```
That is, unrolling helps to remove ~50% of the instructions in this case - and that's not about code size, it's about execution time, because in the original loop we have to execute every MUL instruction, since we don't know the exact value of b[i].

b)
```
/* The same example as before, but with a huge trip count. */
int b[] = {0,0,0...0,1}; // most of the values are 0
for (i = 0; i < 500000; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}
```
We want to give up on this loop, because the unrolled version would be way too big. We might have some problems compiling it, and even if we compile it successfully, we might be hit hard by cache/memory effects.

c)
```
/* The same example as (a), but unrolling doesn't help to simplify anything. */
int b[] = {6,2,3...4,7}; // no 0 or 1 values
for (i = 0; i < 500; i++) {
  t = b[i] * c[i];
  a[i] = t * d[i];
}
```
We don't want to start unrolling every loop with a higher trip count than we unrolled before if that doesn't promise any performance benefit.


So, to distinguish (a) and (b), we use 'AbsoluteThreshold'. To distinguish (a) and (c) we use percentage.

http://reviews.llvm.org/D8816
