[LLVMdev] [RFC] Heuristic for complete loop unrolling

Gerolf Hoflehner ghoflehner at apple.com
Mon Jan 26 12:01:15 PST 2015


> On Jan 23, 2015, at 2:20 PM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> 
>> 
>> On Jan 23, 2015, at 1:27 PM, James Molloy <james at jamesmolloy.co.uk> wrote:
>> 
>> Hi Michael,
> Hi James,
>> 
>> This looks very much like the inner loops from convolve() in Geekbench... ;)
> Yes, the original example was from Geekbench, but I wasn’t sure if I could copy it :)
> 
>> Interestingly, if we reroll that loop nest, the number of instructions inside goes down by a factor of 3, so we do decide to unroll them all. But that is just luck / a hack. I completely agree that our current unrolling heuristics don't take the right things into account (our inlining heuristics are similar!).
>> 
>> When thinking about this previously, I've been worried that there's a slippery slope where you could end up implementing half of the rest of the compiler's analysis passes in the unroller/inliner to check for opportunities. Do you have any thoughts on that? How do we keep it under control?
> That’s a valid point. Theoretically, I think the best solution here would be to run the remaining passes in ‘sand-box’ mode and decide what to do based on what that run shows. But it’s hard to tell when to stop in these sand-box runs: should we just run const-prop + InstCombine? Should we try other loop optimizations after that? Should we run all passes, including the backend (e.g. unrolling has a huge impact on scheduling, so that could be really interesting)? And in any case, it would be quite expensive in compile time.
> 
> So, right now it seems to me that the best solution is to model future optimizations with (hopefully simple) heuristics. I don’t think we should worry much if the code resembles const-prop or another simple pass. However, if we find ourselves reimplementing the vectorizer, or, say, SROA, inside our heuristic, then probably something is wrong. As I said, we should either model such passes with simple heuristics, or think about interfaces in these passes that would allow executing dry runs.
> 
> E.g. another possible heuristic for the unroller would be the following: if after unrolling all loads/stores are consecutive, then it’s beneficial to unroll, because the code will probably be vectorized by SLP later. That’s a simple heuristic, and in some cases it will give wrong advice. We can either accept that, or add an interface to SLP so that it can answer whether it could vectorize a given instruction sequence or not. That would be the second way: more accurate, but more expensive.
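The consecutive-access heuristic described above can be sketched as follows. This is a minimal Python model with hypothetical helper names, not LLVM's actual cost model (the real unroller and SLP vectorizer work on IR, not on address lists):

```python
# Sketch of the "consecutive loads/stores after unrolling" heuristic.
# All names here are illustrative.

def accesses_after_unroll(base, stride, trip_count):
    """Addresses touched by one strided access once the loop is fully unrolled."""
    return [base + i * stride for i in range(trip_count)]

def looks_slp_vectorizable(addrs, elem_size):
    """Cheap proxy: if every access is adjacent to the previous one,
    SLP vectorization of the unrolled body is likely to succeed."""
    return all(b - a == elem_size for a, b in zip(addrs, addrs[1:]))

# A unit-stride loop over 4-byte elements unrolls into consecutive
# accesses, so the heuristic would favor unrolling it.
assert looks_slp_vectorizable(accesses_after_unroll(0, 4, 4), 4)

# A stride-2 loop leaves gaps, so the heuristic gives no such credit.
assert not looks_slp_vectorizable(accesses_after_unroll(0, 8, 4), 4)
```

As the paragraph notes, this is the cheap-but-inaccurate option; querying SLP itself would be more precise at a higher compile-time cost.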
> 
> In some cases the first approach would be acceptable, in others - the second, and I think that should be decided on a case-by-case basis.
> 
> Michael

There is no better short-term solution than experimenting with heuristics. But down the road we could think of orchestrating optimizations so that specific heuristics and their expected outcomes (e.g. #loads saved per loop) are recorded, and a later phase un-does the optimization (e.g. re-rolling the loop, or even outlining) when the expected benefits have not been realized (or not realized to a significant enough degree). This would provide a more globally optimization-conscious approach that should protect against outlier cases where a heuristic goes wrong.
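The record-and-undo scheme sketched above could look roughly like this in Python; every name here is hypothetical, since no such infrastructure exists in LLVM:

```python
from dataclasses import dataclass, field

@dataclass
class SpeculativeTransform:
    name: str
    expected_savings: int   # e.g. loads expected to be saved per loop
    undo: object            # callback that re-rolls / outlines the code

@dataclass
class TransformLog:
    entries: list = field(default_factory=list)

    def record(self, transform):
        self.entries.append(transform)

    def reconcile(self, measured_savings):
        # A late phase compares realized benefit against the recorded
        # expectation and reverts transforms that did not pay off.
        for t in self.entries:
            if measured_savings.get(t.name, 0) < t.expected_savings:
                t.undo()

reverted = []
log = TransformLog()
log.record(SpeculativeTransform("unroll-inner", 8, lambda: reverted.append("unroll-inner")))
log.record(SpeculativeTransform("unroll-outer", 8, lambda: reverted.append("unroll-outer")))
log.reconcile({"unroll-inner": 10, "unroll-outer": 3})  # outer missed its target
assert reverted == ["unroll-outer"]
```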

> 
>> 
>> cc'ing Kevin who has been doing a lot of work recently on the loop unroller.
>> 
>> Cheers,
>> 
>> James
>> 
>> On Fri Jan 23 2015 at 8:52:30 PM Michael Zolotukhin <mzolotukhin at apple.com> wrote:
>> Hi devs,
>> 
>> Recently I came across an interesting test case that LLVM failed to optimize well. The test does some image processing; as part of it, it traverses all the pixels and computes some value based on the adjacent pixels. So, the hot part looks like this:
>> 
>> for (y = 0; y < height; y++) {
>>   for (x = 0; x < width; x++) {
>>     val = 0;
>>     for (j = 0; j < 5; j++) {
>>       for (i = 0; i < 5; i++) {
>>         val += img[x+i][y+j] * weight[i][j];
>>       }
>>     }
>>   }
>> }
>> 
>> And ‘weight’ is just a constant matrix with some coefficients.
>> 
>> If we unroll the two internal loops (both with trip count 5), then we can replace weight[i][j] with concrete constant values. In this particular case, many of the coefficients are actually 0 or 1, which enables huge code simplifications later on. But currently we unroll only the innermost one, because unrolling both of them would exceed the threshold.
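To make the payoff concrete, here is a small Python model with a hypothetical mostly-zero weight matrix. Once both inner loops are unrolled, i and j are constants, each weight[i][j] load folds to a literal, and the zero terms vanish entirely:

```python
# Hypothetical 0/1 weight matrix, standing in for the constant
# coefficient table from the test case.
weight = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

def convolve_rolled(img, x, y):
    # The rolled form: 25 weight loads and 25 multiplies per pixel.
    val = 0
    for j in range(5):
        for i in range(5):
            val += img[x + i][y + j] * weight[i][j]
    return val

def convolve_folded(img, x, y):
    # What survives unrolling + constant folding: only the 13 nonzero
    # terms remain, and with 0/1 weights no multiplies are needed at all.
    return sum(img[x + i][y + j]
               for j in range(5) for i in range(5) if weight[i][j])

img = [[10 * r + c for c in range(10)] for r in range(10)]
assert convolve_rolled(img, 2, 2) == convolve_folded(img, 2, 2)
```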
>> 
>> When deciding whether to unroll, we currently look only at the instruction count of the loop. My proposal is to additionally check whether unrolling would enable any later optimizations, in this case by replacing a load with a constant. Similar to what we do in the inlining heuristics, we can estimate how many instructions would potentially be eliminated after unrolling and adjust our threshold by that value.
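The proposed threshold adjustment can be modeled in a few lines; the numbers and names below are illustrative, not the actual unroller's cost model:

```python
def should_fully_unroll(body_size, trip_count, foldable_per_iter, threshold):
    """Credit the unrolled size with instructions expected to be
    eliminated later (e.g. loads of constants and multiplies by 0/1)."""
    unrolled_size = body_size * trip_count
    expected_savings = foldable_per_iter * trip_count
    return unrolled_size - expected_savings <= threshold

# A 25-instruction body with trip count 5 is too big for a threshold
# of 100 under the plain size check ...
assert not should_fully_unroll(25, 5, 0, 100)

# ... but qualifies once we expect 8 instructions per iteration to
# fold away after unrolling.
assert should_fully_unroll(25, 5, 8, 100)
```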
>> 
>> I can imagine it might also be useful for computations involving sparse constant matrices (like the identity matrix).
>> 
>> The attached patch implements this feature, and with it we handle the original testcase well.
>> 
>> 
>> 
>> Does it look good? Of course, any ideas, suggestions and other feedback are welcome!
>> 
>> 
>> Thanks,
>> Michael
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
