[LLVMdev] [PATCH] Loop Rerolling Pass

Wed Oct 23 07:31:18 PDT 2013

I mistakenly not cc’ed the list.

On Oct 23, 2013, at 8:28 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 23 October 2013 14:13, Arnold <aschwaighofer at apple.com> wrote:
>> What I am proposing for interleaved data vectorization would not transform the loop before vectorization (because there is not much you can do in the general case). There would not be a need for any retries. 
> 
> I know, I digressed.
> 
> 
>> We want to run the loop vectorizer late so we can't rely on a other passes to undo the canonicalisation we have done.
> 
> I didn't mean to change the code and annotate, but to just annotate the unchanged blocks in case we want to try again, we can avoid re-doing the same analysis.
> 
> I've noticed, on the debug-only, that the loop vectorizer passes twice on the same loops, and we even check for validity of previously auto-vectorized loops! This is already a big waste, and if there was a way to avoid that, we could use the saved time for more complex (maybe iterative, speculative) analysis without increasing the compilation time.
> 

Iteresting. This is a bug. Since we moved the loop vectorizer out of the scc-module pass manager we should only run once on every new loop. The vectorized loop is a new loop so runOnLoop will probably be called again. But the Hint we put in the IR should prevent us from analyzing such loops.

I think what you might be seeing is a bug in LoopVectorizerHints:

 /// Mark the loop L as already vectorized by setting the width to 1.
  void setAlreadyVectorized(Loop *L)

We should set width *and* unroll to one. Because this is how we use it:

    LoopVectorizeHints Hints(L, DisableUnrolling);

    if (Hints.Width == 1 && Hints.Unroll == 1) {
      DEBUG(dbgs() << "LV: Not vectorizing.\n");
      return false;
    }

After doing this no vectorized loop should be analyzed again. If not I very much like to know.

http://llvm.org/bugs/show_bug.cgi?id=17662

Thanks,
Arnold

> This is a completely different subject to what we were talking about, that was a bit of a handbrake turn, sorry about that. ;)
> 
> cheers,
> --renato

On Oct 23, 2013, at 8:13 AM, Arnold <aschwaighofer at apple.com> wrote:

> Hi Renato,
> 
> 
> On Oct 23, 2013, at 5:40 AM, Renato Golin <renato.golin at linaro.org> wrote:
> 
>> On 22 October 2013 17:31, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:
>> If you wanted to tackle the general problem of vectorization of interleaved data I am not sure that much of Hal’s code can be immediately reused for the problem.
>> 
>> Hi Arnold,
>> 
>> I agree with you, but some of the cases where a stride would work, Hal's patch would re-roll, and if I base my analysis on what was there before, I won't detect it.
>> 
> You would detect it doing interleaved data in the vectorizer and generate exactly the code that GCC does. 
> 
> Which granted will be less efficient than if we reroll and vectorize in which case we should see a speed up of 8 (8 x i8).
> 
> I have not looked at whether Hal's code could be improved to handle your uncanonical example.
> 
>> Since Hal hinted on leaving that as an API for the vectorizer, I thought that maybe it'd be worth running that first, and delegating the vectorization to other, simpler, parts of the vectorizer.
> Yes we could run his patch before the vectorizers because otherwise in the current pipeline the code would not survive the slp vectorizer.
> This is a phase ordering issue.
>> 
>> As an example, I created two loops:
>> 
>> One, unrolled, with stride access:
>>     for (i=0; i<MAX; i++) {
>>       a[i] = OFFSET - b[i] - DELTA;
>>       a[i+1] = OFFSET - b[i+1] - DELTA;
>>       a[i+2] = OFFSET - b[i+2] - DELTA;
>>     }
>> 
>> One re-rolled, as it would, with Hal's patch:
>>     for (i=0; i<MAX*3; i++) {
>>       a[i] = OFFSET - b[i] - DELTA;
>>     }
>> 
>> The first loop doesn't get vectorized, because there are too many PHIs, but the second does.
>> 
>> So, Hal's patch will deal with the most basic strided access, so that we can focus on the more complex patterns.
>> 
>> There is one slight issue with all this (APIs, and re-tries), and it's that there are already too much analysis going on in the vectorizer, and you don't want to run the canVectorize() too many times, so calling re-roll on failure and trying again, then calling stride-transform and trying again on loops that would never benefit from those treatments might not be a good idea.
> 
> What I am proposing for interleaved data vectorization would not transform the loop before vectorization (because there is not much you can do in the general case). There would not be a need for any retries. 
> 
> You can do all the analysis using SCEVS and the transforms in the vectorizer by treating strided accesses specially.
> 
> The example of transforming the loop to a simpler form I gave in our discussion was as an illustration. I did say "virtually" we make the loop as if it look like the simpler /canonical form. We already to this with unit stride pointer loops.
>> 
>> To deal with this, I thought we could have a table with a list of problems and how to solve them, from cheaper to more expensive analysis. So, say I found a loop with strided access, and the general error from a more basic analysis is to say that we found an "unidentified PHI". We then mark the loop with some metadata to that effect, and the next iteration, it'll try to transform the loop before vectorizing in a way that will increase the chances of it getting vectorized.
>> 
>> Moreover, such a table (and associated metadata), could also be an easier way to annotate vectorization failures, and even used by tools to help developers fix their code.
>> 
>> 
> 
> 
> I think I would not want to modify IR speculatively for analysis to get a canonical form that we can handle.
> And it is also not necessary for the problem at hand.
> SCEV already provides us with a canonical view of the loop.
> 
> 
> I agree that we have to do something about the complexity we have in the loop vectorizer before we add any more significant body of code to it.
> 
> I am not convinced that iteratively speculatively cannonicalizing the loop is the right answer.
> 
> We want to run the loop vectorizer late so we can't rely on a other passes to undo the canonicalisation we have done.
> 
> Thanks,
> Arnold
> 
>>  
>> You want to split the set of load and store instructions with non-unit stride accesses into groups where each group contains useful locality (the members of a group are adjacent).
>> 
>> Thanks again for the how-to, they're helping me get up to speed with the vectorizer again.
>> 
>> cheers,
>> --renato