[PATCH] Loop Rerolling Pass

Tue Oct 22 09:31:33 PDT 2013

Right, your case is simpler than the general problem of vectorizing interleaved data:

You can’t reroll this: 
for(int i = 0; i < len; i++) {
    c[2*i] = a[2*i]*b[2*i] - a[2*i+1]*b[2*i+1];
    c[2*i+1] = a[2*i]*b[2*i+1] + a[2*i+1]*b[2*i];
}

But you can reroll your example:

void fn1 (unsigned MAX, char *READ, char *WRITE)
{
    unsigned i;
    char e, f, g;
    char h, j, k;

    for (i=0; i<MAX; i++) {
       e = *READ++;
       f = *READ++;
       g = *READ++;

       h = OFFSET - e - DELTA;
       j = OFFSET - f - DELTA;
       k = OFFSET - g - DELTA;

       *WRITE++ = h;
       *WRITE++ = j;
       *WRITE++ = k;
    }
}

Which really is just:

for (i = 0; i < MAX; i++) {
  WRITE[3*i] = OFFSET-READ[3*i] - DELTA;
  WRITE[3*i+1] = OFFSET-READ[3*i+1] - DELTA;
  WRITE[3*i+2] = OFFSET-READ[3*i+2] - DELTA;
}
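
Rerolled, that would come out roughly like fn1_rerolled below. This is only a sketch: the OFFSET and DELTA values and both function names are made up for illustration, and the loops use explicit indices instead of pointer increments.

```c
/* Hypothetical values; the original example leaves OFFSET and DELTA undefined. */
#define OFFSET 100
#define DELTA 3

/* The loop as written: three identical statements per iteration. */
void fn1_unrolled(unsigned max, const char *read, char *write)
{
    for (unsigned i = 0; i < max; i++) {
        write[3*i]     = OFFSET - read[3*i]     - DELTA;
        write[3*i + 1] = OFFSET - read[3*i + 1] - DELTA;
        write[3*i + 2] = OFFSET - read[3*i + 2] - DELTA;
    }
}

/* Rerolled: one statement per iteration, three times as many iterations. */
void fn1_rerolled(unsigned max, const char *read, char *write)
{
    for (unsigned i = 0; i < 3 * max; i++)
        write[i] = OFFSET - read[i] - DELTA;
}
```

Both functions touch the same 3*max elements in the same order, which is what makes the rewrite legal here.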

But say your example were to be changed to:

for (i = 0; i < MAX; i++) {
  WRITE[3*i] = OFFSET-READ[3*i] - DELTA1;
  WRITE[3*i+1] = OFFSET-READ[3*i+1] - DELTA2;
  WRITE[3*i+2] = OFFSET-READ[3*i+2] - DELTA3;
}

We could no longer reroll it, because the three statements now differ in more than their index.

If you wanted to tackle the general problem of vectorizing interleaved data, I am not sure that much of Hal’s code can be immediately reused. You want to split the set of load and store instructions with non-unit stride accesses into groups, where each group has useful locality (the members of a group are adjacent).

for(int i = 0; i < len; i++) {
    c[2*i] = a[2*i]*b[2*i] - a[2*i+1]*b[2*i+1];
    c[2*i+1] = a[2*i]*b[2*i+1] + a[2*i+1]*b[2*i];
}

In this loop you want to create groups:

{c[2*i], c[2*i+1]}
{a[2*i], a[2*i+1]}
{b[2*i], b[2*i+1]}

For this you want to analyze the SCEV of each memory access’s pointer.

Say you had:

int a[]; a[2*i] …; a[2*i+1];

The add recurrence will look something like:

{%a, +, 8}_loop
{(%a + 4), +, 8}_loop

They share a common underlying pointer (%a), and the step (8) is bigger than the type size (4), i.e. the stride is non-unit, so you can put them in a group. They are adjacent because their start values differ by exactly the type size. ….
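
To make the grouping test concrete, here is a rough sketch in plain C. It is not LLVM's SCEV API: struct access and adjacent_in_group are hypothetical names that just model an affine add recurrence as (base, start, step) plus the element size.

```c
#include <stdbool.h>

/* Hypothetical model of an add recurrence {base + start, +, step}. */
struct access {
    const void *base;   /* underlying pointer, e.g. %a */
    long start;         /* byte offset of the first access */
    long step;          /* bytes advanced per loop iteration */
    long elem_size;     /* size of the accessed type in bytes */
};

/* True if y is the element directly after x in the same strided group. */
bool adjacent_in_group(const struct access *x, const struct access *y)
{
    return x->base == y->base &&                  /* common underlying pointer */
           x->elem_size == y->elem_size &&
           x->step == y->step &&
           x->step > x->elem_size &&              /* non-unit stride */
           y->start - x->start == x->elem_size;   /* starts one element apart */
}
```

With the two recurrences above, {%a, +, 8} and {(%a + 4), +, 8} with a 4-byte int, the check succeeds, so a[2*i] and a[2*i+1] land in one group.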

Once you have those groups, you can treat those accesses specially in the loop vectorizer during vectorization (make sure that vectorizing them is legal, give them a cheaper cost during cost estimation, emit special code during vectorization). We can emit vector loads/stores for them with permutation operations for their users/inputs.
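
As an illustration of the shape of the code we would want to emit, here is a scalar C model of the complex-multiply loop after such a transformation. Everything in it is an assumption made for the sketch: the float element type, the width VF = 4, the function name, and len being a multiple of VF. The inner loops stand in for a wide contiguous load or store plus a shuffle.

```c
enum { VF = 4 };  /* hypothetical vector width, in complex elements */

/* Complex multiply over interleaved (re, im) pairs.
   Assumes len is a multiple of VF, so there is no remainder loop. */
void cmul_interleaved(float *c, const float *a, const float *b, int len)
{
    for (int i = 0; i < len; i += VF) {
        float are[VF], aim[VF], bre[VF], bim[VF];

        /* Models one wide contiguous load per group plus a deinterleaving
           shuffle that separates even and odd elements. */
        for (int l = 0; l < VF; l++) {
            are[l] = a[2*(i + l)];
            aim[l] = a[2*(i + l) + 1];
            bre[l] = b[2*(i + l)];
            bim[l] = b[2*(i + l) + 1];
        }

        /* Lane-wise arithmetic; writing the results back in (re, im) order
           models an interleaving shuffle plus one wide contiguous store. */
        for (int l = 0; l < VF; l++) {
            c[2*(i + l)]     = are[l]*bre[l] - aim[l]*bim[l];
            c[2*(i + l) + 1] = are[l]*bim[l] + aim[l]*bre[l];
        }
    }
}
```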

Best,
Arnold

On Oct 22, 2013, at 10:12 AM, Renato Golin <renato.golin at linaro.org> wrote:

> Hi Andrew, Hal, Nick,
> 
> How's this coming along?
> 
> Talking to Arnold, I think this pass could help the stride vectorization, or at least, the result of this discussion can help me identify (if the API is clear) the stride case, and transform the loop accordingly, for vectorization.
> 
> cheers,
> --renato