[llvm-dev] Jacobi 5 Point Stencil Code not Vectorizing

Sun Jul 2 22:19:20 PDT 2017

Hello,

This is not equivalent rewrite: your original code definitely shouldn't vectorize because it has backward cross-iteration (loop-carried) dependency: your value on iteration j+1 depend on value from iteration j you've just written. In case of vectorization you need to do load-operation-store on multiple consecutive values at once and it is impossible in this case.

In your new code all values of a[] in right-hand side of the main loop are from previous k iteration (because on current k iteration you're writing to 'b'). So there is no way to vectorize loop in its original form, but new form is definitely vectorizable. 

I am second to recommend you filing a bug over 'restrict' behavior. And you may in fact save some memory by making 'b' 1D array (this is not equivalent rewrite once again though)

// This function computes 2D-5 point Jacobi stencil
void stencil(int a[][N], int b[N])
{
  int i, j, k;
    for (k = 0; k < N; k++) {
        for (i = 1; i <= N-2; i++) {
            for (j = 1; j <= N-2; j++)
                 b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
            for (j = 1; j <= N-2; j++)
                 a[i][j] = b[j];
        }
    }
}

There is a way to vectorize your code even more efficient. This will look like more low-level, but probably closest to what you would expect from vectorization. 

void stencil(int a[][N])
{
    int i, j, k;
    for (k = 0; k < 100; k++) {  
         for (i = 1; i <= N-2; i++) {
            for (j = 1; j <= N-2; j+=16) {
               int b[16];     // This should become a register on KNL
               for (v = 0; v < 16; v++) {
                   b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);
               }
               for (v = 0; v < 16; v++) {  // This will be a single store operation
                   a[i][j+v] = b[v]; 
               }
           }
           // You should explicitly take care about the tail of j-loop
#if !MASKED_SHORT_LOOP_VECTORIZATION_SUPPORTED    // This is not an actual name, just a designator
           for (;j <= N-2; j++) {  
               a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
           }
#else
          for (v = 0; v < 16-(N-2-j); v++) {  // This would become masked non-loop
              b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] + a[i][j-1+v] + a[i][j+1+v]);
          }
          for (v = 0; v <  16-(N-2-j); v++) {  // This will be a single masked store operation
              a[i][j+v] = b[v]; 
          }
#endif
    }
}

Unfortunalely compiler cannot do this for you: this is not equivalent transformation of original code. I am also not aware of any way to express this desired behavior less explicitly (e.g. OpenMP SIMD pragma won't work in this case).

Minor note: You're using 'int' for data, than multiply by 0.25 (divide by 4) and than write it back to 'int'. This will cost you 2 conversion to/from double while you may just place (...) / 4  which should be optimized to simple sequecnce with shifts (not to single shift due to signedness, but still better than conversions with changes of element size 4->8->4 and data size INT->FP->INT). 

And by the way why do you divide by 4, not by 5 as number of points suggest?

Serge Preis

02.07.2017, 05:11, "hameeza ahmed via llvm-dev" <llvm-dev at lists.llvm.org>:
> further i modified the code to the following;
>
> #include <stdio.h>
> #define N 100351
>
> // This function computes 2D-5 point Jacobi stencil
> void stencil(int a[restrict][N], int b[restrict][N])
> {
>    int i, j, k;
>    for (k = 0; k < N; k++) {
>        for (i = 1; i <= N-2; i++)
>            for (j = 1; j <= N-2; j++)
>         b[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
>
> for (i = 1; i <= N-2; i++)
> for (j = 1; j <= N-2; j++)
>  a[i][j] = b[i][j];
>
> }
> }
>
>    but still no vectorization in IR. Also, when I set vector width explicitly to 64, it gives the following error:
>
> remark: <unknown>:0:0: loop not vectorized: call instruction cannot be vectorized
> remark: <unknown>:0:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
>
> I need serious help on this. Please guide me.
>
> On Sun, Jul 2, 2017 at 1:54 AM, hameeza ahmed <hahmed2305 at gmail.com> wrote:
>> Does it happen due to loop carried dependence? if yes what is the solution to vectorize such codes?
>>
>> please reply. i m waiting.
>>
>> On Jul 1, 2017 12:30 PM, "hameeza ahmed" <hahmed2305 at gmail.com> wrote:
>>> I even tried polly but still my llvm IR does not contain vector instructions. i used the following command;
>>>
>>> clang  -S -emit-llvm stencil.c -march=knl -O3 -mllvm -polly -mllvm -polly-vectorizer=stripmine -o stencil_poly.ll
>>>
>>> Please specify what is wrong with my code?
>>>
>>> On Sat, Jul 1, 2017 at 4:08 PM, hameeza ahmed <hahmed2305 at gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I am trying to vectorize following stencil code;
>>>>
>>>> #include <stdio.h>
>>>> #define N 100351
>>>>
>>>> // This function computes 2D-5 point Jacobi stencil
>>>> void stencil(int a[restrict][N])
>>>> {
>>>>    int i, j, k;
>>>>    for (k = 0; k < 100; k++)
>>>>      {  for (i = 1; i <= N-2; i++)
>>>>          {  for (j = 1; j <= N-2; j++)
>>>>         { a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
>>>>               }
>>>> }
>>>> }}
>>>>
>>>> I have used the following commands
>>>>
>>>> clang  -S -emit-llvm stencil.c -march=knl -O3 -mllvm -disable-llvm-optzns -o stencil.ll
>>>>
>>>> opt  -S -O3 stencil.ll -o stencil_o3.ll
>>>>
>>>> llc -x86-asm-syntax=intel stencil_o3.ll -o stencil.s
>>>>
>>>> But the code is not vectorized. It still uses the scalar instructions;
>>>>
>>>> Please correct me.
>>>>
>>>> Thank You
> ,
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev