[llvm-dev] Jacobi 5 Point Stencil Code not Vectorizing

Tue Oct 24 06:30:23 PDT 2017

Your problem is due to GVN partial reduction elimination (PRE) which
introduces a PHI node the current loop vectorizer cannot handle:

opt -O3 stencil.ll -pass-remarks=loop-vectorize
-pass-remarks-missed=loop-vectorize
-pass-remarks-analysis=loop-vectorize
remark: <unknown>:0:0: loop not vectorized: value that could not be
identified as reduction is used outside the loop
remark: <unknown>:0:0: loop not vectorized

The message is not entirely accurate. The PHI is not used outside of
the loop, but is in the loop header, therefore cannot be if-converted
and is not a reduction. Polly also had difficulties with this
construction as well.

As presented at the dev meeting, VPlan will be able to insert a
shuffle instruction when encountering this situation.

Michael

2017-10-23 8:58 GMT+02:00 hameeza ahmed via llvm-dev <llvm-dev at lists.llvm.org>:
> stencil.ll is attached here.
>
> On Mon, Oct 23, 2017 at 11:37 AM, Serge Preis <spreis at yandex-team.ru> wrote:
>>
>>
>>
>> Hello,
>>
>> To me this is an issue in llvm loop vectorizer (if N is large enough to
>> prevent complete unrolling of j-loop).
>>
>> Woud you mind to share stencil.ll than I would say more definitely what
>> the issue is.
>>
>> Regards,
>> Serge.
>>
>>
>> 22.10.2017, 03:21, "hameeza ahmed" <hahmed2305 at gmail.com>:
>>
>> Hello,
>>
>> The following stencil function that you wrote few months ago (for
>> computing 2D-5 point Jacobi stencil) is vectorizable.
>>
>> void stencil(int a[][N], int b[N])
>> {
>>   int i, j, k;
>>     for (k = 0; k < N; k++) {
>>         for (i = 1; i <= N-2; i++) {
>>             for (j = 1; j <= N-2; j++)
>>                  b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] +
>> a[i][j-1] + a[i][j+1]);
>>             for (j = 1; j <= N-2; j++)
>>                  a[i][j] = b[j];
>>         }
>>     }
>> }
>>
>> but when i write the above code in main i am getting error
>>
>> opt  -S -O3  stencil.ll -o stencil_o3.ll
>> remark: <unknown>:0:0: loop not vectorized: value that could not be
>> identified as reduction is used outside the loop
>>
>> Why is that so?
>>
>> my code is follows:
>> int a[N][N],  b[N];
>> int main(void)
>> {
>>   int i, j, k;
>>     for (k = 0; k < N; k++) {
>>         for (i = 1; i <= N-2; i++) {
>>             for (j = 1; j <= N-2; j++)
>>                  b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] +
>> a[i][j-1] + a[i][j+1]);
>>             for (j = 1; j <= N-2; j++)
>>                  a[i][j] = b[j];
>>         }
>>     }
>> }
>>
>>
>> Please help.
>>
>> Thank You.
>>
>>
>> On Mon, Jul 3, 2017 at 10:19 AM, Serge Preis <spreis at yandex-team.ru>
>> wrote:
>>
>> Hello,
>>
>> This is not equivalent rewrite: your original code definitely shouldn't
>> vectorize because it has backward cross-iteration (loop-carried) dependency:
>> your value on iteration j+1 depend on value from iteration j you've just
>> written. In case of vectorization you need to do load-operation-store on
>> multiple consecutive values at once and it is impossible in this case.
>>
>> In your new code all values of a[] in right-hand side of the main loop are
>> from previous k iteration (because on current k iteration you're writing to
>> 'b'). So there is no way to vectorize loop in its original form, but new
>> form is definitely vectorizable.
>>
>> I am second to recommend you filing a bug over 'restrict' behavior. And
>> you may in fact save some memory by making 'b' 1D array (this is not
>> equivalent rewrite once again though)
>>
>> // This function computes 2D-5 point Jacobi stencil
>> void stencil(int a[][N], int b[N])
>> {
>>   int i, j, k;
>>     for (k = 0; k < N; k++) {
>>         for (i = 1; i <= N-2; i++) {
>>             for (j = 1; j <= N-2; j++)
>>                  b[j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] +
>> a[i][j-1] + a[i][j+1]);
>>             for (j = 1; j <= N-2; j++)
>>                  a[i][j] = b[j];
>>         }
>>     }
>> }
>>
>> There is a way to vectorize your code even more efficient. This will look
>> like more low-level, but probably closest to what you would expect from
>> vectorization.
>>
>> void stencil(int a[][N])
>> {
>>     int i, j, k;
>>     for (k = 0; k < 100; k++) {
>>          for (i = 1; i <= N-2; i++) {
>>             for (j = 1; j <= N-2; j+=16) {
>>                int b[16];     // This should become a register on KNL
>>                for (v = 0; v < 16; v++) {
>>                    b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] +
>> a[i][j-1+v] + a[i][j+1+v]);
>>                }
>>                for (v = 0; v < 16; v++) {  // This will be a single store
>> operation
>>                    a[i][j+v] = b[v];
>>                }
>>            }
>>            // You should explicitly take care about the tail of j-loop
>> #if !MASKED_SHORT_LOOP_VECTORIZATION_SUPPORTED    // This is not an actual
>> name, just a designator
>>            for (;j <= N-2; j++) {
>>                a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] +
>> a[i][j-1] + a[i][j+1]);
>>            }
>> #else
>>           for (v = 0; v < 16-(N-2-j); v++) {  // This would become masked
>> non-loop
>>               b[v] = 0.25 * (a[i][j+v] + a[i-1][j+v] + a[i+1][j+v] +
>> a[i][j-1+v] + a[i][j+1+v]);
>>           }
>>           for (v = 0; v <  16-(N-2-j); v++) {  // This will be a single
>> masked store operation
>>               a[i][j+v] = b[v];
>>           }
>> #endif
>>     }
>> }
>>
>> Unfortunalely compiler cannot do this for you: this is not equivalent
>> transformation of original code. I am also not aware of any way to express
>> this desired behavior less explicitly (e.g. OpenMP SIMD pragma won't work in
>> this case).
>>
>> Minor note: You're using 'int' for data, than multiply by 0.25 (divide by
>> 4) and than write it back to 'int'. This will cost you 2 conversion to/from
>> double while you may just place (...) / 4  which should be optimized to
>> simple sequecnce with shifts (not to single shift due to signedness, but
>> still better than conversions with changes of element size 4->8->4 and data
>> size INT->FP->INT).
>>
>> And by the way why do you divide by 4, not by 5 as number of points
>> suggest?
>>
>> Serge Preis
>>
>>
>> 02.07.2017, 05:11, "hameeza ahmed via llvm-dev" <llvm-dev at lists.llvm.org>:
>> > further i modified the code to the following;
>> >
>> > #include <stdio.h>
>> > #define N 100351
>> >
>> > // This function computes 2D-5 point Jacobi stencil
>> > void stencil(int a[restrict][N], int b[restrict][N])
>> > {
>> >    int i, j, k;
>> >    for (k = 0; k < N; k++) {
>> >        for (i = 1; i <= N-2; i++)
>> >            for (j = 1; j <= N-2; j++)
>> >         b[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] + a[i][j-1] +
>> > a[i][j+1]);
>> >
>> > for (i = 1; i <= N-2; i++)
>> > for (j = 1; j <= N-2; j++)
>> >  a[i][j] = b[i][j];
>> >
>> > }
>> > }
>> >
>> >    but still no vectorization in IR. Also, when I set vector width
>> > explicitly to 64, it gives the following error:
>> >
>> > remark: <unknown>:0:0: loop not vectorized: call instruction cannot be
>> > vectorized
>> > remark: <unknown>:0:0: loop not vectorized: value that could not be
>> > identified as reduction is used outside the loop
>> >
>> > I need serious help on this. Please guide me.
>> >
>> > On Sun, Jul 2, 2017 at 1:54 AM, hameeza ahmed <hahmed2305 at gmail.com>
>> > wrote:
>> >> Does it happen due to loop carried dependence? if yes what is the
>> >> solution to vectorize such codes?
>> >>
>> >> please reply. i m waiting.
>> >>
>> >> On Jul 1, 2017 12:30 PM, "hameeza ahmed" <hahmed2305 at gmail.com> wrote:
>> >>> I even tried polly but still my llvm IR does not contain vector
>> >>> instructions. i used the following command;
>> >>>
>> >>> clang  -S -emit-llvm stencil.c -march=knl -O3 -mllvm -polly -mllvm
>> >>> -polly-vectorizer=stripmine -o stencil_poly.ll
>> >>>
>> >>> Please specify what is wrong with my code?
>> >>>
>> >>> On Sat, Jul 1, 2017 at 4:08 PM, hameeza ahmed <hahmed2305 at gmail.com>
>> >>> wrote:
>> >>>> Hello,
>> >>>>
>> >>>> I am trying to vectorize following stencil code;
>> >>>>
>> >>>> #include <stdio.h>
>> >>>> #define N 100351
>> >>>>
>> >>>> // This function computes 2D-5 point Jacobi stencil
>> >>>> void stencil(int a[restrict][N])
>> >>>> {
>> >>>>    int i, j, k;
>> >>>>    for (k = 0; k < 100; k++)
>> >>>>      {  for (i = 1; i <= N-2; i++)
>> >>>>          {  for (j = 1; j <= N-2; j++)
>> >>>>         { a[i][j] = 0.25 * (a[i][j] + a[i-1][j] + a[i+1][j] +
>> >>>> a[i][j-1] + a[i][j+1]);
>> >>>>               }
>> >>>> }
>> >>>> }}
>> >>>>
>> >>>> I have used the following commands
>> >>>>
>> >>>> clang  -S -emit-llvm stencil.c -march=knl -O3 -mllvm
>> >>>> -disable-llvm-optzns -o stencil.ll
>> >>>>
>> >>>> opt  -S -O3 stencil.ll -o stencil_o3.ll
>> >>>>
>> >>>> llc -x86-asm-syntax=intel stencil_o3.ll -o stencil.s
>> >>>>
>> >>>> But the code is not vectorized. It still uses the scalar
>> >>>> instructions;
>> >>>>
>> >>>> Please correct me.
>> >>>>
>> >>>> Thank You
>> > ,
>> >
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>