[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Wed Sep 8 06:35:22 PDT 2021

Hi Peixin,

First, I think CC'ing a lot of folks is not always the best strategy.
This is also a topic for openmp-dev and I'll reply there instead.

~ Johannes

On 9/8/21 4:41 AM, qiaopeixin wrote:
> Hi,
>
> I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines.
>
> Let's start one c++ example:
> ```
> void func(float *a, float *b, float *c, float *d, int N) {
>    #pragma omp simd
>    for (int i = 0; i < N; i++) {
>      d[i] = c[i] + 1.0;
>      #pragma omp ordered simd
>      a[i] = b[i] + 1.0;
>    }
> }
> ```
> What is expected according to OpenMP 5.0 standard is like the following:
> ```
> void func(float *a, float *b, float *c, float *d, int N) {
>    for (int i = 0; i < N; i += 4) {
>      #pragma omp simd
>      for (int j = i; j < 4; j++)
>        d[i] = c[i] + 1.0; // vectorized
>
>      for (int j = i; j < 4; j++)
>        a[i] = b[i] + 1.0; // not vectorized
>    }
> }
> ```
> It seems that current Clang and LLVM do not support it.
>
> Without openmp enabled, clang vectorizes the loop with memcheck as follows:
> ```
> $ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
>    %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>    %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
>    %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
>    %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
>    %bound030 = icmp ugt float* %scevgep25, %d
>    %bound131 = icmp ugt float* %scevgep, %c
>    %found.conflict32 = and i1 %bound030, %bound131
>    ... fadd <4 x float> ...
> ```
>
> With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected.
> ```
> clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
> ```
>
> With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows:
> ```
> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
>    %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>    %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>    %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
>    %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>    %bound037 = icmp ugt float* %scevgep32, %d
>    %bound138 = icmp ugt float* %scevgep, %c
>    %found.conflict39 = and i1 %bound037, %bound138
>    ... fadd <4 x float> ...
> ```
> But the expected IR should be like the following:
> ```
>    %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>    %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>    %found.conflict...
>    ... fadd <4 x float> ...
> ```
> I have two questions here:
> 1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how?
> 2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop?
>
> Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled.
> ```
> void func(float *a, float *b, float *c, float *d, int N) {
>    #pragma omp simd
>    for (int i = 1; i < N; i++) {
>      d[i] = c[i] + 1.0;
>      #pragma omp ordered simd
>      a[i] = a[i-1] + 1.0;
>    }
> }
> ```
> What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`.
> I also test icc and gcc and here are the results:
> ```
> $ icc -v
> icc version 2021.1
> $ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
> LOOP BEGIN at test.cpp(3,3)
>     remark #15531: Block of statements was serialized due to user request   [ test.cpp(5,5) ]
>     remark #15301: SIMD LOOP WAS VECTORIZED
> LOOP END
> $ g++ -v
> gcc version 9.3.0 (GCC)
> $ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
> fadd    s0, s0, s1 // not vectorized
> ...
> fadd    s0, s0, s1 // not vectorized
> // There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization.
> ```
>
> For the following test case:
> ```
> void func(float *b, float *c, float *d, int N) {
>    float a[N];
>    for (int i = 0; i < N; i++)
>      a[i] = 0;
>    #pragma omp simd
>    for (int i = 1; i < N; i++) {
>      d[i] = c[i] + 1.0;
>      #pragma omp ordered simd
>      a[i] = a[i-1] + 1.0;
>    }
> }
> ```
> The IR generated is as follows:
> ```
> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
>    %scevgep = getelementptr float, float* %d, i64 1
>    %3 = add nuw nsw i64 %wide.trip.count, 1
>    %scevgep41 = getelementptr float, float* %d, i64 %3
>    %scevgep43 = getelementptr float, float* %c, i64 1
>    %scevgep45 = getelementptr float, float* %c, i64 %3
>    %bound0 = icmp ult float* %scevgep, %scevgep45
>    %bound1 = icmp ult float* %scevgep43, %scevgep41
>    %found.conflict = and i1 %bound0, %bound1
>    %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
> ```
> The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there.
>
> All the best,
> Peixin
>