[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Wed Sep 8 02:41:11 PDT 2021

Hi,

I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines.

Let's start one c++ example:
```
void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 0; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = b[i] + 1.0;
  }
}
```
What is expected according to OpenMP 5.0 standard is like the following:
```
void func(float *a, float *b, float *c, float *d, int N) {
  for (int i = 0; i < N; i += 4) {
    #pragma omp simd
    for (int j = i; j < 4; j++)
      d[i] = c[i] + 1.0; // vectorized

    for (int j = i; j < 4; j++)
      a[i] = b[i] + 1.0; // not vectorized
  }
}
```
It seems that current Clang and LLVM do not support it.

Without openmp enabled, clang vectorizes the loop with memcheck as follows:
```
$ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
  %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
  %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
  %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
  %bound030 = icmp ugt float* %scevgep25, %d
  %bound131 = icmp ugt float* %scevgep, %c
  %found.conflict32 = and i1 %bound030, %bound131
  ... fadd <4 x float> ...
```

With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected.
```
clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
```

With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
  %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
  %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
  %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
  %bound037 = icmp ugt float* %scevgep32, %d
  %bound138 = icmp ugt float* %scevgep, %c
  %found.conflict39 = and i1 %bound037, %bound138
  ... fadd <4 x float> ...
```
But the expected IR should be like the following:
```
  %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
  %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
  %found.conflict...
  ... fadd <4 x float> ...
```
I have two questions here:
1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how?
2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop?

Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled.
```
void func(float *a, float *b, float *c, float *d, int N) {
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
```
What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`.
I also test icc and gcc and here are the results:
```
$ icc -v
icc version 2021.1
$ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
LOOP BEGIN at test.cpp(3,3)
   remark #15531: Block of statements was serialized due to user request   [ test.cpp(5,5) ]
   remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
$ g++ -v
gcc version 9.3.0 (GCC)
$ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
fadd    s0, s0, s1 // not vectorized
...
fadd    s0, s0, s1 // not vectorized
// There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization.
```

For the following test case:
```
void func(float *b, float *c, float *d, int N) {
  float a[N];
  for (int i = 0; i < N; i++)
    a[i] = 0;
  #pragma omp simd
  for (int i = 1; i < N; i++) {
    d[i] = c[i] + 1.0;
    #pragma omp ordered simd
    a[i] = a[i-1] + 1.0;
  }
}
```
The IR generated is as follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
  %scevgep = getelementptr float, float* %d, i64 1
  %3 = add nuw nsw i64 %wide.trip.count, 1
  %scevgep41 = getelementptr float, float* %d, i64 %3
  %scevgep43 = getelementptr float, float* %c, i64 1
  %scevgep45 = getelementptr float, float* %c, i64 %3
  %bound0 = icmp ult float* %scevgep, %scevgep45
  %bound1 = icmp ult float* %scevgep43, %scevgep41
  %found.conflict = and i1 %bound0, %bound1
  %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
```
The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there.

All the best,
Peixin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210908/3d91e6fb/attachment.html>