[llvm-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives
qiaopeixin via llvm-dev
llvm-dev at lists.llvm.org
Wed Sep 8 02:41:11 PDT 2021
Hi,
I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines.
Let's start one c++ example:
```
void func(float *a, float *b, float *c, float *d, int N) {
#pragma omp simd
for (int i = 0; i < N; i++) {
d[i] = c[i] + 1.0;
#pragma omp ordered simd
a[i] = b[i] + 1.0;
}
}
```
What is expected according to OpenMP 5.0 standard is like the following:
```
void func(float *a, float *b, float *c, float *d, int N) {
for (int i = 0; i < N; i += 4) {
#pragma omp simd
for (int j = i; j < 4; j++)
d[i] = c[i] + 1.0; // vectorized
for (int j = i; j < 4; j++)
a[i] = b[i] + 1.0; // not vectorized
}
}
```
It seems that current Clang and LLVM do not support it.
Without openmp enabled, clang vectorizes the loop with memcheck as follows:
```
$ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
%bound030 = icmp ugt float* %scevgep25, %d
%bound131 = icmp ugt float* %scevgep, %c
%found.conflict32 = and i1 %bound030, %bound131
... fadd <4 x float> ...
```
With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected.
```
clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
```
With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
%scevgep = getelementptr float, float* %d, i64 %wide.trip.count
%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%bound037 = icmp ugt float* %scevgep32, %d
%bound138 = icmp ugt float* %scevgep, %c
%found.conflict39 = and i1 %bound037, %bound138
... fadd <4 x float> ...
```
But the expected IR should be like the following:
```
%scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
%scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
%found.conflict...
... fadd <4 x float> ...
```
I have two questions here:
1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how?
2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop?
Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled.
```
void func(float *a, float *b, float *c, float *d, int N) {
#pragma omp simd
for (int i = 1; i < N; i++) {
d[i] = c[i] + 1.0;
#pragma omp ordered simd
a[i] = a[i-1] + 1.0;
}
}
```
What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`.
I also test icc and gcc and here are the results:
```
$ icc -v
icc version 2021.1
$ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
LOOP BEGIN at test.cpp(3,3)
remark #15531: Block of statements was serialized due to user request [ test.cpp(5,5) ]
remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END
$ g++ -v
gcc version 9.3.0 (GCC)
$ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
fadd s0, s0, s1 // not vectorized
...
fadd s0, s0, s1 // not vectorized
// There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization.
```
For the following test case:
```
void func(float *b, float *c, float *d, int N) {
float a[N];
for (int i = 0; i < N; i++)
a[i] = 0;
#pragma omp simd
for (int i = 1; i < N; i++) {
d[i] = c[i] + 1.0;
#pragma omp ordered simd
a[i] = a[i-1] + 1.0;
}
}
```
The IR generated is as follows:
```
$ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
%scevgep = getelementptr float, float* %d, i64 1
%3 = add nuw nsw i64 %wide.trip.count, 1
%scevgep41 = getelementptr float, float* %d, i64 %3
%scevgep43 = getelementptr float, float* %c, i64 1
%scevgep45 = getelementptr float, float* %c, i64 %3
%bound0 = icmp ult float* %scevgep, %scevgep45
%bound1 = icmp ult float* %scevgep43, %scevgep41
%found.conflict = and i1 %bound0, %bound1
%induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
```
The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there.
All the best,
Peixin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210908/3d91e6fb/attachment.html>
More information about the llvm-dev
mailing list