[Openmp-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Wed Sep 8 08:59:47 PDT 2021

Peixin,

are you interested in trying to fix this?

As described, to make it at least sound we should not emit
the access.group metadata for the call to the outlined function.
That will not necessarily resolve your "problem", e.g., that the
vectorizer uses memory checks etc., but it will at least stop us
from generating wrong code.

~ Johannes

On 9/8/21 10:57 AM, Alexey Bataev wrote:
>
> Best regards,
> Alexey Bataev
>
>> 8 сент. 2021 г., в 09:41, Johannes Doerfert <jdoerfert at anl.gov> написал(а):
>>
>> Hi Peixin,
>>
>> I think you are right that the code we generate is not correct.
>> The problem is not that a[i] is vectorized, the problem is that
>> we might vectorize it without a memory check (with O2 instead of
>> O3, see https://godbolt.org/z/9dnqMKexT).
>>
>> @Alexey, what was the intention of the outlined ordered region?
> No special intention, that was just a straightforward way implementing this feature.
>
>> I'm not really sure how to handle this best but the access.group
>> on the call to the outlined region seems to be wrong as it implies
>> vectorization is sound while it isn't. WDYT?
>>
> I agree, looks like the codegen for the vectorizable loops was changed but we did not adjust it for the ordered simd region. Probably, need to fix loop vectorization metadata emission for ordered simd regions.
>
>> ~ Johannes
>>
>>
>>> On 9/8/21 4:41 AM, qiaopeixin wrote:
>>> Hi,
>>>
>>> I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines.
>>>
>>> Let's start one c++ example:
>>> ```
>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>    #pragma omp simd
>>>    for (int i = 0; i < N; i++) {
>>>      d[i] = c[i] + 1.0;
>>>      #pragma omp ordered simd
>>>      a[i] = b[i] + 1.0;
>>>    }
>>> }
>>> ```
>>> What is expected according to OpenMP 5.0 standard is like the following:
>>> ```
>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>    for (int i = 0; i < N; i += 4) {
>>>      #pragma omp simd
>>>      for (int j = i; j < 4; j++)
>>>        d[i] = c[i] + 1.0; // vectorized
>>>
>>>      for (int j = i; j < 4; j++)
>>>        a[i] = b[i] + 1.0; // not vectorized
>>>    }
>>> }
>>> ```
>>> It seems that current Clang and LLVM do not support it.
>>>
>>> Without openmp enabled, clang vectorizes the loop with memcheck as follows:
>>> ```
>>> $ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>>    %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>>>    %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
>>>    %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
>>>    %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
>>>    %bound030 = icmp ugt float* %scevgep25, %d
>>>    %bound131 = icmp ugt float* %scevgep, %c
>>>    %found.conflict32 = and i1 %bound030, %bound131
>>>    ... fadd <4 x float> ...
>>> ```
>>>
>>> With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected.
>>> ```
>>> clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>> ```
>>>
>>> With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows:
>>> ```
>>> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>>    %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>>>    %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>>>    %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
>>>    %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>>>    %bound037 = icmp ugt float* %scevgep32, %d
>>>    %bound138 = icmp ugt float* %scevgep, %c
>>>    %found.conflict39 = and i1 %bound037, %bound138
>>>    ... fadd <4 x float> ...
>>> ```
>>> But the expected IR should be like the following:
>>> ```
>>>    %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>>>    %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>>>    %found.conflict...
>>>    ... fadd <4 x float> ...
>>> ```
>>> I have two questions here:
>>> 1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how?
>>> 2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop?
>>>
>>> Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled.
>>> ```
>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>    #pragma omp simd
>>>    for (int i = 1; i < N; i++) {
>>>      d[i] = c[i] + 1.0;
>>>      #pragma omp ordered simd
>>>      a[i] = a[i-1] + 1.0;
>>>    }
>>> }
>>> ```
>>> What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`.
>>> I also test icc and gcc and here are the results:
>>> ```
>>> $ icc -v
>>> icc version 2021.1
>>> $ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S && cat test.optrpt
>>> LOOP BEGIN at test.cpp(3,3)
>>>     remark #15531: Block of statements was serialized due to user request   [ test.cpp(5,5) ]
>>>     remark #15301: SIMD LOOP WAS VECTORIZED
>>> LOOP END
>>> $ g++ -v
>>> gcc version 9.3.0 (GCC)
>>> $ g++ test.cpp -fopenmp -fdump-tree-all -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
>>> fadd    s0, s0, s1 // not vectorized
>>> ...
>>> fadd    s0, s0, s1 // not vectorized
>>> // There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization.
>>> ```
>>>
>>> For the following test case:
>>> ```
>>> void func(float *b, float *c, float *d, int N) {
>>>    float a[N];
>>>    for (int i = 0; i < N; i++)
>>>      a[i] = 0;
>>>    #pragma omp simd
>>>    for (int i = 1; i < N; i++) {
>>>      d[i] = c[i] + 1.0;
>>>      #pragma omp ordered simd
>>>      a[i] = a[i-1] + 1.0;
>>>    }
>>> }
>>> ```
>>> The IR generated is as follows:
>>> ```
>>> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
>>>    %scevgep = getelementptr float, float* %d, i64 1
>>>    %3 = add nuw nsw i64 %wide.trip.count, 1
>>>    %scevgep41 = getelementptr float, float* %d, i64 %3
>>>    %scevgep43 = getelementptr float, float* %c, i64 1
>>>    %scevgep45 = getelementptr float, float* %c, i64 %3
>>>    %bound0 = icmp ult float* %scevgep, %scevgep45
>>>    %bound1 = icmp ult float* %scevgep43, %scevgep41
>>>    %found.conflict = and i1 %bound0, %bound1
>>>    %induction = fadd <4 x float> %.splat, <float 0.000000e+00, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
>>> ```
>>> The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there.
>>>
>>> All the best,
>>> Peixin
>>>