[Openmp-dev] About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Sun Sep 12 18:47:10 PDT 2021

Hi Johannes,

I don't think this is the right solution. The outlined function will affect the vectorization of simd region and make the simd region not vectorized. I tried to add `noinline` to the `ordered simd` outlined function before, and it breaks the vectorization of the whole simd region.

I agree with you that one possible method is to replicate the ordered code multiple times, but I think this method should be done in vectorization pass.

In addition, one fast way to correct current code generation is to remove `alwaysinline` attribute and add `noinline` attribute. I prefer to dig more on vectorization pass to check how does it stop vectorization of the whole for loop when there is one outlined function inside.

All the best,
Peixin

-----Original Message-----
From: Johannes Doerfert [mailto:jdoerfert at anl.gov] 
Sent: Sunday, September 12, 2021 1:08 AM
To: qiaopeixin <qiaopeixin at huawei.com>
Cc: openmp-dev <openmp-dev at lists.llvm.org>; Alexey Bataev <a.bataev at hotmail.com>
Subject: Re: About discussion of vectorization pass and openmp `simd` and `ordered simd` directives

Also, I thought about vectorizing this again.

I think what you can do to make it vectorize the code is to provide a scalarized function that replicates the ordered code N times and which is then given to the vectorizer as the "vector version" of the outlined function. That way you might need to add `noinline` to the `ordered simd` outlined function but it should allow us to vectorize the code properly.

Let me know if you are interested in that.

~ Johannes

On 9/10/21 7:57 PM, qiaopeixin wrote:
> Hi Johannes,
>
> I agree with your. I can take a try to remove the metadata for the outlined function to correct the current wrong code generation. If Alexey or you wants to do it, please let me know.
>
> All the best,
> Peixin
>
> -----Original Message-----
> From: Johannes Doerfert [mailto:jdoerfert at anl.gov]
> Sent: Saturday, September 11, 2021 3:08 AM
> To: qiaopeixin <qiaopeixin at huawei.com>
> Cc: openmp-dev <openmp-dev at lists.llvm.org>; Alexey Bataev 
> <a.bataev at hotmail.com>
> Subject: Re: About discussion of vectorization pass and openmp `simd` 
> and `ordered simd` directives
>
>
> On 9/8/21 8:10 PM, qiaopeixin wrote:
>> Hi Johannes,
>>
>> First, I would like to thank you and Alexey for the explanations.
>>
>>> As described, to make it at least sound we should not emit the access.group metadata for the call to the outlined function.
>> I would like make sure that I understand what code you want to generate. Not emitting the `access.group` metadata would cause all the code inside simd region not vectorized. Again, for the following code:
>> ```
>> void func(float *a, float *b, float *c, float *d, int N) {
>>     #pragma omp simd
>>     for (int i = 1; i < N; i++) {
>>       d[i] = c[i] + 1.0;
>>       #pragma omp ordered simd
>>       a[i] = a[i-1] + 1.0;
>>     }
>> }
>> ```
>> I tried to generate the IR from clang frontend and delete the metadata by hand. It turns out to break the vectorization of simd region. What I did is as follows:
>> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S -Xclang 
>> -disable-llvm-passes (Change "call void @__captured_stmt(float** 
>> %a.addr, i32* %i3), !llvm.access.group !15" into "call void
>> @__captured_stmt(float** %a.addr, i32* %i3)" by hand.) $ clang++ 
>> -fopenmp -O3 test.ll -c -emit-llvm -S
>> warning: <unknown>:0:0: loop not vectorized: the optimizer was unable 
>> to perform the requested transformation; the transformation might be 
>> disabled or specified as part of an unsupported transformation 
>> ordering [-Wpass-failed=transform-warning]
>> warning: <unknown>:0:0: loop not vectorized: the optimizer was unable 
>> to perform the requested transformation; the transformation might be 
>> disabled or specified as part of an unsupported transformation 
>> ordering [-Wpass-failed=transform-warning]
>> 2 warnings generated.
>>
>> Is this IR what you want to generate for now? I think the expected IR should vectorize the statement of ` d[i] = c[i] + 1.0` and serialize the statement of ` a[i] = a[i-1] + 1.0`. Do you want to delay this generation until vectorization pass supports patial vectorization in one for loop?
> Yes, that is the IR we want to generate until the loop vectorizer can be told to serialize some code.
> At least as far as I know it cannot right now.
>
> We could also investigate other solutions, e.g., unroll and do straight-line code vectorization based on access group metadata. Or split the loop. But everything will require more work.
>
> ~ Johannes
>
>
>> All the best,
>> Peixin
>>
>> -----Original Message-----
>> From: Johannes Doerfert [mailto:jdoerfert at anl.gov]
>> Sent: Thursday, September 9, 2021 12:00 AM
>> To: Alexey Bataev <a.bataev at hotmail.com>
>> Cc: qiaopeixin <qiaopeixin at huawei.com>; openmp-dev 
>> <openmp-dev at lists.llvm.org>
>> Subject: Re: About discussion of vectorization pass and openmp `simd` 
>> and `ordered simd` directives
>>
>> Peixin,
>>
>> are you interested in trying to fix this?
>>
>> As described, to make it at least sound we should not emit the access.group metadata for the call to the outlined function.
>> That will not necessarily resolve your "problem", e.g., that the vectorizer uses memory checks etc., but it will at least stop us from generating wrong code.
>>
>> ~ Johannes
>>
>>
>> On 9/8/21 10:57 AM, Alexey Bataev wrote:
>>> Best regards,
>>> Alexey Bataev
>>>
>>>> 8 сент. 2021 г., в 09:41, Johannes Doerfert <jdoerfert at anl.gov> написал(а):
>>>>
>>>> Hi Peixin,
>>>>
>>>> I think you are right that the code we generate is not correct.
>>>> The problem is not that a[i] is vectorized, the problem is that we 
>>>> might vectorize it without a memory check (with O2 instead of O3, 
>>>> see https://godbolt.org/z/9dnqMKexT).
>>>>
>>>> @Alexey, what was the intention of the outlined ordered region?
>>> No special intention, that was just a straightforward way implementing this feature.
>>>
>>>> I'm not really sure how to handle this best but the access.group on 
>>>> the call to the outlined region seems to be wrong as it implies 
>>>> vectorization is sound while it isn't. WDYT?
>>>>
>>> I agree, looks like the codegen for the vectorizable loops was changed but we did not adjust it for the ordered simd region. Probably, need to fix loop vectorization metadata emission for ordered simd regions.
>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> On 9/8/21 4:41 AM, qiaopeixin wrote:
>>>>> Hi,
>>>>>
>>>>> I would like to discuss the behaviors of openmp `simd` and `ordered simd` directives. I think current Clang may not give expected results as OpenMP 5.0 standard defines.
>>>>>
>>>>> Let's start one c++ example:
>>>>> ```
>>>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>>>      #pragma omp simd
>>>>>      for (int i = 0; i < N; i++) {
>>>>>        d[i] = c[i] + 1.0;
>>>>>        #pragma omp ordered simd
>>>>>        a[i] = b[i] + 1.0;
>>>>>      }
>>>>> }
>>>>> ```
>>>>> What is expected according to OpenMP 5.0 standard is like the following:
>>>>> ```
>>>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>>>      for (int i = 0; i < N; i += 4) {
>>>>>        #pragma omp simd
>>>>>        for (int j = i; j < 4; j++)
>>>>>          d[i] = c[i] + 1.0; // vectorized
>>>>>
>>>>>        for (int j = i; j < 4; j++)
>>>>>          a[i] = b[i] + 1.0; // not vectorized
>>>>>      }
>>>>> }
>>>>> ```
>>>>> It seems that current Clang and LLVM do not support it.
>>>>>
>>>>> Without openmp enabled, clang vectorizes the loop with memcheck as follows:
>>>>> ```
>>>>> $ clang++ -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>>>>      %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>>>>>      %scevgep22 = getelementptr float, float* %a, i64 %wide.trip.count
>>>>>      %scevgep25 = getelementptr float, float* %c, i64 %wide.trip.count
>>>>>      %scevgep28 = getelementptr float, float* %b, i64 %wide.trip.count
>>>>>      %bound030 = icmp ugt float* %scevgep25, %d
>>>>>      %bound131 = icmp ugt float* %scevgep, %c
>>>>>      %found.conflict32 = and i1 %bound030, %bound131
>>>>>      ... fadd <4 x float> ...
>>>>> ```
>>>>>
>>>>> With openmp-simd enabled, clang vectorizes the loop without memcheck. This means that only `simd` directive is enabled, while `ordered simd` directive is disabled. The results are expected.
>>>>> ```
>>>>> clang++ -fopenmp-simd -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>>>> ```
>>>>>
>>>>> With openmp enabled, both `simd` and `ordered simd` directives are enabled. Clang frontend generates the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute when optimization level is more than 0. The generated IR is to vectorize the loop with memcheck as follows:
>>>>> ```
>>>>> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S && cat test.ll
>>>>>      %scevgep = getelementptr float, float* %d, i64 %wide.trip.count
>>>>>      %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>>>>>      %scevgep32 = getelementptr float, float* %c, i64 %wide.trip.count
>>>>>      %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>>>>>      %bound037 = icmp ugt float* %scevgep32, %d
>>>>>      %bound138 = icmp ugt float* %scevgep, %c
>>>>>      %found.conflict39 = and i1 %bound037, %bound138
>>>>>      ... fadd <4 x float> ...
>>>>> ```
>>>>> But the expected IR should be like the following:
>>>>> ```
>>>>>      %scevgep29 = getelementptr float, float* %a, i64 %wide.trip.count
>>>>>      %scevgep35 = getelementptr float, float* %b, i64 %wide.trip.count
>>>>>      %found.conflict...
>>>>>      ... fadd <4 x float> ...
>>>>> ```
>>>>> I have two questions here:
>>>>> 1. Does the outlined function `captured_stmt(float** %a.addr, i32* %i3, float** %b.addr)` with `AlwaysInline` attribute cause the memcheck? And how?
>>>>> 2. If my understanding is correct according to the above analysis, should the codegen of `ordered simd` directive be fixed to support the expected behaviors? And should `memcheck` function (emitMemRuntimeChecks) also support partial check instead of the whole region inside the loop?
>>>>>
>>>>> Also, for the following test case, both of vectorization of `d[i] = c[i] + 1.0;` and `a[i] = a[i-1] + 1.0;` are disabled.
>>>>> ```
>>>>> void func(float *a, float *b, float *c, float *d, int N) {
>>>>>      #pragma omp simd
>>>>>      for (int i = 1; i < N; i++) {
>>>>>        d[i] = c[i] + 1.0;
>>>>>        #pragma omp ordered simd
>>>>>        a[i] = a[i-1] + 1.0;
>>>>>      }
>>>>> }
>>>>> ```
>>>>> What is expected is to vectorize the statement `d[i] = c[i] + 1.0;`.
>>>>> I also test icc and gcc and here are the results:
>>>>> ```
>>>>> $ icc -v
>>>>> icc version 2021.1
>>>>> $ icc -qopenmp test.cpp -O3 -qopt-report -qopt-report-phase=vec -S 
>>>>> && cat test.optrpt LOOP BEGIN at test.cpp(3,3)
>>>>>       remark #15531: Block of statements was serialized due to user request   [ test.cpp(5,5) ]
>>>>>       remark #15301: SIMD LOOP WAS VECTORIZED LOOP END $ g++ -v 
>>>>> gcc version 9.3.0 (GCC) $ g++ test.cpp -fopenmp -fdump-tree-all 
>>>>> -fdump-rtl-all -O3 -ftree-vectorize -S && cat test.s
>>>>> fadd    s0, s0, s1 // not vectorized
>>>>> ...
>>>>> fadd    s0, s0, s1 // not vectorized
>>>>> // There is `GOMP_SIMD_ORDERED_START` and `GOMP_SIMD_ORDERED_END` before and after the statement of `a[i] = a[i-1] + 1.0` in ifcvt pass, after which they are used in vect pass to break the vectorization.
>>>>> ```
>>>>>
>>>>> For the following test case:
>>>>> ```
>>>>> void func(float *b, float *c, float *d, int N) {
>>>>>      float a[N];
>>>>>      for (int i = 0; i < N; i++)
>>>>>        a[i] = 0;
>>>>>      #pragma omp simd
>>>>>      for (int i = 1; i < N; i++) {
>>>>>        d[i] = c[i] + 1.0;
>>>>>        #pragma omp ordered simd
>>>>>        a[i] = a[i-1] + 1.0;
>>>>>      }
>>>>> }
>>>>> ```
>>>>> The IR generated is as follows:
>>>>> ```
>>>>> $ clang++ -fopenmp -O3 test.cpp -c -emit-llvm -S
>>>>>      %scevgep = getelementptr float, float* %d, i64 1
>>>>>      %3 = add nuw nsw i64 %wide.trip.count, 1
>>>>>      %scevgep41 = getelementptr float, float* %d, i64 %3
>>>>>      %scevgep43 = getelementptr float, float* %c, i64 1
>>>>>      %scevgep45 = getelementptr float, float* %c, i64 %3
>>>>>      %bound0 = icmp ult float* %scevgep, %scevgep45
>>>>>      %bound1 = icmp ult float* %scevgep43, %scevgep41
>>>>>      %found.conflict = and i1 %bound0, %bound1
>>>>>      %induction = fadd <4 x float> %.splat, <float 0.000000e+00, 
>>>>> float 1.000000e+00, float 2.000000e+00, float 3.000000e+00> ``` 
>>>>> The result for the statement of `d[i] = c[i] + 1.0` and `a[i] = a[i-1] + 1.0` are both unexpected. It is safe to vectorize the statement of `a[i] = a[i-1] + 1.0` although it violates the definition of ordered construct in OpenMP 5.0 standard. But the memcheck of variables `d` and `c` should not be correct as the `simd` directive is there.
>>>>>
>>>>> All the best,
>>>>> Peixin
>>>>>