[LLVMdev] Why is the loop vectorizer not working on my function?
Arnold Schwaighofer
aschwaighofer at apple.com
Sat Oct 26 12:47:04 PDT 2013
>>> LV: The Widest type: 32 bits.
>>> LV: The Widest register is: 32 bits.
Yep, we don’t pick up the right TTI.
Try -march=x86-64 (or leave it out) you already have this info in the triple.
Then it should work (does for me with your example below).
On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
> Hi Hal!
>
> I am using the 'x86_64' target. Below the complete module dump and here the command line:
>
> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
>
> Frank
>
>
> ; ModuleID = 'test.ll'
>
> target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:12
> 8:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>
> target triple = "x86_64-unknown-linux-elf"
>
> define void @bar([8 x i8]* %arg_ptr) {
> entrypoint:
> %0 = bitcast [8 x i8]* %arg_ptr to i32*
> %1 = load i32* %0
> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
> %3 = bitcast [8 x i8]* %2 to i32*
> %4 = load i32* %3
> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
> %6 = bitcast [8 x i8]* %5 to float**
> %7 = load float** %6
> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
> %9 = bitcast [8 x i8]* %8 to float**
> %10 = load float** %9
> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
> %12 = bitcast [8 x i8]* %11 to float**
> %13 = load float** %12
> br label %L0
>
> L0: ; preds = %L0, %entrypoint
> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> %15 = getelementptr float* %10, i32 %14
> %16 = load float* %15
> %17 = getelementptr float* %13, i32 %14
> %18 = load float* %17
> %19 = fmul float %18, %16
> %20 = getelementptr float* %7, i32 %14
> store float %19, float* %20
> %21 = add nsw i32 %14, 1
> %22 = icmp sge i32 %21, %4
> br i1 %22, label %L1, label %L0
>
> L1: ; preds = %L0
> ret void
> }
>
>
>
> On 26/10/13 15:08, Hal Finkel wrote:
>> ----- Original Message -----
>>> Hi Arnold,
>>>
>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>
>>> LV: Checking a loop in "bar"
>>> LV: Found a loop: L0
>>> LV: Found an induction variable.
>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>> LV: We need to do 2 pointer comparisons.
>>> LV: We can't vectorize because we can't find the array bounds.
>>> LV: Can't vectorize due to memory conflicts
>>> LV: Not vectorizing.
>>>
>>> It can't find the loop bounds if we use the overflow version of add.
>>> That's a good point. I should mark this addition to not overflow.
>>>
>>> When using the non-overflow version I get:
>>>
>>> LV: Checking a loop in "bar"
>>> LV: Found a loop: L0
>>> LV: Found an induction variable.
>>> LV: Found an unidentified write ptr: %7 = load float** %6
>>> LV: Found an unidentified read ptr: %10 = load float** %9
>>> LV: Found an unidentified read ptr: %13 = load float** %12
>>> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32
>>> %14
>>> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32
>>> %14
>>> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32
>>> %14
>>> LV: We need to do 2 pointer comparisons.
>>> LV: We can perform a memory runtime check if needed.
>>> LV: We need a runtime memory check.
>>> LV: We can vectorize this loop (with a runtime bound check)!
>>> LV: Found trip count: 0
>>> LV: The Widest type: 32 bits.
>>> LV: The Widest register is: 32 bits.
>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 =
>>> phi
>>> i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 =
>>> getelementptr float* %10, i32 %14
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 =
>>> load
>>> float* %15
>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 =
>>> getelementptr float* %13, i32 %14
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 =
>>> load
>>> float* %17
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 =
>>> fmul
>>> float %18, %16
>>> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 =
>>> getelementptr float* %7, i32 %14
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: store
>>> float
>>> %19, float* %20
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 =
>>> add
>>> nsw i32 %14, 1
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 =
>>> icmp
>>> sge i32 %21, %4
>>> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1
>>> %22,
>>> label %L1, label %L0
>>> LV: Scalar loop costs: 7.
>>> LV: Selecting VF = : 1.
>>> LV: The target has 8 vector registers
>>> LV(REG): Calculating max register usage:
>>> LV(REG): At #0 Interval # 0
>>> LV(REG): At #1 Interval # 1
>>> LV(REG): At #2 Interval # 2
>>> LV(REG): At #3 Interval # 2
>>> LV(REG): At #4 Interval # 3
>>> LV(REG): At #5 Interval # 3
>>> LV(REG): At #6 Interval # 2
>>> LV(REG): At #8 Interval # 1
>>> LV(REG): At #9 Interval # 1
>>> LV(REG): Found max usage: 3
>>> LV(REG): Found invariant usage: 5
>>> LV(REG): LoopSize: 11
>>> LV: Vectorization is possible but not beneficial.
>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>> LV: Unroll Factor is 1
>>>
>>> It's not beneficial? I didn't expect that. Do you have a descriptive
>>> explanation why it's not beneficial?
>> It looks like the vectorizer is not picking up a TTI implementation from a target with vector registers (likely, you're just seeing the basic cost model). For what target is this?
>>
>> -Hal
>>
>>> Frank
>>>
>>>
>>>
>>> On 26/10/13 13:03, Arnold wrote:
>>>> Hi Frank,
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org>
>>>>> wrote:
>>>>>
>>>>> My function implements a simple loop:
>>>>>
>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>> {
>>>>> for (int i=start; i<end;++i)
>>>>> A[i] = B[i] * C[i];
>>>>> }
>>>>>
>>>>> This looks pretty much like the standard example. However, I built
>>>>> the function
>>>>> with the IRBuilder, thus not coming from C and clang. Also I
>>>>> changed slightly
>>>>> the function's signature:
>>>>>
>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>> entrypoint:
>>>>> %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>> %1 = load i32* %0
>>>>> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>> %3 = bitcast [8 x i8]* %2 to i32*
>>>>> %4 = load i32* %3
>>>>> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>> %6 = bitcast [8 x i8]* %5 to float**
>>>>> %7 = load float** %6
>>>>> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>> %9 = bitcast [8 x i8]* %8 to float**
>>>>> %10 = load float** %9
>>>>> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>> %12 = bitcast [8 x i8]* %11 to float**
>>>>> %13 = load float** %12
>>>>> br label %L0
>>>>>
>>>>> L0: ; preds = %L0,
>>>>> %entrypoint
>>>>> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>> %15 = getelementptr float* %10, i32 %14
>>>>> %16 = load float* %15
>>>>> %17 = getelementptr float* %13, i32 %14
>>>>> %18 = load float* %17
>>>>> %19 = fmul float %18, %16
>>>>> %20 = getelementptr float* %7, i32 %14
>>>>> store float %19, float* %20
>>>>> %21 = add i32 %14, 1
>>>> Try
>>>> %21 = add nsw i32 %14, 1
>>>> instead for no-signed wrapping arithmetic.
>>>>
>>>> If that is not working please post the output of opt ...
>>>> -debug-only=loop-vectorize ...
>>>>
>>>>
>>>>
>>>>> %22 = icmp sge i32 %21, %4
>>>>> br i1 %22, label %L1, label %L0
>>>>>
>>>>> L1: ; preds = %L0
>>>>> ret void
>>>>> }
>>>>>
>>>>>
>>>>> As you can see, I use the phi instruction for the loop index. I
>>>>> notice
>>>>> that clang prefers stack allocation. So, I am not sure what's the
>>>>> problem that the loop vectorizer is not working here.
>>>>> I tried many things, like specifying an architecture with vector
>>>>> units, enforcing the vector width. No success.
>>>>>
>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>>
>>>>> The only explanation I have is the use of the phi instruction. Is
>>>>> this
>>>>> preventing to vectorize the loop?
>>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>
>
More information about the llvm-dev
mailing list