[LLVMdev] Why is the loop vectorizer not working on my function?

Frank Winter fwinter at jlab.org
Sat Oct 26 13:10:03 PDT 2013


If I leave '-march=x86-64' out, it works for me too.

Regarding this message:

LV: We need a runtime memory check.
LV: We can vectorize this loop (with a runtime bound check)!

I know that my pointers neither alias each other nor point to
overlapping memory regions.

Is there a way to mark my pointers 'noalias' after loading, e.g. after

   %7 = load float** %6

in order to avoid the runtime check?
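
For context, the runtime check is, as far as I understand it, an overlap
test between the written range and each of the two read ranges, with a
fallback to the scalar loop when they overlap. Below is a minimal C sketch
of that idea. The names ranges_overlap, bar_noalias and bar_checked are
made up for illustration, and the check the vectorizer actually emits in
IR may differ in detail. At the C level the no-overlap promise is usually
written with restrict on the parameters, which clang lowers to noalias
function arguments. That does not by itself cover a pointer obtained
from a load, which is the case I am asking about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if [a, a+n) and [b, b+n) share an element. */
static int ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t a0 = (uintptr_t)a, a1 = (uintptr_t)(a + n);
    uintptr_t b0 = (uintptr_t)b, b1 = (uintptr_t)(b + n);
    return a0 < b1 && b0 < a1;
}

/* Loop with the no-overlap promise stated in the signature (C99 restrict). */
static void bar_noalias(int start, int end, float *restrict A,
                        const float *restrict B, const float *restrict C)
{
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
}

/* Returns 1 if the no-alias path was taken, 0 if the fallback loop ran. */
int bar_checked(int start, int end, float *A, const float *B, const float *C)
{
    if (start >= end)
        return 1; /* empty range, nothing to do */
    size_t n = (size_t)((long long)end - start);
    /* Two comparisons: the written range against each read range. */
    if (!ranges_overlap(A + start, B + start, n) &&
        !ranges_overlap(A + start, C + start, n)) {
        bar_noalias(start, end, A, B, C);
        return 1;
    }
    /* Ranges overlap: keep strict element-by-element order. */
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
    return 0;
}
```

If the pointers could be marked as non-aliasing up front, only the body of
bar_noalias would be needed and the two comparisons would go away.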

Frank



On 26/10/13 15:54, Hal Finkel wrote:
> ----- Original Message -----
>>>>> LV: The Widest type: 32 bits.
>>>>> LV: The Widest register is: 32 bits.
>> Yep, we don’t pick up the right TTI.
>>
>> Try -march=x86-64 (or leave it out; you already have this info in the
>> triple).
>>
>> Then it should work (does for me with your example below).
> That may depend on what CPU it picks by default; Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
>
>   -Hal
>
>>
>> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>>
>>> Hi Hal!
>>>
>>> I am using the 'x86_64' target. Below is the complete module dump,
>>> and here is the command line:
>>>
>>> opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
>>>
>>> Frank
>>>
>>>
>>> ; ModuleID = 'test.ll'
>>>
>>> target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
>>>
>>> target triple = "x86_64-unknown-linux-elf"
>>>
>>> define void @bar([8 x i8]* %arg_ptr) {
>>> entrypoint:
>>>   %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>   %1 = load i32* %0
>>>   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>   %3 = bitcast [8 x i8]* %2 to i32*
>>>   %4 = load i32* %3
>>>   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>   %6 = bitcast [8 x i8]* %5 to float**
>>>   %7 = load float** %6
>>>   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>   %9 = bitcast [8 x i8]* %8 to float**
>>>   %10 = load float** %9
>>>   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>   %12 = bitcast [8 x i8]* %11 to float**
>>>   %13 = load float** %12
>>>   br label %L0
>>>
>>> L0:                                               ; preds = %L0, %entrypoint
>>>   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>   %15 = getelementptr float* %10, i32 %14
>>>   %16 = load float* %15
>>>   %17 = getelementptr float* %13, i32 %14
>>>   %18 = load float* %17
>>>   %19 = fmul float %18, %16
>>>   %20 = getelementptr float* %7, i32 %14
>>>   store float %19, float* %20
>>>   %21 = add nsw i32 %14, 1
>>>   %22 = icmp sge i32 %21, %4
>>>   br i1 %22, label %L1, label %L0
>>>
>>> L1:                                               ; preds = %L0
>>>   ret void
>>> }
>>>
>>>
>>>
>>> On 26/10/13 15:08, Hal Finkel wrote:
>>>> ----- Original Message -----
>>>>> Hi Arnold,
>>>>>
>>>>> adding '-debug-only=loop-vectorize' to the command gives:
>>>>>
>>>>> LV: Checking a loop in "bar"
>>>>> LV: Found a loop: L0
>>>>> LV: Found an induction variable.
>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>> LV: We need to do 2 pointer comparisons.
>>>>> LV: We can't vectorize because we can't find the array bounds.
>>>>> LV: Can't vectorize due to memory conflicts
>>>>> LV: Not vectorizing.
>>>>>
>>>>> It can't find the loop bounds if we use the overflowing (wrapping)
>>>>> version of add. That's a good point. I should mark this addition as
>>>>> non-overflowing.
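
One likely reason the wrapping add gets in the way (my reading; the log
only says the bounds cannot be found): the i32 index is sign-extended to
64 bits for the address computation, so if the increment may wrap, the
address no longer advances by a constant stride. A small C sketch of that
effect follows. The function names are made up, and unsigned arithmetic
stands in for the wrapping add because signed overflow is undefined in C.

```c
#include <stdint.h>

/* Emulates `add i32 %x, 1` without nsw: the result wraps modulo 2^32.
   The unsigned-to-signed conversion is implementation-defined in C;
   common compilers (GCC, Clang) wrap as assumed here. */
static int32_t add_wrap(int32_t x)
{
    return (int32_t)((uint32_t)x + 1u);
}

/* Byte distance between the elements addressed in two consecutive
   iterations, with the i32 index sign-extended to 64 bits as on a
   64-bit target. */
int64_t stride_bytes(int32_t i)
{
    int64_t cur  = (int64_t)i * (int64_t)sizeof(float);
    int64_t next = (int64_t)add_wrap(i) * (int64_t)sizeof(float);
    return next - cur;
}
```

For ordinary indices the stride is sizeof(float). At i = INT32_MAX the
wrapped index jumps to INT32_MIN and the stride becomes a large negative
number, which is the case nsw rules out.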
>>>>>
>>>>> When using the non-overflow version I get:
>>>>>
>>>>> LV: Checking a loop in "bar"
>>>>> LV: Found a loop: L0
>>>>> LV: Found an induction variable.
>>>>> LV: Found an unidentified write ptr:   %7 = load float** %6
>>>>> LV: Found an unidentified read ptr:   %10 = load float** %9
>>>>> LV: Found an unidentified read ptr:   %13 = load float** %12
>>>>> LV: Found a runtime check ptr:  %20 = getelementptr float* %7, i32 %14
>>>>> LV: Found a runtime check ptr:  %15 = getelementptr float* %10, i32 %14
>>>>> LV: Found a runtime check ptr:  %17 = getelementptr float* %13, i32 %14
>>>>> LV: We need to do 2 pointer comparisons.
>>>>> LV: We can perform a memory runtime check if needed.
>>>>> LV: We need a runtime memory check.
>>>>> LV: We can vectorize this loop (with a runtime bound check)!
>>>>> LV: Found trip count: 0
>>>>> LV: The Widest type: 32 bits.
>>>>> LV: The Widest register is: 32 bits.
>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %15 = getelementptr float* %10, i32 %14
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %16 = load float* %15
>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %17 = getelementptr float* %13, i32 %14
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %18 = load float* %17
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %19 = fmul float %18, %16
>>>>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %20 = getelementptr float* %7, i32 %14
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   store float %19, float* %20
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %21 = add nsw i32 %14, 1
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %22 = icmp sge i32 %21, %4
>>>>> LV: Found an estimated cost of 1 for VF 1 For instruction:   br i1 %22, label %L1, label %L0
>>>>> LV: Scalar loop costs: 7.
>>>>> LV: Selecting VF = : 1.
>>>>> LV: The target has 8 vector registers
>>>>> LV(REG): Calculating max register usage:
>>>>> LV(REG): At #0 Interval # 0
>>>>> LV(REG): At #1 Interval # 1
>>>>> LV(REG): At #2 Interval # 2
>>>>> LV(REG): At #3 Interval # 2
>>>>> LV(REG): At #4 Interval # 3
>>>>> LV(REG): At #5 Interval # 3
>>>>> LV(REG): At #6 Interval # 2
>>>>> LV(REG): At #8 Interval # 1
>>>>> LV(REG): At #9 Interval # 1
>>>>> LV(REG): Found max usage: 3
>>>>> LV(REG): Found invariant usage: 5
>>>>> LV(REG): LoopSize: 11
>>>>> LV: Vectorization is possible but not beneficial.
>>>>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
>>>>> LV: Unroll Factor is 1
>>>>>
>>>>> It's not beneficial? I didn't expect that. Do you have a
>>>>> descriptive
>>>>> explanation why it's not beneficial?
>>>> It looks like the vectorizer is not picking up a TTI
>>>> implementation from a target with vector registers (likely,
>>>> you're just seeing the basic cost model). For what target is
>>>> this?
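
The two "Widest" lines in the log fit this explanation. As far as I
understand the vectorizer, the largest vectorization factor it considers
is roughly the widest register width divided by the widest element type,
so a 32-bit "register" leaves room for only one float. A sketch of that
arithmetic, with a made-up helper name max_vf; the real cost model does
considerably more than this.

```c
/* Hypothetical helper: upper bound on the vectorization factor (VF),
   i.e. how many elements of the widest type fit in one register. */
unsigned max_vf(unsigned widest_register_bits, unsigned widest_type_bits)
{
    if (widest_type_bits == 0)
        return 1; /* no sized type found: stay scalar */
    unsigned vf = widest_register_bits / widest_type_bits;
    return vf == 0 ? 1 : vf;
}
```

With the values from the log, max_vf(32, 32) is 1. With 128-bit vector
registers the same float loop could use up to 4 lanes.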
>>>>
>>>>   -Hal
>>>>
>>>>> Frank
>>>>>
>>>>>
>>>>>
>>>>> On 26/10/13 13:03, Arnold wrote:
>>>>>> Hi Frank,
>>>>>>
>>>>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> My function implements a simple loop:
>>>>>>>
>>>>>>> void bar( int start, int end, float* A, float* B, float* C)
>>>>>>> {
>>>>>>>      for (int i=start; i<end;++i)
>>>>>>>         A[i] = B[i] * C[i];
>>>>>>> }
>>>>>>>
>>>>>>> This looks pretty much like the standard example. However, I built
>>>>>>> the function with the IRBuilder, so it does not come from C and
>>>>>>> clang. I also changed the function's signature slightly:
>>>>>>>
>>>>>>> define void @bar([8 x i8]* %arg_ptr) {
>>>>>>> entrypoint:
>>>>>>>    %0 = bitcast [8 x i8]* %arg_ptr to i32*
>>>>>>>    %1 = load i32* %0
>>>>>>>    %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
>>>>>>>    %3 = bitcast [8 x i8]* %2 to i32*
>>>>>>>    %4 = load i32* %3
>>>>>>>    %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
>>>>>>>    %6 = bitcast [8 x i8]* %5 to float**
>>>>>>>    %7 = load float** %6
>>>>>>>    %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
>>>>>>>    %9 = bitcast [8 x i8]* %8 to float**
>>>>>>>    %10 = load float** %9
>>>>>>>    %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
>>>>>>>    %12 = bitcast [8 x i8]* %11 to float**
>>>>>>>    %13 = load float** %12
>>>>>>>    br label %L0
>>>>>>>
>>>>>>> L0:                                               ; preds = %L0, %entrypoint
>>>>>>>    %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
>>>>>>>    %15 = getelementptr float* %10, i32 %14
>>>>>>>    %16 = load float* %15
>>>>>>>    %17 = getelementptr float* %13, i32 %14
>>>>>>>    %18 = load float* %17
>>>>>>>    %19 = fmul float %18, %16
>>>>>>>    %20 = getelementptr float* %7, i32 %14
>>>>>>>    store float %19, float* %20
>>>>>>>    %21 = add i32 %14, 1
>>>>>> Try
>>>>>> %21 = add nsw i32 %14, 1
>>>>>> instead for no-signed wrapping arithmetic.
>>>>>>
>>>>>> If that is not working please post the output of opt ...
>>>>>> -debug-only=loop-vectorize ...
>>>>>>
>>>>>>
>>>>>>
>>>>>>>    %22 = icmp sge i32 %21, %4
>>>>>>>    br i1 %22, label %L1, label %L0
>>>>>>>
>>>>>>> L1:                                               ; preds = %L0
>>>>>>>    ret void
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> As you can see, I use the phi instruction for the loop index. I
>>>>>>> notice that clang prefers stack allocation. So I am not sure why
>>>>>>> the loop vectorizer is not working here.
>>>>>>> I tried many things, like specifying an architecture with vector
>>>>>>> units and forcing the vector width. No success.
>>>>>>>
>>>>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
>>>>>>>
>>>>>>> The only explanation I have is the use of the phi instruction. Is
>>>>>>> this preventing the loop from being vectorized?
>>>>>>>
>>>>>>> Frank
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>
>>>
>>


-- 
-----------------------------------------------------------
Dr Frank Winter
Scientific Computing Group
Jefferson Lab, 12000 Jefferson Ave, CEBAF Centre, Room F216
Newport News, VA 23606, USA
Tel: +1-757-269-6448
EMail: fwinter at jlab.org
-----------------------------------------------------------



