[LLVMdev] Why is the loop vectorizer not working on my function?
Hal Finkel
hfinkel at anl.gov
Sat Oct 26 12:54:46 PDT 2013
----- Original Message -----
> >>> LV: The Widest type: 32 bits.
> >>> LV: The Widest register is: 32 bits.
>
> Yep, we don’t pick up the right TTI.
>
> Try -march=x86-64 (or leave it out; you already have this info in the
> triple).
>
> Then it should work (does for me with your example below).
That may depend on which CPU it picks by default. Frank, if it does not work for you, try specifying a target CPU (-mcpu=whatever).
-Hal
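
For reference, the two widths in the debug output bound the vectorization factor (VF) that the cost model will consider. Below is a minimal C sketch of that arithmetic only, under the assumption that the largest VF is the register width divided by the widest element type. max_vector_factor is a hypothetical helper for illustration, not an LLVM function, and the real cost model also weighs per-instruction costs.

```c
#include <assert.h>

/* Hypothetical helper, not an LLVM API: the largest vectorization
   factor (VF) that fits, given the two widths printed in the log. */
unsigned max_vector_factor(unsigned widest_register_bits,
                           unsigned widest_type_bits)
{
    assert(widest_type_bits > 0);
    unsigned vf = widest_register_bits / widest_type_bits;
    return vf > 0 ? vf : 1; /* a VF of 1 means the loop stays scalar */
}
```

With the values from the log, max_vector_factor(32, 32) gives 1, which leaves nothing to vectorize. Assuming a target that reports 128-bit vector registers, the same float loop would allow up to 4 lanes.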
>
>
> On Oct 26, 2013, at 2:16 PM, Frank Winter <fwinter at jlab.org> wrote:
>
> > Hi Hal!
> >
> > I am using the 'x86_64' target. Below the complete module dump and
> > here the command line:
> >
> > opt -march=x64-64 -loop-vectorize -debug-only=loop-vectorize -S test.ll
> >
> > Frank
> >
> >
> > ; ModuleID = 'test.ll'
> >
> > target datalayout = "e-p:64:64:64-S128-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f16:16:16-f32:32:32-f64:64:64-f128:128:128-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
> >
> > target triple = "x86_64-unknown-linux-elf"
> >
> > define void @bar([8 x i8]* %arg_ptr) {
> > entrypoint:
> > %0 = bitcast [8 x i8]* %arg_ptr to i32*
> > %1 = load i32* %0
> > %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
> > %3 = bitcast [8 x i8]* %2 to i32*
> > %4 = load i32* %3
> > %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
> > %6 = bitcast [8 x i8]* %5 to float**
> > %7 = load float** %6
> > %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
> > %9 = bitcast [8 x i8]* %8 to float**
> > %10 = load float** %9
> > %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
> > %12 = bitcast [8 x i8]* %11 to float**
> > %13 = load float** %12
> > br label %L0
> >
> > L0: ; preds = %L0, %entrypoint
> > %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> > %15 = getelementptr float* %10, i32 %14
> > %16 = load float* %15
> > %17 = getelementptr float* %13, i32 %14
> > %18 = load float* %17
> > %19 = fmul float %18, %16
> > %20 = getelementptr float* %7, i32 %14
> > store float %19, float* %20
> > %21 = add nsw i32 %14, 1
> > %22 = icmp sge i32 %21, %4
> > br i1 %22, label %L1, label %L0
> >
> > L1: ; preds = %L0
> > ret void
> > }
> >
> >
> >
> > On 26/10/13 15:08, Hal Finkel wrote:
> >> ----- Original Message -----
> >>> Hi Arnold,
> >>>
> >>> adding '-debug-only=loop-vectorize' to the command gives:
> >>>
> >>> LV: Checking a loop in "bar"
> >>> LV: Found a loop: L0
> >>> LV: Found an induction variable.
> >>> LV: Found an unidentified write ptr: %7 = load float** %6
> >>> LV: Found an unidentified read ptr: %10 = load float** %9
> >>> LV: Found an unidentified read ptr: %13 = load float** %12
> >>> LV: We need to do 2 pointer comparisons.
> >>> LV: We can't vectorize because we can't find the array bounds.
> >>> LV: Can't vectorize due to memory conflicts
> >>> LV: Not vectorizing.
> >>>
> >>> It can't find the loop bounds if we use the overflow version of
> >>> add.
> >>> That's a good point. I should mark this addition to not overflow.
> >>>
> >>> When using the non-overflow version I get:
> >>>
> >>> LV: Checking a loop in "bar"
> >>> LV: Found a loop: L0
> >>> LV: Found an induction variable.
> >>> LV: Found an unidentified write ptr: %7 = load float** %6
> >>> LV: Found an unidentified read ptr: %10 = load float** %9
> >>> LV: Found an unidentified read ptr: %13 = load float** %12
> >>> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
> >>> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
> >>> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
> >>> LV: We need to do 2 pointer comparisons.
> >>> LV: We can perform a memory runtime check if needed.
> >>> LV: We need a runtime memory check.
> >>> LV: We can vectorize this loop (with a runtime bound check)!
> >>> LV: Found trip count: 0
> >>> LV: The Widest type: 32 bits.
> >>> LV: The Widest register is: 32 bits.
> >>> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> >>> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
> >>> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
> >>> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
> >>> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
> >>> LV: Scalar loop costs: 7.
> >>> LV: Selecting VF = : 1.
> >>> LV: The target has 8 vector registers
> >>> LV(REG): Calculating max register usage:
> >>> LV(REG): At #0 Interval # 0
> >>> LV(REG): At #1 Interval # 1
> >>> LV(REG): At #2 Interval # 2
> >>> LV(REG): At #3 Interval # 2
> >>> LV(REG): At #4 Interval # 3
> >>> LV(REG): At #5 Interval # 3
> >>> LV(REG): At #6 Interval # 2
> >>> LV(REG): At #8 Interval # 1
> >>> LV(REG): At #9 Interval # 1
> >>> LV(REG): Found max usage: 3
> >>> LV(REG): Found invariant usage: 5
> >>> LV(REG): LoopSize: 11
> >>> LV: Vectorization is possible but not beneficial.
> >>> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
> >>> LV: Unroll Factor is 1
> >>>
> >>> It's not beneficial? I didn't expect that. Do you have a
> >>> descriptive
> >>> explanation why it's not beneficial?
> >> It looks like the vectorizer is not picking up a TTI
> >> implementation from a target with vector registers (likely,
> >> you're just seeing the basic cost model). For what target is
> >> this?
> >>
> >> -Hal
> >>
> >>> Frank
> >>>
> >>>
> >>>
> >>> On 26/10/13 13:03, Arnold wrote:
> >>>> Hi Frank,
> >>>>
> >>>>> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
> >>>>>
> >>>>> My function implements a simple loop:
> >>>>>
> >>>>> void bar( int start, int end, float* A, float* B, float* C)
> >>>>> {
> >>>>> for (int i=start; i<end;++i)
> >>>>> A[i] = B[i] * C[i];
> >>>>> }
> >>>>>
> >>>>> This looks pretty much like the standard example. However, I
> >>>>> built the function with the IRBuilder, so it does not come from
> >>>>> C via clang. I also slightly changed the function's signature:
> >>>>>
> >>>>> define void @bar([8 x i8]* %arg_ptr) {
> >>>>> entrypoint:
> >>>>> %0 = bitcast [8 x i8]* %arg_ptr to i32*
> >>>>> %1 = load i32* %0
> >>>>> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
> >>>>> %3 = bitcast [8 x i8]* %2 to i32*
> >>>>> %4 = load i32* %3
> >>>>> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
> >>>>> %6 = bitcast [8 x i8]* %5 to float**
> >>>>> %7 = load float** %6
> >>>>> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
> >>>>> %9 = bitcast [8 x i8]* %8 to float**
> >>>>> %10 = load float** %9
> >>>>> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
> >>>>> %12 = bitcast [8 x i8]* %11 to float**
> >>>>> %13 = load float** %12
> >>>>> br label %L0
> >>>>>
> >>>>> L0: ; preds = %L0, %entrypoint
> >>>>> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> >>>>> %15 = getelementptr float* %10, i32 %14
> >>>>> %16 = load float* %15
> >>>>> %17 = getelementptr float* %13, i32 %14
> >>>>> %18 = load float* %17
> >>>>> %19 = fmul float %18, %16
> >>>>> %20 = getelementptr float* %7, i32 %14
> >>>>> store float %19, float* %20
> >>>>> %21 = add i32 %14, 1
> >>>> Try
> >>>> %21 = add nsw i32 %14, 1
> >>>> instead for no-signed wrapping arithmetic.
> >>>>
> >>>> If that is not working please post the output of opt ...
> >>>> -debug-only=loop-vectorize ...
> >>>>
> >>>>
> >>>>
> >>>>> %22 = icmp sge i32 %21, %4
> >>>>> br i1 %22, label %L1, label %L0
> >>>>>
> >>>>> L1: ; preds = %L0
> >>>>> ret void
> >>>>> }
> >>>>>
> >>>>>
> >>>>> As you can see, I use the phi instruction for the loop index. I
> >>>>> notice that clang prefers stack allocation. So I am not sure why
> >>>>> the loop vectorizer is not working here.
> >>>>> I tried many things, like specifying an architecture with vector
> >>>>> units and forcing the vector width. No success.
> >>>>>
> >>>>> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
> >>>>>
> >>>>> The only explanation I have is the use of the phi instruction.
> >>>>> Is this preventing the loop from being vectorized?
> >>>>>
> >>>>> Frank
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> LLVM Developers mailing list
> >>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>>
> >
> >
>
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
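
A note on the "runtime bound check" in the log above: the three pointers are loaded from memory, so nothing proves at compile time that they do not alias. Conceptually, the vectorizer guards the vector loop with overlap tests and keeps a scalar loop as the fallback. Here is a rough C sketch of that idea, assuming the two comparisons are the written range against each of the two read ranges. ranges_disjoint and bar_checked are hypothetical names, and this is not the code LLVM actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the n-float ranges starting at p and q share no bytes. */
bool ranges_disjoint(const float *p, const float *q, size_t n)
{
    uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
    uintptr_t bytes = n * sizeof(float);
    return ps + bytes <= qs || qs + bytes <= ps;
}

/* A[i] = B[i] * C[i] for i in [start, end), guarded by two overlap
   tests: the written range against each read range. */
void bar_checked(int start, int end, float *A, const float *B, const float *C)
{
    assert(A && B && C);
    if (end <= start)
        return;
    size_t n = (size_t)(end - start);
    if (ranges_disjoint(A + start, B + start, n) &&
        ranges_disjoint(A + start, C + start, n)) {
        /* No overlap: iterations are independent, so a compiler is
           free to process several of them per step here. */
        for (int i = start; i < end; ++i)
            A[i] = B[i] * C[i];
    } else {
        /* Possible overlap: keep strict scalar order. */
        for (int i = start; i < end; ++i)
            A[i] = B[i] * C[i];
    }
}
```

Both branches compute the same result; the point of the split is that only the first one carries the no-overlap guarantee that makes a wider loop safe.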