[LLVMdev] Why is the loop vectorizer not working on my function?
Hal Finkel
hfinkel at anl.gov
Sat Oct 26 12:08:28 PDT 2013
----- Original Message -----
> Hi Arnold,
>
> adding '-debug-only=loop-vectorize' to the command gives:
>
> LV: Checking a loop in "bar"
> LV: Found a loop: L0
> LV: Found an induction variable.
> LV: Found an unidentified write ptr: %7 = load float** %6
> LV: Found an unidentified read ptr: %10 = load float** %9
> LV: Found an unidentified read ptr: %13 = load float** %12
> LV: We need to do 2 pointer comparisons.
> LV: We can't vectorize because we can't find the array bounds.
> LV: Can't vectorize due to memory conflicts
> LV: Not vectorizing.
>
> It can't find the loop bounds if we use the overflow version of add.
> That's a good point. I should mark this addition as not overflowing.
>
> When using the non-overflow version I get:
>
> LV: Checking a loop in "bar"
> LV: Found a loop: L0
> LV: Found an induction variable.
> LV: Found an unidentified write ptr: %7 = load float** %6
> LV: Found an unidentified read ptr: %10 = load float** %9
> LV: Found an unidentified read ptr: %13 = load float** %12
> LV: Found a runtime check ptr: %20 = getelementptr float* %7, i32 %14
> LV: Found a runtime check ptr: %15 = getelementptr float* %10, i32 %14
> LV: Found a runtime check ptr: %17 = getelementptr float* %13, i32 %14
> LV: We need to do 2 pointer comparisons.
> LV: We can perform a memory runtime check if needed.
> LV: We need a runtime memory check.
> LV: We can vectorize this loop (with a runtime bound check)!
> LV: Found trip count: 0
> LV: The Widest type: 32 bits.
> LV: The Widest register is: 32 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction: %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> LV: Found an estimated cost of 0 for VF 1 For instruction: %15 = getelementptr float* %10, i32 %14
> LV: Found an estimated cost of 1 for VF 1 For instruction: %16 = load float* %15
> LV: Found an estimated cost of 0 for VF 1 For instruction: %17 = getelementptr float* %13, i32 %14
> LV: Found an estimated cost of 1 for VF 1 For instruction: %18 = load float* %17
> LV: Found an estimated cost of 1 for VF 1 For instruction: %19 = fmul float %18, %16
> LV: Found an estimated cost of 0 for VF 1 For instruction: %20 = getelementptr float* %7, i32 %14
> LV: Found an estimated cost of 1 for VF 1 For instruction: store float %19, float* %20
> LV: Found an estimated cost of 1 for VF 1 For instruction: %21 = add nsw i32 %14, 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: %22 = icmp sge i32 %21, %4
> LV: Found an estimated cost of 1 for VF 1 For instruction: br i1 %22, label %L1, label %L0
> LV: Scalar loop costs: 7.
> LV: Selecting VF = : 1.
> LV: The target has 8 vector registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 2
> LV(REG): At #4 Interval # 3
> LV(REG): At #5 Interval # 3
> LV(REG): At #6 Interval # 2
> LV(REG): At #8 Interval # 1
> LV(REG): At #9 Interval # 1
> LV(REG): Found max usage: 3
> LV(REG): Found invariant usage: 5
> LV(REG): LoopSize: 11
> LV: Vectorization is possible but not beneficial.
> LV: Found a vectorizable loop (1) in saxpy_real.gvn.mod.ll
> LV: Unroll Factor is 1
>
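The "runtime memory check" reported in the log above can be pictured in C roughly as follows. This is a minimal sketch under stated assumptions, not the code LLVM emits: the names `ranges_overlap` and `bar_checked` are made up for illustration, and the check-free path is written as an ordinary loop standing in for the widened one. The two comparisons correspond to the write pointer A being checked against each of the read pointers B and C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: do [p, p+n) and [q, q+n) share any element? */
static int ranges_overlap(const float *p, const float *q, size_t n) {
    uintptr_t ps = (uintptr_t)p, pe = (uintptr_t)(p + n);
    uintptr_t qs = (uintptr_t)q, qe = (uintptr_t)(q + n);
    return ps < qe && qs < pe;
}

/* Hypothetical wrapper around the loop from this thread.
 * Returns 1 if the check-free path ran, 0 if the scalar fallback ran. */
int bar_checked(int start, int end, float *A, const float *B, const float *C) {
    assert(A && B && C);
    if (start >= end)
        return 1; /* empty range: nothing can conflict */
    size_t n = (size_t)((long long)end - start);

    /* Two pointer comparisons: the written range against each read range. */
    if (!ranges_overlap(A + start, B + start, n) &&
        !ranges_overlap(A + start, C + start, n)) {
        /* No aliasing, so iterations are independent and could be widened. */
        for (int i = start; i < end; ++i)
            A[i] = B[i] * C[i];
        return 1;
    }

    /* Possible aliasing: keep the original element-by-element order. */
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
    return 0;
}
```

Calling `bar_checked(0, 4, a, b, c)` with three separate arrays takes the first path; passing `d + 1` as A and `d` as B makes the ranges overlap and takes the fallback.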
> It's not beneficial? I didn't expect that. Can you explain why it's
> not beneficial?
It looks like the vectorizer is not picking up a TTI implementation from a target with vector registers (most likely you're just seeing the basic cost model). Which target is this for?
-Hal
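
One way to read the log lines "The Widest type: 32 bits" and "The Widest register is: 32 bits" is that the cost model sees no register wider than a single float, so the largest vectorization factor worth considering is 1. A small sketch of that arithmetic, assuming the upper bound on VF is simply the register width divided by the element width (the function name `max_vf` is hypothetical and is not an LLVM API):

```c
#include <assert.h>

/* Hypothetical: upper bound on the vectorization factor, assuming it is
 * the number of widest-type elements that fit in the widest register. */
unsigned max_vf(unsigned widest_register_bits, unsigned widest_type_bits) {
    assert(widest_type_bits != 0);
    unsigned vf = widest_register_bits / widest_type_bits;
    return vf ? vf : 1; /* never below the scalar case */
}
```

With the values from the log, `max_vf(32, 32)` gives 1, which matches "Selecting VF = : 1". A target that reports 128-bit vector registers would give `max_vf(128, 32)`, that is 4 floats per vector.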
>
> Frank
>
>
>
> On 26/10/13 13:03, Arnold wrote:
> > Hi Frank,
> >
> >> On Oct 26, 2013, at 10:03 AM, Frank Winter <fwinter at jlab.org> wrote:
> >>
> >> My function implements a simple loop:
> >>
> >> void bar( int start, int end, float* A, float* B, float* C)
> >> {
> >> for (int i=start; i<end;++i)
> >> A[i] = B[i] * C[i];
> >> }
> >>
> >> This looks pretty much like the standard example. However, I built
> >> the function with the IRBuilder, so it does not come from C via
> >> clang. I also slightly changed the function's signature:
> >>
> >> define void @bar([8 x i8]* %arg_ptr) {
> >> entrypoint:
> >> %0 = bitcast [8 x i8]* %arg_ptr to i32*
> >> %1 = load i32* %0
> >> %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
> >> %3 = bitcast [8 x i8]* %2 to i32*
> >> %4 = load i32* %3
> >> %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
> >> %6 = bitcast [8 x i8]* %5 to float**
> >> %7 = load float** %6
> >> %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
> >> %9 = bitcast [8 x i8]* %8 to float**
> >> %10 = load float** %9
> >> %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
> >> %12 = bitcast [8 x i8]* %11 to float**
> >> %13 = load float** %12
> >> br label %L0
> >>
> >> L0: ; preds = %L0, %entrypoint
> >> %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
> >> %15 = getelementptr float* %10, i32 %14
> >> %16 = load float* %15
> >> %17 = getelementptr float* %13, i32 %14
> >> %18 = load float* %17
> >> %19 = fmul float %18, %16
> >> %20 = getelementptr float* %7, i32 %14
> >> store float %19, float* %20
> >> %21 = add i32 %14, 1
> > Try
> > %21 = add nsw i32 %14, 1
> > instead for no-signed wrapping arithmetic.
> >
> > If that does not work, please post the output of
> > opt ... -debug-only=loop-vectorize ...
> >
> >
> >
> >> %22 = icmp sge i32 %21, %4
> >> br i1 %22, label %L1, label %L0
> >>
> >> L1: ; preds = %L0
> >> ret void
> >> }
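
For comparison, the IR above can be read back into C along these lines. This is only a sketch of how the IR appears to behave: the `slot` type and the name `bar_packed` are invented here, and it assumes each argument occupies one 8-byte slot, as the `[8 x i8]` GEP indices 0 to 4 suggest. Note that block L0 is entered unconditionally and tested at the bottom, so the body runs once even when start >= end, unlike the `for` loop in the C version.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 8-byte argument slot, mirroring [8 x i8]. */
typedef struct { unsigned char bytes[8]; } slot;

void bar_packed(const slot *arg_ptr) {
    int start, end;
    float *A, *B, *C;
    assert(arg_ptr != NULL);
    assert(sizeof(float *) <= sizeof(slot));

    /* Slots 0 and 1 hold the i32 bounds, slots 2 to 4 hold the pointers. */
    memcpy(&start, &arg_ptr[0], sizeof start);
    memcpy(&end,   &arg_ptr[1], sizeof end);
    memcpy(&A,     &arg_ptr[2], sizeof A);   /* %7:  written */
    memcpy(&B,     &arg_ptr[3], sizeof B);   /* %10: read    */
    memcpy(&C,     &arg_ptr[4], sizeof C);   /* %13: read    */

    int i = start;                           /* %14, the phi node */
    do {
        A[i] = C[i] * B[i];                  /* fmul float %18, %16 */
        ++i;                                 /* %21 */
    } while (i < end);                       /* IR leaves when %21 >= %4 */
}
```

In C, signed overflow of `i` is undefined, which is what allows a front end to mark the increment `nsw`; a plain `add i32` built by hand carries no such guarantee.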
> >>
> >>
> >> As you can see, I use the phi instruction for the loop index. I
> >> notice that clang prefers stack allocation. I am not sure why the
> >> loop vectorizer is not working here. I tried many things, such as
> >> specifying an architecture with vector units and forcing the
> >> vector width, without success.
> >>
> >> opt -march=x64-64 -loop-vectorize -force-vector-width=8 -S loop.ll
> >>
> >> The only explanation I have is the use of the phi instruction. Is
> >> this preventing the loop from being vectorized?
> >>
> >> Frank
> >>
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory