[LLVMdev] Why is the loop vectorizer not working on my function?

Sat Oct 26 08:03:29 PDT 2013

My function implements a simple loop:

void bar(int start, int end, float *A, float *B, float *C)
{
    for (int i = start; i < end; ++i)
        A[i] = B[i] * C[i];
}

This looks pretty much like the standard example. However, I built the
function with the IRBuilder, so it does not come from C via clang. I also
changed the function's signature slightly, so that all arguments are
passed through a single pointer:

define void @bar([8 x i8]* %arg_ptr) {
entrypoint:
   %0 = bitcast [8 x i8]* %arg_ptr to i32*
   %1 = load i32* %0
   %2 = getelementptr [8 x i8]* %arg_ptr, i32 1
   %3 = bitcast [8 x i8]* %2 to i32*
   %4 = load i32* %3
   %5 = getelementptr [8 x i8]* %arg_ptr, i32 2
   %6 = bitcast [8 x i8]* %5 to float**
   %7 = load float** %6
   %8 = getelementptr [8 x i8]* %arg_ptr, i32 3
   %9 = bitcast [8 x i8]* %8 to float**
   %10 = load float** %9
   %11 = getelementptr [8 x i8]* %arg_ptr, i32 4
   %12 = bitcast [8 x i8]* %11 to float**
   %13 = load float** %12
   br label %L0

L0:                                               ; preds = %L0, %entrypoint
   %14 = phi i32 [ %21, %L0 ], [ %1, %entrypoint ]
   %15 = getelementptr float* %10, i32 %14
   %16 = load float* %15
   %17 = getelementptr float* %13, i32 %14
   %18 = load float* %17
   %19 = fmul float %18, %16
   %20 = getelementptr float* %7, i32 %14
   store float %19, float* %20
   %21 = add i32 %14, 1
   %22 = icmp sge i32 %21, %4
   br i1 %22, label %L1, label %L0

L1:                                               ; preds = %L0
   ret void
}
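
For reference, here is a rough C rendering of what the IR above does. It is only a sketch: the names slot_t and bar_packed are made up for illustration, and it assumes that a pointer fits into one 8-byte slot. Note that, as in the IR, the loop test sits at the bottom, so the body runs once before the bound is checked, unlike the for loop in the C version.

```c
#include <string.h>

/* Illustrative type: one 8-byte argument slot, matching [8 x i8] in the IR. */
typedef struct { unsigned char bytes[8]; } slot_t;

/* Hypothetical C equivalent of @bar: arguments are unpacked from slots 0..4. */
void bar_packed(const slot_t *args)
{
    int start, end;
    float *A, *B, *C;

    /* Assumes sizeof(float *) <= 8, i.e. a pointer fits into one slot. */
    memcpy(&start, &args[0], sizeof start);  /* %1  */
    memcpy(&end,   &args[1], sizeof end);    /* %4  */
    memcpy(&A,     &args[2], sizeof A);      /* %7  */
    memcpy(&B,     &args[3], sizeof B);      /* %10 */
    memcpy(&C,     &args[4], sizeof C);      /* %13 */

    /* Bottom-tested loop, as in block L0: body first, then the compare. */
    int i = start;
    do {
        A[i] = B[i] * C[i];
        ++i;
    } while (i < end);
}
```

The caller is expected to fill the five slots in the order start, end, A, B, C before the call.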

As the IR shows, I use a phi instruction for the loop index, whereas I
notice that clang prefers stack allocation. I am not sure why the loop
vectorizer does not work here.
I tried several things, such as specifying an architecture with vector
units and forcing the vector width, without success:

opt -march=x86-64 -loop-vectorize -force-vector-width=8 -S loop.ll

The only explanation I have is the use of the phi instruction. Is it
preventing the loop from being vectorized?

Frank