[LLVMdev] loop vectorizer: this loop is not worth vectorizing

Thu Oct 31 20:41:52 PDT 2013

In the case when coming from C it was probably the loop unroller and SLP 
vectorizer which vectorized the code. Potentially I could do the same in 
the IR. However, the loop body that is generated in the IR can get very 
large. Thus, the loop unroller will refuse to unroll the loop in a large 
number of (important) cases.

Isn't there a way to convince the loop vectorizer that it should 
vectorize a loop even when its trip count equals the SIMD vector length?

Frank

On 31/10/13 23:27, Frank Winter wrote:
> I am trying a setup where the one loop is rewritten as two loops. This 
> avoids the 'rem' and 'div' instructions in the index calculation 
> (which give the loop vectorizer a hard time).
>
> However, with this setup the loop vectorizer complains about a too 
> small loop.
>
> LV: Checking a loop in "main"
> LV: Found a loop: L3
> LV: Found a loop with a very small trip count. This loop is not worth 
> vectorizing.
> LV: Not vectorizing.
>
> Here the IR:
>
> define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* 
> noalias %arg4, float* noalias %arg5, float* noalias %arg6) {
> entrypoint:
>   br i1 %arg2, label %L0, label %L2
>
> L0:                                               ; preds = %entrypoint
>   %0 = add nsw i64 %arg0, %arg3
>   %1 = add nsw i64 %arg1, %arg3
>   br label %L2
>
> L2:                                               ; preds = 
> %entrypoint, %L0
>   %2 = phi i64 [ %0, %L0 ], [ %arg0, %entrypoint ]
>   %3 = phi i64 [ %1, %L0 ], [ %arg1, %entrypoint ]
>   %4 = sdiv i64 %2, 4
>   %5 = sdiv i64 %3, 4
>   br label %L5
>
> L3:                                               ; preds = %L3, %L5
>   %6 = phi i64 [ %21, %L3 ], [ 0, %L5 ]
>   %7 = add nsw i64 %26, %6
>   %8 = add nsw i64 %27, %6
>   %9 = getelementptr float* %arg5, i64 %7
>   %10 = load float* %9, align 4
>   %11 = getelementptr float* %arg5, i64 %8
>   %12 = load float* %11, align 4
>   %13 = getelementptr float* %arg6, i64 %7
>   %14 = load float* %13, align 4
>   %15 = getelementptr float* %arg6, i64 %8
>   %16 = load float* %15, align 4
>   %17 = fadd float %16, %12
>   %18 = fadd float %14, %10
>   %19 = getelementptr float* %arg4, i64 %7
>   store float %18, float* %19, align 4
>   %20 = getelementptr float* %arg4, i64 %8
>   store float %17, float* %20, align 4
>   %21 = add nsw i64 %6, 1
>   %22 = icmp sgt i64 %6, 2
>   br i1 %22, label %L4, label %L3
>
> L4:                                               ; preds = %L3
>   %23 = add nsw i64 %25, 1
>   %24 = icmp slt i64 %23, %5
>   br i1 %24, label %L5, label %L6
>
> L5:                                               ; preds = %L4, %L2
>   %25 = phi i64 [ %23, %L4 ], [ %4, %L2 ]
>   %26 = shl i64 %25, 3
>   %27 = or i64 %26, 4
>   br label %L3
>
> L6:                                               ; preds = %L4
>   ret void
> }
>
>
> The L3 loop has a trip count of 4. The L5 outer loop has a variable 
> trip count depending on the functions arguments.
>
> I cannot make the L3 loop larger so that the vectorizer might be 
> happy, because this will again introduce 'rem' and 'div' in the index 
> calculation.
>
> I am using these passes:
>
> functionPassManager->add(llvm::createBasicAliasAnalysisPass());
>       functionPassManager->add(llvm::createLICMPass());
>       functionPassManager->add(llvm::createGVNPass());
> functionPassManager->add(llvm::createLoopVectorizePass());
> functionPassManager->add(llvm::createInstructionCombiningPass());
>       functionPassManager->add(llvm::createEarlyCSEPass());
> functionPassManager->add(llvm::createCFGSimplificationPass());
>
> I am wondering, whether there might be pass I could issue before the 
> loop vectorizer that transforms the code so that the vectorizer is 
> happy. I am wondering because coming from a C function which tries to 
> mimic the above IR
>
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  
> c, float * __restrict__ a, float * __restrict__ b)
> {
>   const std::uint64_t inner = 4;
>   for (std::uint64_t i = start/inner ; i < end/inner ; i++ ) {
>     for (std::uint64_t q = 0 ; q < inner ; q++ ) {
>       const std::uint64_t ir0 = ( i * 2 + 0 ) * inner + q;
>       const std::uint64_t ir1 = ( i * 2 + 1 ) * inner + q;
>
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>       c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>     }
>   }
> }
>
> the loop vectorizer complains as well, but the produced code is 
> vectorized:
>
> LV: Checking a loop in "_Z3barmmPfS_S_"
> LV: Found a loop: for.body4
> LV: Found an induction variable.
> LV: Found unvectorizable type.
> LV: Can't vectorize the instructions or CFG
> LV: Not vectorizing.
>
> ; Function Attrs: nounwind uwtable
> define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias %c, 
> float* noalias %a, float* noalias %b) #3 {
> entry:
>   %div = lshr i64 %start, 2
>   %div1 = lshr i64 %end, 2
>   %cmp9 = icmp ult i64 %div, %div1
>   br i1 %cmp9, label %for.body4.preheader, label %for.end20
>
> for.body4.preheader:                              ; preds = %entry
>   br label %for.body4
>
> for.body4:                                        ; preds = 
> %for.body4.preheader, %for.body4
>   %storemerge10 = phi i64 [ %inc19, %for.body4 ], [ %div, 
> %for.body4.preheader ]
>   %mul5 = shl i64 %storemerge10, 3
>   %add82 = or i64 %mul5, 4
>   %arrayidx = getelementptr inbounds float* %a, i64 %mul5
>   %arrayidx11 = getelementptr inbounds float* %b, i64 %mul5
>   %arrayidx13 = getelementptr inbounds float* %c, i64 %mul5
>   %arrayidx14 = getelementptr inbounds float* %a, i64 %add82
>   %arrayidx15 = getelementptr inbounds float* %b, i64 %add82
>   %arrayidx17 = getelementptr inbounds float* %c, i64 %add82
>   %0 = bitcast float* %arrayidx to <4 x float>*
>   %1 = load <4 x float>* %0, align 4
>   %2 = bitcast float* %arrayidx11 to <4 x float>*
>   %3 = load <4 x float>* %2, align 4
>   %4 = fadd <4 x float> %1, %3
>   %5 = bitcast float* %arrayidx13 to <4 x float>*
>   store <4 x float> %4, <4 x float>* %5, align 4
>   %6 = bitcast float* %arrayidx14 to <4 x float>*
>   %7 = load <4 x float>* %6, align 4
>   %8 = bitcast float* %arrayidx15 to <4 x float>*
>   %9 = load <4 x float>* %8, align 4
>   %10 = fadd <4 x float> %7, %9
>   %11 = bitcast float* %arrayidx17 to <4 x float>*
>   store <4 x float> %10, <4 x float>* %11, align 4
>   %inc19 = add i64 %storemerge10, 1
>   %cmp = icmp ult i64 %inc19, %div1
>   br i1 %cmp, label %for.body4, label %for.end20.loopexit
>
> for.end20.loopexit:                               ; preds = %for.body4
>   br label %for.end20
>
> for.end20:                                        ; preds = 
> %for.end20.loopexit, %entry
>   ret void
> }
>
>
> But here the vectorization must have happened before. It's starting to 
> get frustrating.
>
> Frank
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev