[LLVMdev] loop vectorizer

Wed Oct 30 11:15:48 PDT 2013

The debug messages are misleading. They should read “trying to vectorize a list of …”;   The problem is that the SCEV analysis is unable to detect that C[ir0] and C[ir1] are consecutive.  Is this loop from an important benchmark ?

Thanks,
Nadav

On Oct 30, 2013, at 11:13 AM, Frank Winter <fwinter at jlab.org> wrote:

> The SLP vectorizer apparently did something in the prologue of the function (where storing of arguments on the stack happens) which then got eliminated later on (since I don't see any vector instructions in the final IR). Below the debug output of the SLP pass:
> 
> Args: opt -O1 -vectorize-slp -debug loop.ll -S
> SLP: Analyzing blocks in _Z3barmmPfS_S_.
> SLP: Found 2 stores to vectorize.
> SLP: Analyzing a store chain of length 2.
> SLP: Trying to vectorize starting at PHIs (1)
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.
> SLP: Vectorizing a list of length = 2.
> 
> IR produced:
> 
> define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias nocapture %c, float* noalias nocapture readonly %a, float* noalias nocapture readonly %b) #3 {
> entry:
>  %cmp14 = icmp ult i64 %start, %end
>  br i1 %cmp14, label %for.body, label %for.end
> 
> for.body:                                         ; preds = %entry, %for.body
>  %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>  %div = lshr i64 %i.015, 2
>  %mul = shl i64 %div, 3
>  %rem = and i64 %i.015, 3
>  %add2 = or i64 %mul, %rem
>  %add8 = or i64 %add2, 4
>  %arrayidx = getelementptr inbounds float* %a, i64 %add2
>  %0 = load float* %arrayidx, align 4
>  %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>  %1 = load float* %arrayidx9, align 4
>  %add10 = fadd float %0, %1
>  %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>  store float %add10, float* %arrayidx11, align 4
>  %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>  %2 = load float* %arrayidx12, align 4
>  %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>  %3 = load float* %arrayidx13, align 4
>  %add14 = fadd float %2, %3
>  %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>  store float %add14, float* %arrayidx15, align 4
>  %inc = add i64 %i.015, 1
>  %exitcond = icmp eq i64 %inc, %end
>  br i1 %exitcond, label %for.end, label %for.body
> 
> for.end:                                          ; preds = %for.body, %entry
>  ret void
> }
> 
> 
> 
> Does this finally mean the current LLVM cannot vectorize the function?:
> 
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c, float * __restrict__ a, float * __restrict__ b)
> {
>  const std::uint64_t inner = 4;
>  for (std::uint64_t i = start ; i < end ; ++i ) {
>    const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
>    const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
>    c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>    c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];
>  }
> }
> 
> I was trying the following:
> 
> clang++ -emit-llvm -S loop.cc -std=c++11
> (this writes loop.ll)
> 
> opt -O1 -bb-vectorize -debug loop.ll -S
> (I found the -O1 useful to prepare the IR for the loop vectorizer, so I hoped this would work here as well)
> 
> opt -O1 -loop-vectorize -debug-only=loop-vectorize loop.ll -S
> 
> opt -loop-unroll -unroll-allow-partial -debug loop.ll -S
> (this didn't unroll the loop.) Here the output (relevant for loop-unroll)
> Loop Unroll: F[_Z3barmmPfS_S_] Loop %for.cond
> 
> Are there any other combinations of passes that I might try?
> 
> Frank
> 
> 
> 
> On 30/10/13 13:50, Hal Finkel wrote:
>> ----- Original Message -----
>>> 
>>> I ran the BB vectorizer as I guess this is the SLP vectorizer.
>> No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.
>> 
>>  -Hal
>> 
>>> BBV: using target information
>>> BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
>>> BBV: found 2 instructions with candidate pairs
>>> BBV: found 0 pair connections.
>>> BBV: done!
>>> 
>>> However, this was run on the unrolled loop (I guess).
>>> 
>>> Here is the IR printed by 'opt':
>>> 
>>> entry:
>>> %cmp9 = icmp ult i64 %start, %end
>>> br i1 %cmp9, label %for.body, label %for.end
>>> 
>>> for.body: ; preds = %entry, %for.body
>>> %storemerge10 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>>> %div = lshr i64 %storemerge10, 2
>>> %mul1 = shl i64 %div, 3
>>> %rem = and i64 %storemerge10, 3
>>> %add2 = or i64 %mul1, %rem
>>> %0 = lshr i64 %storemerge10, 1
>>> %add51 = shl i64 %0, 2
>>> %mul6 = or i64 %rem, %add51
>>> %add8 = or i64 %mul6, 4
>>> %arrayidx = getelementptr inbounds float* %a, i64 %add2
>>> %1 = load float* %arrayidx, align 4
>>> %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>>> %2 = load float* %arrayidx9, align 4
>>> %add10 = fadd float %1, %2
>>> %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>>> store float %add10, float* %arrayidx11, align 4
>>> %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>>> %3 = load float* %arrayidx12, align 4
>>> %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>>> %4 = load float* %arrayidx13, align 4
>>> %add14 = fadd float %3, %4
>>> %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>>> store float %add14, float* %arrayidx15, align 4
>>> %inc = add i64 %storemerge10, 1
>>> %exitcond = icmp eq i64 %inc, %end
>>> br i1 %exitcond, label %for.end, label %for.body
>>> 
>>> for.end: ; preds = %for.body, %entry
>>> ret void
>>> 
>>> 
>>> Is what you're saying that I should unroll the loop first by a given
>>> factor and then run SLP again? How would I do that say for a factor
>>> of 2?
>>> 
>>> Frank
>>> 
>>> 
>>> 
>>> On 30/10/13 13:28, Renato Golin wrote:
>>> 
>>> 
>>> 
>>> 
>>> On 30 October 2013 09:25, Nadav Rotem < nrotem at apple.com > wrote:
>>> 
>>> 
>>> The access pattern to arrays a and b is non-linear. Unrolled loops
>>> are usually handled by the SLP-vectorizer. Are ir0 and ir1
>>> consecutive for all values for i ?
>>> 
>>> 
>>> Based on his list of values, it seems that the induction stride is
>>> linear within each block of 4 iterations, but it's not a clear
>>> relationship.
>>> 
>>> 
>>> As you say, it should be possible to spot that once the loop is
>>> unrolled, and get the SLP to vectorize if the relationship becomes
>>> clear.
>>> 
>>> 
>>> Maybe I'm wrong, but this looks like a problem of missed
>>> opportunities, not technically hard to implement.
>>> 
>>> 
>>> --renato
>>> 
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>> 
> 
>