[LLVMdev] loop vectorizer
Frank Winter
fwinter at jlab.org
Wed Oct 30 11:13:43 PDT 2013
The SLP vectorizer apparently did something in the prologue of the
function (where storing of arguments on the stack happens) which then
got eliminated later on (since I don't see any vector instructions in
the final IR). Below the debug output of the SLP pass:
Args: opt -O1 -vectorize-slp -debug loop.ll -S
SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP: Analyzing a store chain of length 2.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.
SLP: Vectorizing a list of length = 2.
IR produced:
define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias
nocapture %c, float* noalias nocapture readonly %a, float* noalias
nocapture readonly %b) #3 {
entry:
%cmp14 = icmp ult i64 %start, %end
br i1 %cmp14, label %for.body, label %for.end
for.body: ; preds = %entry,
%for.body
%i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
%div = lshr i64 %i.015, 2
%mul = shl i64 %div, 3
%rem = and i64 %i.015, 3
%add2 = or i64 %mul, %rem
%add8 = or i64 %add2, 4
%arrayidx = getelementptr inbounds float* %a, i64 %add2
%0 = load float* %arrayidx, align 4
%arrayidx9 = getelementptr inbounds float* %b, i64 %add2
%1 = load float* %arrayidx9, align 4
%add10 = fadd float %0, %1
%arrayidx11 = getelementptr inbounds float* %c, i64 %add2
store float %add10, float* %arrayidx11, align 4
%arrayidx12 = getelementptr inbounds float* %a, i64 %add8
%2 = load float* %arrayidx12, align 4
%arrayidx13 = getelementptr inbounds float* %b, i64 %add8
%3 = load float* %arrayidx13, align 4
%add14 = fadd float %2, %3
%arrayidx15 = getelementptr inbounds float* %c, i64 %add8
store float %add14, float* %arrayidx15, align 4
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %end
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body,
%entry
ret void
}
Does this finally mean the current LLVM cannot vectorize the function?:
void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
{
const std::uint64_t inner = 4;
for (std::uint64_t i = start ; i < end ; ++i ) {
const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
}
}
I was trying the following:
clang++ -emit-llvm -S loop.cc -std=c++11
(this writes loop.ll)
opt -O1 -bb-vectorize -debug loop.ll -S
(I found the -O1 useful to prepare the IR for the loop vectorizer, so I
hoped this would work here as well)
opt -O1 -loop-vectorize -debug-only=loop-vectorize loop.ll -S
opt -loop-unroll -unroll-allow-partial -debug loop.ll -S
(this didn't unroll the loop.) Here the output (relevant for loop-unroll)
Loop Unroll: F[_Z3barmmPfS_S_] Loop %for.cond
Are there any other combinations of passes that I might try?
Frank
On 30/10/13 13:50, Hal Finkel wrote:
> ----- Original Message -----
>>
>> I ran the BB vectorizer as I guess this is the SLP vectorizer.
> No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.
>
> -Hal
>
>> BBV: using target information
>> BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
>> BBV: found 2 instructions with candidate pairs
>> BBV: found 0 pair connections.
>> BBV: done!
>>
>> However, this was run on the unrolled loop (I guess).
>>
>> Here is the IR printed by 'opt':
>>
>> entry:
>> %cmp9 = icmp ult i64 %start, %end
>> br i1 %cmp9, label %for.body, label %for.end
>>
>> for.body: ; preds = %entry, %for.body
>> %storemerge10 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
>> %div = lshr i64 %storemerge10, 2
>> %mul1 = shl i64 %div, 3
>> %rem = and i64 %storemerge10, 3
>> %add2 = or i64 %mul1, %rem
>> %0 = lshr i64 %storemerge10, 1
>> %add51 = shl i64 %0, 2
>> %mul6 = or i64 %rem, %add51
>> %add8 = or i64 %mul6, 4
>> %arrayidx = getelementptr inbounds float* %a, i64 %add2
>> %1 = load float* %arrayidx, align 4
>> %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
>> %2 = load float* %arrayidx9, align 4
>> %add10 = fadd float %1, %2
>> %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
>> store float %add10, float* %arrayidx11, align 4
>> %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
>> %3 = load float* %arrayidx12, align 4
>> %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
>> %4 = load float* %arrayidx13, align 4
>> %add14 = fadd float %3, %4
>> %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
>> store float %add14, float* %arrayidx15, align 4
>> %inc = add i64 %storemerge10, 1
>> %exitcond = icmp eq i64 %inc, %end
>> br i1 %exitcond, label %for.end, label %for.body
>>
>> for.end: ; preds = %for.body, %entry
>> ret void
>>
>>
>> Is what you're saying that I should unroll the loop first by a given
>> factor and then run SLP again? How would I do that say for a factor
>> of 2?
>>
>> Frank
>>
>>
>>
>> On 30/10/13 13:28, Renato Golin wrote:
>>
>>
>>
>>
>> On 30 October 2013 09:25, Nadav Rotem < nrotem at apple.com > wrote:
>>
>>
>> The access pattern to arrays a and b is non-linear. Unrolled loops
>> are usually handled by the SLP-vectorizer. Are ir0 and ir1
>> consecutive for all values for i ?
>>
>>
>> Based on his list of values, it seems that the induction stride is
>> linear within each block of 4 iterations, but it's not a clear
>> relationship.
>>
>>
>> As you say, it should be possible to spot that once the loop is
>> unrolled, and get the SLP to vectorize if the relationship becomes
>> clear.
>>
>>
>> Maybe I'm wrong, but this looks like a problem of missed
>> opportunities, not technically hard to implement.
>>
>>
>> --renato
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
More information about the llvm-dev
mailing list