[LLVMdev] loop vectorizer misses opportunity, exploit
Nadav Rotem
nrotem at apple.com
Thu Oct 31 09:50:53 PDT 2013
Hi Frank,
This loop should be vectorized by the SLP-vectorizer. It has several scalars (C[0], C[1] … ) that can be merged into a vector. The SLP vectorizer can’t figure out that the stores are consecutive because SCEV can’t analyze the OR in the index calculation:
%2 = and i64 %i.04, 3
%3 = lshr i64 %i.04, 2
%4 = shl i64 %3, 3
%5 = or i64 %4, %2
%11 = getelementptr inbounds float* %c, i64 %5
store float %10, float* %11, align 4, !tbaa !0
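To make the index calculation above concrete, here is a small C++ sketch (not from the original thread; the function names are hypothetical) that writes the index both ways: once with the shift/mask/OR form seen in the IR, and once with the modulo/divide/add form from the source. The two should agree because the shifted term has its low three bits clear while the masked term only occupies the low two bits, so the OR behaves like an ADD. That is presumably why the optimizer was free to emit an OR here.

```cpp
#include <cassert>
#include <cstdint>

// Index as it appears in the IR: ((i >> 2) << 3) | (i & 3).
constexpr std::uint64_t index_with_or(std::uint64_t i) {
    return ((i >> 2) << 3) | (i & 3);
}

// Index as written in the C++ source: i % 4 + 8 * (i / 4).
constexpr std::uint64_t index_with_add(std::uint64_t i) {
    return i % 4 + 8 * (i / 4);
}

// Checks, for every i below `limit`, that the two OR operands share no set
// bits and that both formulations produce the same index.
inline bool or_matches_add(std::uint64_t limit) {
    for (std::uint64_t i = 0; i < limit; ++i) {
        assert((((i >> 2) << 3) & (i & 3)) == 0);  // operands are bit-disjoint
        if (index_with_or(i) != index_with_add(i)) return false;
    }
    return true;
}
```

An analysis that knows the operands are bit-disjoint could treat the OR as an ADD and then see the four stores as consecutive.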
You wrote that you want each iteration to look like this:
>
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
Why can’t you just write a small inner loop like this: for (i=0; i<4; i++) C[i] = A[i] + B[i]? Either the unroller will unroll it and the SLP-vectorizer will vectorize the unrolled iterations, or the loop-vectorizer will catch it.
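As a rough sketch of that suggestion applied to the original function (the name bar_nested and the variable base are hypothetical, and it assumes start is a multiple of 4, which the posted access pattern implies but the original code does not enforce):

```cpp
#include <cstdint>

// Hypothetical rewrite of bar(). Assumes start is a multiple of 4, so that
// (i + k) % 4 == k and (i + k) / 4 == i / 4 for k in 0..3.
void bar_nested(std::uint64_t start, std::uint64_t end,
                float* __restrict__ c, const float* __restrict__ a,
                const float* __restrict__ b) {
    for (std::uint64_t i = start; i < end; i += 4) {
        const std::uint64_t base = 8 * (i / 4);  // first index of this group
        // Four consecutive elements: base, base + 1, base + 2, base + 3.
        for (std::uint64_t k = 0; k < 4; ++k) {
            c[base + k] = a[base + k] + b[base + k];
        }
    }
}
```

With the modulo folded away, the inner loop addresses base + k directly, so no OR appears in the index. Whether a given compiler version actually vectorizes it still needs to be checked in the generated IR.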
Thanks,
Nadav
On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:
> A quite small but yet complete example function which all vectorization passes fail to optimize:
>
> #include <cstdint>
> #include <iostream>
>
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
> {
>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>     {
>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>     }
>   }
> }
>
> The loop index and array accesses for the first 4 iterations:
>
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
>
> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
>
> With current trunk I tried the following on the above example:
>
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
>
> All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.
>
> How can the SCEV AA pass be extended to handle this type of arithmetic?
> Frank
>
> On 31/10/13 02:21, Renato Golin wrote:
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>> const std::uint64_t ir0 = (i+0)%4; // not working
>>
>> I thought this would be the case when I saw the original expression. Maybe we need to teach modulo arithmetic to SCEV?
>>
>> --renato
>
>