[LLVMdev] loop vectorizer misses opportunity, exploit

Thu Oct 31 09:50:53 PDT 2013

Hi Frank, 

This loop should be vectorized by the SLP-vectorizer. It has several scalars (C[0], C[1], ...) that can be merged into a vector. The SLP-vectorizer can’t figure out that the stores are consecutive because SCEV can’t analyze the OR in the index calculation:

  %2 = and i64 %i.04, 3
  %3 = lshr i64 %i.04, 2
  %4 = shl i64 %3, 3
  %5 = or i64 %4, %2
  %11 = getelementptr inbounds float* %c, i64 %5
  store float %10, float* %11, align 4, !tbaa !0
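
As a small illustration of why the OR gets in the way: the two operands of the `or` never share a set bit (`%4` has its low three bits cleared, `%2` only uses the low two), so the OR computes the same value as an ADD, which is the form the source code used. The sketch below checks that property in plain C++. The function names are made up for this example and are not part of LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Index as written in the C++ source: modulo and division.
constexpr std::uint64_t index_arith(std::uint64_t i) {
  return i % 4 + 8 * (i / 4);
}

// Index as it appears in the IR above: and, lshr, shl, or.
constexpr std::uint64_t index_bitwise(std::uint64_t i) {
  const std::uint64_t low = i & 3;            // %2 = and i64 %i.04, 3
  const std::uint64_t block = (i >> 2) << 3;  // %3 = lshr ..., %4 = shl ...
  return block | low;                         // %5 = or i64 %4, %2
}

// Same as index_bitwise, but with the OR replaced by an ADD.
constexpr std::uint64_t index_with_add(std::uint64_t i) {
  return ((i >> 2) << 3) + (i & 3);
}

// Asserts that all three forms agree for every i in [0, limit).
void check_forms_agree(std::uint64_t limit) {
  for (std::uint64_t i = 0; i < limit; ++i) {
    assert(index_arith(i) == index_bitwise(i));
    assert(index_bitwise(i) == index_with_add(i));
  }
}
```

Calling `check_forms_agree(1024)` passes for every index in that range. An analysis that recognized the disjoint bits could presumably treat the OR as an ADD, though how best to do that inside SCEV is a separate question.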

You wrote that you want each iteration to look like this:

> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27

Why can’t you just write a small loop like this: for (i=0; i<4; i++) C[i] = A[i] + B[i]? Either the unroller will unroll it and the SLP-vectorizer will vectorize the unrolled iterations, or the loop-vectorizer will catch it.
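
A minimal sketch of what that rewrite could look like for the function quoted below, assuming `start` is a multiple of 4 (otherwise the original index expression does not yield four consecutive elements per iteration). The name `bar_blocked` and the `base` variable are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rewrite of bar(): the block base is computed once per outer
// iteration and a small inner loop covers the four consecutive elements.
void bar_blocked(std::uint64_t start, std::uint64_t end,
                 float* __restrict__ c,
                 const float* __restrict__ a,
                 const float* __restrict__ b) {
  assert(start % 4 == 0);  // assumption of this sketch, see above
  for (std::uint64_t i = start; i < end; i += 4) {
    const std::uint64_t base = 8 * (i / 4);  // 0, 8, 16, 24, ...
    for (std::uint64_t j = 0; j < 4; ++j) {
      c[base + j] = a[base + j] + b[base + j];
    }
  }
}
```

With `start = 0` and `end = 16` this touches elements 0-3, 8-11, 16-19 and 24-27, matching the access pattern above. Whether the unroller and the vectorizers actually pick it up would need to be checked against the compiler version in use.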

Thanks,
Nadav

On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org> wrote:

> A small but complete example function that all vectorization passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__  c, float * __restrict__ a, float * __restrict__ b)
> {
>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>     {
>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>     {
>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>     }
>   }
> } 
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0:     0 1 2 3 
> i 4:     8 9 10 11 
> i 8:     16 17 18 19 
> i 12:     24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of arithmetic?
> Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
>>       const std::uint64_t ir0 = (i+0)%4;  // not working
>> 
>> I thought this would be the case when I saw the original expression. Maybe we need to teach modulo arithmetic to SCEV?
>> 
>> --renato
> 
> 
