[LLVMdev] loop vectorizer misses opportunity, exploit

Thu Oct 31 10:24:52 PDT 2013

Hi Nadav,

that's the whole point of it. I can't in general make the index 
calculation simpler. The example given is the simplest non-trivial index 
function that is needed. It might well be that it's that simple that the 
index calculation in this case can be thrown aways altogether and - as 
you say - be replaced by the simple loop you mentioned. However, this 
cannot be done in the general case. In the general case the index 
calculation requires the 'rem' and 'div' instruction. The OR instruction 
must be the result of an arithmetic transformation of one of the 
previous passes (due to -O3?).

I don't see a way around to make the vectorizers recognize such 
arithmetic operations.

Do you think this can be done relatively quickly or does this involve a 
huge effort?

What does 'stepping through' the loop vectorizer mean? Using the 
debugger and step through the program? Probably not. Is the way to go, 
to alter the 'canVectorize' function print debug output to see what's 
going on?

Hal, you seem to know the loop vectorizer. Is this the place to look at 
or is the SLP vectorizer the more promising place?

Frank

On 31/10/13 12:50, Nadav Rotem wrote:
> Hi Frank,
>
> This loop should be vectorized by the SLP-vectorizer.  It has several 
> scalars (C[0], C[1] … ) that can be merged into a vector.  The SLP 
> vectorizer can’t figure out that the stores are consecutive because 
> SCEV can’t analyze the OR in the index calculation:
>   %2 = and i64 %i.04, 3
>   %3 = lshr i64 %i.04, 2
>   %4 = shl i64 %3, 3
>   %5 = or i64 %4, %2
>   %11 = getelementptr inbounds float* %c, i64 %5
>   store float %10, float* %11, align 4, !tbaa !0
>
> You wrote that you want each iteration to look like this:
>
>>
>> i 0:     0 1 2 3
>> i 4:     8 9 10 11
>> i 8:     16 17 18 19
>> i 12:     24 25 26 27
>
> Why can’t you just write a small loop like this:   for (i=0; i<4; i++) 
> C[i] = A[i] + B[i]  ??    Either the unroller will unroll it and the 
> SLP-vectorizer will vectorize the unrolled iterations, or the 
> loop-vectorizer would catch it.
>
> Thanks,
> Nadav
>
> On Oct 31, 2013, at 8:01 AM, Frank Winter <fwinter at jlab.org 
> <mailto:fwinter at jlab.org>> wrote:
>
>> A quite small but yet complete example function which all 
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * 
>> __restrict__  c, float * __restrict__ a, float * __restrict__ b)
>> {
>>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>>     {
>>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>>       c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];
>>     }
>>   }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0:     0 1 2 3
>> i 4:     8 9 10 11
>> i 8:     16 17 18 19
>> i 12:     24 25 26 27
>>
>> For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of arithmetic?
>> Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>> On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org 
>>> <mailto:fwinter at jlab.org>> wrote:
>>>
>>>           const std::uint64_t ir0 = (i+0)%4; // not working
>>>
>>>
>>> I thought this would be the case when I saw the original expression. 
>>> Maybe we need to teach module arithmetic to SCEV?
>>>
>>> --renato
>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131031/cea7d8a9/attachment.html>