[LLVMdev] loop vectorizer misses opportunity, exploit

Thu Oct 31 10:38:15 PDT 2013

On 31/10/13 13:34, Hal Finkel wrote:
> ----- Original Message -----
>> Hi Nadav,
>>
>> that's the whole point of it. I can't in general make the index
>> calculation simpler. The example given is the simplest non-trivial
>> index function that is needed. It might well be that it's that
>> simple that the index calculation in this case can be thrown aways
>> altogether and - as you say - be replaced by the simple loop you
>> mentioned. However, this cannot be done in the general case. In the
>> general case the index calculation requires the 'rem' and 'div'
>> instruction. The OR instruction must be the result of an arithmetic
>> transformation of one of the previous passes (due to -O3?).
>>
>> I don't see a way around to make the vectorizers recognize such
>> arithmetic operations.
>>
>> Do you think this can be done relatively quickly or does this involve
>> a huge effort?
>>
>> What does 'stepping through' the loop vectorizer mean? Using the
>> debugger and step through the program? Probably not. Is the way to
>> go, to alter the 'canVectorize' function print debug output to see
>> what's going on?
>>
>> Hal, you seem to know the loop vectorizer.
> Heh ;) Nadav wrote it.
I know. But you said that it doesn't use the SCEV AA information but 
uses its interal SCEV. Being able to say this requires knowledge of the 
loop vectorizer. That's why I said it.
>
> I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations.
>
>   -Hal
>
>> Is this the place to look
>> at or is the SLP vectorizer the more promising place?
>>
>> Frank
>>
>>
>> On 31/10/13 12:50, Nadav Rotem wrote:
>>
>>
>>
>> Hi Frank,
>>
>>
>> This loop should be vectorized by the SLP-vectorizer. It has several
>> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
>> vectorizer can’t figure out that the stores are consecutive because
>> SCEV can’t analyze the OR in the index calculation:
>>
>> %2 = and i64 %i.04, 3
>> %3 = lshr i64 %i.04, 2
>> %4 = shl i64 %3, 3
>> %5 = or i64 %4, %2
>> %11 = getelementptr inbounds float* %c, i64 %5
>> store float %10, float* %11, align 4, !tbaa !0
>>
>>
>> You wrote that you want each iteration to look like this:
>>
>>
>>
>>
>>
>>
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>>
>>
>> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
>> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
>> SLP-vectorizer will vectorize the unrolled iterations, or the
>> loop-vectorizer would catch it.
>>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote:
>>
>>
>>
>>
>> A quite small but yet complete example function which all
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
>> c, float * __restrict__ a, float * __restrict__ b)
>> {
>> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>> {
>> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> {
>> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>> }
>> }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>> For example on an x86 processor with SSE (128 bit SIMD vectors) the
>> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
>> SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA
>> pass doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of
>> arithmetic? Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>
>>
>> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
>>
>>
>>
>>
>>
>>
>> const std::uint64_t ir0 = (i+0)%4; // not working
>>
>>
>>
>> I thought this would be the case when I saw the original expression.
>> Maybe we need to teach module arithmetic to SCEV?
>>
>> --renato
>>
>>
>>
>>
>>