[LLVMdev] loop vectorizer misses opportunity, exploit

Frank Winter fwinter at jlab.org
Thu Oct 31 10:49:44 PDT 2013


Here's a real example of IR (as opposed to the C example provided
earlier) which includes the kind of index calculation that makes both
the loop vectorizer and the SLP vectorizer (after unrolling) fail:


define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* 
noalias %arg4, float* noalias %arg5, float* noalias %arg6) {
entrypoint:
   br i1 %arg2, label %L0, label %L1

L0:                                               ; preds = %entrypoint
   %0 = add nsw i64 %arg0, %arg3
   %1 = add nsw i64 %arg1, %arg3
   br label %L2

L1:                                               ; preds = %entrypoint
   br label %L2

L2:                                               ; preds = %L0, %L1
   %2 = phi i64 [ %arg0, %L1 ], [ %0, %L0 ]
   %3 = phi i64 [ %arg1, %L1 ], [ %1, %L0 ]
   br label %L3

L3:                                               ; preds = %L3, %L2
   %4 = phi i64 [ %25, %L3 ], [ %2, %L2 ]
   %5 = sdiv i64 %4, 4
   %6 = srem i64 %4, 4
   %7 = mul i64 %5, 2
   %8 = mul i64 %7, 4
   %9 = add nsw i64 %8, %6
   %10 = add nsw i64 %7, 1
   %11 = mul i64 %10, 4
   %12 = add nsw i64 %11, %6
   %13 = getelementptr float* %arg5, i64 %9
   %14 = load float* %13
   %15 = getelementptr float* %arg5, i64 %12
   %16 = load float* %15
   %17 = getelementptr float* %arg6, i64 %9
   %18 = load float* %17
   %19 = getelementptr float* %arg6, i64 %12
   %20 = load float* %19
   %21 = fadd float %20, %16
   %22 = fadd float %18, %14
   %23 = getelementptr float* %arg4, i64 %9
   store float %22, float* %23
   %24 = getelementptr float* %arg4, i64 %12
   store float %21, float* %24
   %25 = add nsw i64 %4, 1
   %26 = icmp sge i64 %25, %3
   br i1 %26, label %L4, label %L3

L4:                                               ; preds = %L3
   ret void
}

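For reference, the body of basic block L3 roughly corresponds to the
following C++ (a sketch based on my reading of the IR; the names q, r,
ir0 and ir1 are mine, and I am assuming %arg4 = c, %arg5 = a, %arg6 = b):

#include <cstdint>

// Scalar equivalent of basic block L3: two indexed accesses per iteration.
void l3_body(std::int64_t i, std::int64_t end, float * __restrict__ c,
             float * __restrict__ a, float * __restrict__ b)
{
    do {
        const std::int64_t q   = i / 4;          // %5
        const std::int64_t r   = i % 4;          // %6
        const std::int64_t ir0 = 8 * q + r;      // %9
        const std::int64_t ir1 = 8 * q + 4 + r;  // %12
        c[ir0] = a[ir0] + b[ir0];
        c[ir1] = a[ir1] + b[ir1];
        ++i;
    } while (i < end);
}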

Right now the loop vectorizer reports:

LV: Checking a loop in "main"
LV: Found a loop: L3
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer   %23 = getelementptr float* %arg4, i64 %9 SCEV: ((4 * ((8 * %5) + %6)) + %arg4)
LV: Bad stride - Not an AddRecExpr pointer   %24 = getelementptr float* %arg4, i64 %12 SCEV: (16 + (4 * %6) + (32 * %5) + %arg4)
LV: Src Scev: ((4 * ((8 * %5) + %6)) + %arg4)Sink Scev: (16 + (4 * %6) + (32 * %5) + %arg4)(Induction step: 0)
LV: Distance for   store float %22, float* %23 to   store float %21, float* %24: 16
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.
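
To see why SCEV finds no AddRecExpr here, a throwaway snippet (mine, not
from the original code) that prints the two indices the loop body uses
per iteration:

#include <cstdint>
#include <iostream>

int main()
{
    for (std::int64_t i = 0; i < 8; ++i) {
        const std::int64_t ir0 = 8 * (i / 4) + i % 4;      // index %9
        const std::int64_t ir1 = 8 * (i / 4) + 4 + i % 4;  // index %12
        std::cout << "i=" << i << ": ir0=" << ir0 << " ir1=" << ir1 << "\n";
    }
    // ir0 runs 0,1,2,3,8,9,10,11 and ir1 runs 4,5,6,7,12,13,14,15 for
    // i = 0..7: there is no constant per-iteration stride, so SCEV cannot
    // express the pointers as add-recurrences, even though each group of
    // 4 iterations covers two contiguous blocks of 4 floats.
    return 0;
}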

However, 4 consecutive iterations could be combined (for x86 SSE, i.e. a
SIMD width of 4 floats): each group of 4 iterations, with %4 entering as
a multiple of 4, accesses two contiguous blocks of 4 floats. (With a
SIMD width of 8 floats, 8 iterations could be combined; in that case the
literal constant in "%8 = mul i64 %7, 4" would be 8, of course.)

Even after unrolling this loop by hand (factor 4), the SLP vectorizer
fails to vectorize it. (A sketch of the hand-combined form follows
below.)
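
For illustration, a minimal hand-vectorized sketch (mine, untested) of
what combining 4 consecutive iterations with SSE intrinsics could look
like, assuming i enters as a multiple of 4 and the trip count is a
multiple of 4; unaligned loads/stores are used to sidestep alignment
questions:

#include <cstdint>
#include <xmmintrin.h>

void l3_body_sse(std::int64_t i, std::int64_t end,
                 float * __restrict__ c, float * __restrict__ a,
                 float * __restrict__ b)
{
    // Each group of 4 scalar iterations touches elements [base, base+8).
    for (; i < end; i += 4) {
        const std::int64_t base = 8 * (i / 4);
        const __m128 lo = _mm_add_ps(_mm_loadu_ps(&a[base]),
                                     _mm_loadu_ps(&b[base]));
        const __m128 hi = _mm_add_ps(_mm_loadu_ps(&a[base + 4]),
                                     _mm_loadu_ps(&b[base + 4]));
        _mm_storeu_ps(&c[base],     lo);
        _mm_storeu_ps(&c[base + 4], hi);
    }
}

That is 4 SIMD loads, 2 SIMD adds and 2 SIMD stores per 4 scalar
iterations, which is roughly what I would hope the vectorizers to
produce.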

Frank


On 31/10/13 13:34, Hal Finkel wrote:
> ----- Original Message -----
>> Hi Nadav,
>>
>> that's the whole point of it. I can't in general make the index
>> calculation simpler. The example given is the simplest non-trivial
>> index function that is needed. It might well be that this example is
>> simple enough that the index calculation can be thrown away
>> altogether and - as you say - be replaced by the simple loop you
>> mentioned. However, this cannot be done in the general case. In the
>> general case the index calculation requires the 'rem' and 'div'
>> instructions. The OR instruction must be the result of an arithmetic
>> transformation by one of the previous passes (due to -O3?).
>>
>> I don't see a way around making the vectorizers recognize such
>> arithmetic operations.
>>
>> Do you think this can be done relatively quickly or does this involve
>> a huge effort?
>>
>> What does 'stepping through' the loop vectorizer mean? Using the
>> debugger and stepping through the program? Probably not. Is the way
>> to go to alter the 'canVectorize' function to print debug output and
>> see what's going on?
>>
>> Hal, you seem to know the loop vectorizer.
> Heh ;) Nadav wrote it.
>
> I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations.
>
>   -Hal
>
>> Is this the place to look
>> at or is the SLP vectorizer the more promising place?
>>
>> Frank
>>
>>
>> On 31/10/13 12:50, Nadav Rotem wrote:
>>
>> Hi Frank,
>>
>>
>> This loop should be vectorized by the SLP-vectorizer. It has several
>> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
>> vectorizer can’t figure out that the stores are consecutive because
>> SCEV can’t analyze the OR in the index calculation:
>>
>> %2 = and i64 %i.04, 3
>> %3 = lshr i64 %i.04, 2
>> %4 = shl i64 %3, 3
>> %5 = or i64 %4, %2
>> %11 = getelementptr inbounds float* %c, i64 %5
>> store float %10, float* %11, align 4, !tbaa !0
>>
>>
>> You wrote that you want each iteration to look like this:
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
>> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
>> SLP-vectorizer will vectorize the unrolled iterations, or the
>> loop-vectorizer would catch it.
>>
>>
>> Thanks,
>> Nadav
>>
>>
>> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote:
>>
>> A quite small yet complete example function which all the
>> vectorization passes fail to optimize:
>>
>> #include <cstdint>
>> #include <iostream>
>>
>> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c,
>>          float * __restrict__ a, float * __restrict__ b)
>> {
>>   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
>>     {
>>       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>     }
>>     {
>>       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
>>       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
>>     }
>>   }
>> }
>>
>> The loop index and array accesses for the first 4 iterations:
>>
>> i 0: 0 1 2 3
>> i 4: 8 9 10 11
>> i 8: 16 17 18 19
>> i 12: 24 25 26 27
>>
>> For example, on an x86 processor with SSE (128-bit SIMD vectors), the
>> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
>> SIMD store.
>>
>> With current trunk I tried the following on the above example:
>>
>> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
>> opt -O3 -vectorize-slp -S loop_minimal.ll
>> opt -O3 -loop-vectorize -S loop_minimal.ll
>> opt -O3 -bb-vectorize -S loop_minimal.ll
>>
>> All optimization passes miss the opportunity. It seems the SCEV AA
>> pass doesn't understand modulo arithmetic.
>>
>> How can the SCEV AA pass be extended to handle this type of
>> arithmetic?
>>
>> Frank
>>
>> On 31/10/13 02:21, Renato Golin wrote:
>>
>>
>> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
>>
>> const std::uint64_t ir0 = (i+0)%4; // not working
>>
>> I thought this would be the case when I saw the original expression.
>> Maybe we need to teach modulo arithmetic to SCEV?
>>
>> --renato