[LLVMdev] loop vectorizer misses opportunity, exploit

Hal Finkel hfinkel at anl.gov
Thu Oct 31 10:34:34 PDT 2013


----- Original Message -----
> 
> Hi Nadav,
> 
> that's the whole point of it. I can't in general make the index
> calculation simpler. The example given is the simplest non-trivial
> index function that is needed. It might well be that it's that
> simple that the index calculation in this case can be thrown aways
> altogether and - as you say - be replaced by the simple loop you
> mentioned. However, this cannot be done in the general case. In the
> general case the index calculation requires the 'rem' and 'div'
> instruction. The OR instruction must be the result of an arithmetic
> transformation of one of the previous passes (due to -O3?).
> 
> I don't see a way around to make the vectorizers recognize such
> arithmetic operations.
> 
> Do you think this can be done relatively quickly or does this involve
> a huge effort?
> 
> What does 'stepping through' the loop vectorizer mean? Using the
> debugger and step through the program? Probably not. Is the way to
> go, to alter the 'canVectorize' function print debug output to see
> what's going on?
> 
> Hal, you seem to know the loop vectorizer.

Heh ;) Nadav wrote it.

I've CC'd Andy, who is our SCEV expert; hopefully he can comment on how difficult it would be to make SCEV understand the indexing calculations.

 -Hal

> Is this the place to look
> at or is the SLP vectorizer the more promising place?
> 
> Frank
> 
> 
> On 31/10/13 12:50, Nadav Rotem wrote:
> 
> 
> 
> Hi Frank,
> 
> 
> This loop should be vectorized by the SLP-vectorizer. It has several
> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
> vectorizer can’t figure out that the stores are consecutive because
> SCEV can’t analyze the OR in the index calculation:
> 
> %2 = and i64 %i.04, 3
> %3 = lshr i64 %i.04, 2
> %4 = shl i64 %3, 3
> %5 = or i64 %4, %2
> %11 = getelementptr inbounds float* %c, i64 %5
> store float %10, float* %11, align 4, !tbaa !0
> 
> 
> You wrote that you want each iteration to look like this:
> 
> 
> 
> 
> 
> 
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> 
> 
> Why can’t you just write a small loop like this: for (i=0; i<4; i++)
> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
> SLP-vectorizer will vectorize the unrolled iterations, or the
> loop-vectorizer would catch it.
> 
> 
> Thanks,
> Nadav
> 
> 
> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org > wrote:
> 
> 
> 
> 
> A quite small but yet complete example function which all
> vectorization passes fail to optimize:
> 
> #include <cstdint>
> #include <iostream>
> 
> void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
> c, float * __restrict__ a, float * __restrict__ b)
> {
> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
> {
> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> {
> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> }
> }
> }
> 
> The loop index and array accesses for the first 4 iterations:
> 
> i 0: 0 1 2 3
> i 4: 8 9 10 11
> i 8: 16 17 18 19
> i 12: 24 25 26 27
> 
> For example on an x86 processor with SSE (128 bit SIMD vectors) the
> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
> SIMD store.
> 
> With current trunk I tried the following on the above example:
> 
> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> opt -O3 -vectorize-slp -S loop_minimal.ll
> opt -O3 -loop-vectorize -S loop_minimal.ll
> opt -O3 -bb-vectorize -S loop_minimal.ll
> 
> All optimization passes miss the opportunity. It seems the SCEV AA
> pass doesn't understand modulo arithmetic.
> 
> How can the SCEV AA pass be extended to handle this type of
> arithmetic? Frank
> 
> On 31/10/13 02:21, Renato Golin wrote:
> 
> 
> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
> 
> 
> 
> 
> 
> 
> const std::uint64_t ir0 = (i+0)%4; // not working
> 
> 
> 
> I thought this would be the case when I saw the original expression.
> Maybe we need to teach module arithmetic to SCEV?
> 
> --renato
> 
> 
> 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory




More information about the llvm-dev mailing list