[LLVMdev] loop vectorizer misses opportunity, exploit

Thu Oct 31 11:39:59 PDT 2013

----- Original Message -----
> On 31/10/13 13:34, Hal Finkel wrote:
> > ----- Original Message -----
> >> Hi Nadav,
> >>
> >> that's the whole point of it. I can't in general make the index
> >> calculation simpler. The example given is the simplest non-trivial
> >> index function that is needed. It might well be that it's that
> >> simple that the index calculation in this case can be thrown aways
> >> altogether and - as you say - be replaced by the simple loop you
> >> mentioned. However, this cannot be done in the general case. In
> >> the
> >> general case the index calculation requires the 'rem' and 'div'
> >> instruction. The OR instruction must be the result of an
> >> arithmetic
> >> transformation of one of the previous passes (due to -O3?).
> >>
> >> I don't see a way around to make the vectorizers recognize such
> >> arithmetic operations.
> >>
> >> Do you think this can be done relatively quickly or does this
> >> involve
> >> a huge effort?
> >>
> >> What does 'stepping through' the loop vectorizer mean? Using the
> >> debugger and step through the program? Probably not. Is the way to
> >> go, to alter the 'canVectorize' function print debug output to see
> >> what's going on?
> >>
> >> Hal, you seem to know the loop vectorizer.
> > Heh ;) Nadav wrote it.
> I know. But you said that it doesn't use the SCEV AA information but
> uses its interal SCEV. Being able to say this requires knowledge of
> the
> loop vectorizer. That's why I said it.

Fair enough; I'm not sure how well I know it, but I've looked over the code a few times. The memory dependence code is in MemoryDepChecker::isDependent in LoopVectorize.cpp.

Also, Sebastian Pop and I have been ping-ponging on working on delinearization analysis. Sebastian's latest code (http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20131021/192120.html) has some SCEV-based GCD and division code, and that could be useful for this as well.

 -Hal

> >
> > I've CC'd Andy, who is our SCEV expert; hopefully he can comment on
> > how difficult it would be to make SCEV understand the indexing
> > calculations.
> >
> >   -Hal
> >
> >> Is this the place to look
> >> at or is the SLP vectorizer the more promising place?
> >>
> >> Frank
> >>
> >>
> >> On 31/10/13 12:50, Nadav Rotem wrote:
> >>
> >>
> >>
> >> Hi Frank,
> >>
> >>
> >> This loop should be vectorized by the SLP-vectorizer. It has
> >> several
> >> scalars (C[0], C[1] … ) that can be merged into a vector. The SLP
> >> vectorizer can’t figure out that the stores are consecutive
> >> because
> >> SCEV can’t analyze the OR in the index calculation:
> >>
> >> %2 = and i64 %i.04, 3
> >> %3 = lshr i64 %i.04, 2
> >> %4 = shl i64 %3, 3
> >> %5 = or i64 %4, %2
> >> %11 = getelementptr inbounds float* %c, i64 %5
> >> store float %10, float* %11, align 4, !tbaa !0
> >>
> >>
> >> You wrote that you want each iteration to look like this:
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> i 0: 0 1 2 3
> >> i 4: 8 9 10 11
> >> i 8: 16 17 18 19
> >> i 12: 24 25 26 27
> >>
> >>
> >>
> >> Why can’t you just write a small loop like this: for (i=0; i<4;
> >> i++)
> >> C[i] = A[i] + B[i] ?? Either the unroller will unroll it and the
> >> SLP-vectorizer will vectorize the unrolled iterations, or the
> >> loop-vectorizer would catch it.
> >>
> >>
> >> Thanks,
> >> Nadav
> >>
> >>
> >> On Oct 31, 2013, at 8:01 AM, Frank Winter < fwinter at jlab.org >
> >> wrote:
> >>
> >>
> >>
> >>
> >> A quite small but yet complete example function which all
> >> vectorization passes fail to optimize:
> >>
> >> #include <cstdint>
> >> #include <iostream>
> >>
> >> void bar(std::uint64_t start, std::uint64_t end, float *
> >> __restrict__
> >> c, float * __restrict__ a, float * __restrict__ b)
> >> {
> >> for ( std::uint64_t i = start ; i < end ; i += 4 ) {
> >> {
> >> const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> {
> >> const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
> >> c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
> >> }
> >> }
> >> }
> >>
> >> The loop index and array accesses for the first 4 iterations:
> >>
> >> i 0: 0 1 2 3
> >> i 4: 8 9 10 11
> >> i 8: 16 17 18 19
> >> i 12: 24 25 26 27
> >>
> >> For example on an x86 processor with SSE (128 bit SIMD vectors)
> >> the
> >> loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1
> >> SIMD store.
> >>
> >> With current trunk I tried the following on the above example:
> >>
> >> clang++ -emit-llvm -S loop_minimal.cc -std=c++11
> >> opt -O3 -vectorize-slp -S loop_minimal.ll
> >> opt -O3 -loop-vectorize -S loop_minimal.ll
> >> opt -O3 -bb-vectorize -S loop_minimal.ll
> >>
> >> All optimization passes miss the opportunity. It seems the SCEV AA
> >> pass doesn't understand modulo arithmetic.
> >>
> >> How can the SCEV AA pass be extended to handle this type of
> >> arithmetic? Frank
> >>
> >> On 31/10/13 02:21, Renato Golin wrote:
> >>
> >>
> >> On 30 October 2013 18:40, Frank Winter < fwinter at jlab.org > wrote:
> >>
> >>
> >>
> >>
> >>
> >>
> >> const std::uint64_t ir0 = (i+0)%4; // not working
> >>
> >>
> >>
> >> I thought this would be the case when I saw the original
> >> expression.
> >> Maybe we need to teach module arithmetic to SCEV?
> >>
> >> --renato
> >>
> >>
> >>
> >>
> >>
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory