[llvm-commits] [llvm] r171798 - in /llvm/trunk: lib/Transforms/Vectorize/LoopVectorize.cpp test/Transforms/LoopVectorize/X86/unroll-small-loops.ll

Shuxin Yang shuxin.llvm at gmail.com
Mon Jan 7 21:21:48 PST 2013


IMHO, it is not always possible to statically determine whether it is 
beneficial to vectorize a loop with a small (tiny?) trip count. Here are 
two examples:

  e.g1:  suppose the HW has 16-byte SIMD support.

     double a[];
     for (i = 0; i < 3; i++)
         a[i] = ....

     We have two ways to vectorize this loop:
    vect1:
      a[0:1] = ...
      a[2]   = ...

    vect2:
      a[0]   = ...
      a[1:2] = ...

    Unless we know the alignment of the array <a> with respect to the 
16-byte boundary, we are not able to determine which one works better. 
If we unfortunately pick the one with the unaligned access, the 
performance may be worse than the un-vectorized version.
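
    To make the alignment point concrete, here is a rough sketch of the 
two choices using SSE2 intrinsics (the function names are made up, and I 
am assuming a 16-byte vector of two doubles; which of the two stores can 
be turned into an aligned one depends entirely on what we know about &a[0]):

    #include <emmintrin.h>

    /* vect1: vector store to a[0:1], scalar store to a[2].  The vector
       store hits a 16-byte-aligned address only when &a[0] itself is
       16-byte aligned. */
    void vect1(double *a, __m128d v01, double s2) {
        _mm_storeu_pd(&a[0], v01);   /* must stay an unaligned store unless
                                        we can prove &a[0] % 16 == 0 */
        a[2] = s2;
    }

    /* vect2: scalar store to a[0], vector store to a[1:2].  Here the
       vector store is aligned only when &a[0] % 16 == 8, i.e. exactly
       the opposite situation. */
    void vect2(double *a, double s0, __m128d v12) {
        a[0] = s0;
        _mm_storeu_pd(&a[1], v12);
    }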

  e.g2.
     for (i = 0; i < very-small-num; i++)  {
        a[i] = ..
               = a[i-1]
      }

     If it is vectorized, we have
       for (...) {
          a[i:i+1] =
                      = a[i-1:i]
       }
       [ remainder scalar loop]

      In the vectorized version, the load and the store cannot be 
scalar-replaced; therefore, each memory unit needs to be accessed twice, 
including one access which is bound to be unaligned.

      In contrast, in the un-vectorized version, "a[i]" and "a[i-1]" can 
be scalar-replaced; therefore each memory unit is accessed only once.
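
      To spell out what I mean by scalar replacement, here is a rough 
sketch of the un-vectorized loop after the transformation (g() is just a 
placeholder for whatever the body computes from a[i-1], and I start at 
i = 1 only to keep the first read explicit):

    /* placeholder for the real loop body */
    static double g(double x) { return x * 0.5 + 1.0; }

    void scalar_version(double *a, int n) {
        double prev = a[0];              /* the only load from <a> */
        for (int i = 1; i < n; i++) {
            double cur = g(prev);        /* "... = a[i-1]" reads a register */
            a[i] = cur;
            prev = cur;                  /* scalar replacement of a[i-1] */
        }
    }

      In the vectorized form, a[i-1:i] overlaps the a[i:i+1] stored by the 
previous iteration, so the load cannot be kept in a register across 
iterations; the loop re-loads memory it has just stored.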

     It is very difficult to tell if SIMD wins. It depends on the 
neighboring code, the humidity, the outdoor temperature, etc.

    In my humble experience with another compiler, if I set the trip-count 
threshold below 4, the performance starts to fluctuate slightly. But I 
think the threshold "trip-count=16" is a bit conservative.

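    For reference, the kind of guard we are talking about boils down to 
something like the sketch below; the constant name and the value 4 are my 
assumptions for illustration, not what r171798 actually uses.

    /* Refuse to widen loops whose static trip count is known and tiny.
       Name and value are hypothetical. */
    enum { TinyTripCountThreshold = 4 };

    int should_widen(unsigned static_trip_count) {
        /* treat 0 as "trip count unknown at compile time" and leave
           that case to the normal cost model */
        if (static_trip_count == 0)
            return 1;
        return static_trip_count >= TinyTripCountThreshold;
    }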

On 01/07/2013 05:02 PM, Chris Lattner wrote:
> On Jan 7, 2013, at 1:54 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Author: nadav
>> Date: Mon Jan  7 15:54:51 2013
>> New Revision: 171798
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=171798&view=rev
>> Log:
>> LoopVectorizer: When we vectorize and widen loops we process many elements at once. This is a good thing, except for
>> small loops. On small loops the post-loop that handles scalars (and runs slower) can take more time to execute than the
>> rest of the loop. This patch disables widening of loops with a small static trip count.
> Isn't it still (extremely) valuable to vectorize loops that are a multiple of the vectorization threshold?  Turning a loop that adds 4 element arrays into a single SIMD add is a pretty nice win and requires no cleanup loop.
>
> -Chris
>
>



