[llvm-commits] [llvm] r171798 - in /llvm/trunk: lib/Transforms/Vectorize/LoopVectorize.cpp test/Transforms/LoopVectorize/X86/unroll-small-loops.ll

Mon Jan 7 21:29:33 PST 2013

On Jan 7, 2013, at 9:21 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
> IMHO, it is not always possible to statically determine if it's beneficial to vectorize a
> loop with a small (tiny?) trip count. Here are two examples:

Here's another way of saying the same thing: if we don't need a scalar cleanup loop (e.g. because the vectorization factor of the loop is known to evenly divide the constant trip count), isn't it always beneficial to vectorize, even if the new trip count is low?
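
A minimal sketch of that rule, assuming a constant trip count and a fixed vectorization factor. The names `VectorizeDecision`, `decideForConstantTripCount`, and `smallTripThreshold` are hypothetical and are not taken from LoopVectorize.cpp; this illustrates the heuristic under discussion, not the patch itself.

```cpp
#include <cstdint>

// Hypothetical result type: whether to vectorize, and how the iterations
// would be split between the vector body and the scalar cleanup loop.
struct VectorizeDecision {
  bool vectorize;
  std::uint64_t vectorIterations;
  std::uint64_t scalarRemainder;
};

// Hypothetical heuristic (not LLVM's API): always vectorize when the
// vectorization factor evenly divides the trip count, because no scalar
// cleanup loop is needed; otherwise fall back to a small-trip-count threshold.
VectorizeDecision decideForConstantTripCount(std::uint64_t tripCount,
                                             std::uint64_t vf,
                                             std::uint64_t smallTripThreshold) {
  if (vf == 0)
    return {false, 0, tripCount};

  const std::uint64_t vecIters = tripCount / vf;
  const std::uint64_t rem = tripCount % vf;

  // Fewer iterations than one vector: nothing to widen.
  if (vecIters == 0)
    return {false, 0, tripCount};

  // No remainder: the whole loop runs in the vector body.
  if (rem == 0)
    return {true, vecIters, 0};

  // A cleanup loop is required; only accept it for large enough loops.
  if (tripCount >= smallTripThreshold)
    return {true, vecIters, rem};
  return {false, 0, tripCount};
}
```

With a factor of 4 and a threshold of 16, a trip count of 4 is vectorized into a single vector iteration, while a trip count of 6 stays scalar because its two-iteration cleanup loop falls under the threshold.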

-Chris

> 
> e.g1 :  suppose HW has 16-byte SIMD support.
> 
>    double a[];
>    for (i = 0; i < 3; i++)
>        a[i] = ....
> 
>   We have 2 ways to vectorize this loop:
>  vect1:
>    a[0:1] = ...
>    a[2] = ...
> 
>  vect2:
>    a[0] = ...
>    a[1:2] = ...
> 
>  Unless we know the alignment of the array <a> wrt the 16-byte boundary, we are not
> able to determine which one works better. If we unfortunately pick the
> one with the unaligned access, the performance may be worse than the
> un-vectorized version.
> 
> e.g2.
>    for (i = 0; i < very-small-num; i++)  {
>       a[i] = ..
>              = a[i-1]
>     }
> 
>    If it is vectorized, we have
>      for (...) {
>         a[i:i+1] =
>                     = a[i-1:i]
>      }
>      [ remainder scalar loop]
> 
>     In the vectorized version, the load and store cannot be scalar-replaced.
> Therefore, each memory unit needs to be accessed twice, including one access
> which is bound to be unaligned.
> 
>     In contrast, in the un-vectorized version, "a[i]" and "a[i-1]" can be scalar-replaced,
> therefore each memory unit is accessed only once.
> 
>    It is very difficult to tell if SIMD wins. It depends on the neighboring code, the
> humidity, the outdoor temperature etc etc etc.
> 
>   In my humble experience with another compiler, if I set the trip-count threshold
> below 4, the performance starts to fluctuate slightly. But I think a threshold of
> "trip-count=16" is a bit conservative.
> 
> 
> On 01/07/2013 05:02 PM, Chris Lattner wrote:
>> On Jan 7, 2013, at 1:54 PM, Nadav Rotem <nrotem at apple.com> wrote:
>> 
>>> Author: nadav
>>> Date: Mon Jan  7 15:54:51 2013
>>> New Revision: 171798
>>> 
>>> URL: http://llvm.org/viewvc/llvm-project?rev=171798&view=rev
>>> Log:
>>> LoopVectorizer: When we vectorize and widen loops we process many elements at once. This is a good thing, except for
>>> small loops. On small loops the post-loop that handles scalars (and runs slower) can take more time to execute than the
>>> rest of the loop. This patch disables widening of loops with a small static trip count.
>> Isn't it still (extremely) valuable to vectorize loops whose trip count is a multiple of the vectorization factor?  Turning a loop that adds 4-element arrays into a single SIMD add is a pretty nice win and requires no cleanup loop.
>> 
>> -Chris
>> 
>> 
>
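
For readers unfamiliar with the term, the scalar replacement mentioned in the second quoted example can be sketched roughly as below. This is an illustration only: the function names, the stored expression, and the `out` array are hypothetical stand-ins for the elided "..." in the quoted loop, and `out` is assumed to be the same size as `a`.

```cpp
#include <cstddef>
#include <vector>

// Baseline: a[i] is stored, and a[i-1] is reloaded from memory on the
// next iteration, so each element is touched twice.
void withReload(std::vector<double>& a, std::vector<double>& out) {
  for (std::size_t i = 1; i < a.size(); ++i) {
    a[i] = static_cast<double>(i) * 0.5;  // stand-in for "a[i] = .."
    out[i] = a[i - 1] + 1.0;              // stand-in for ".. = a[i-1]"
  }
}

// Scalar-replaced: the value stored to a[i] is kept in a local variable
// and reused as "a[i-1]" in the next iteration, so each element of a is
// accessed only once inside the loop.
void scalarReplaced(std::vector<double>& a, std::vector<double>& out) {
  if (a.empty())
    return;
  double prev = a[0];  // the only load from a
  for (std::size_t i = 1; i < a.size(); ++i) {
    const double cur = static_cast<double>(i) * 0.5;
    a[i] = cur;
    out[i] = prev + 1.0;
    prev = cur;  // carry the stored value forward in a register
  }
}
```

Both functions compute the same results. The vectorized form in the quoted example cannot carry `prev` this way, because the a[i-1:i] load overlaps the a[i:i+1] store by one element and is shifted by one lane.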