[llvm-commits] [llvm] r171798 - in /llvm/trunk: lib/Transforms/Vectorize/LoopVectorize.cpp test/Transforms/LoopVectorize/X86/unroll-small-loops.ll

Mon Jan 7 21:38:24 PST 2013

Not always beneficial in terms of performance, but certainly a win in
code size.

I can modify my two examples so that the trip count is exactly the
vector length (2).
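
For reference, the "no cleanup loop" condition being discussed can be sketched in C as below. This is only an illustration and not code from LoopVectorize.cpp; the names loop_split, split_trip_count and needs_cleanup_loop are made up for this sketch.

```c
#include <assert.h>

/* Hypothetical helper, not LLVM code: split a constant trip count into
   full vector iterations and leftover scalar iterations for a given
   vectorization factor (vf). */
typedef struct {
    unsigned vector_iters;  /* iterations of the widened loop */
    unsigned scalar_iters;  /* iterations left for the scalar cleanup loop */
} loop_split;

static loop_split split_trip_count(unsigned trip_count, unsigned vf) {
    loop_split s;
    assert(vf > 0);  /* a vectorization factor of 0 makes no sense */
    s.vector_iters = trip_count / vf;
    s.scalar_iters = trip_count % vf;
    return s;
}

/* Nonzero when some iterations do not fit into full vectors. */
static int needs_cleanup_loop(unsigned trip_count, unsigned vf) {
    return split_trip_count(trip_count, vf).scalar_iters != 0;
}
```

With trip count 2 and vf 2 this gives one vector iteration and no scalar iterations; with trip count 3 and vf 2 it gives one vector iteration plus one scalar iteration.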

eg1.
    for (i = 0; i < 2; i++) a[i] = ...

  The vectorized version may suffer from an unaligned load;
the scalar version does not have this problem.
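
A minimal sketch of the alignment question, assuming the 16-byte SIMD width from the quoted message below. The helper name is_aligned_16 is made up for illustration, and the pointer-to-integer cast behaves as expected on mainstream flat-address targets but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary.
   The 16-byte width is an assumption taken from the quoted example. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}
```

For an array of doubles whose first element is 16-byte aligned, &a[0] and &a[2] pass this check and &a[1] does not, so a 2-element vector access starting at a[1] would be unaligned. When the alignment of a is unknown at compile time, the compiler cannot tell which case it is in.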

eg2.

     for (i = 0; i < 2; i++)  {
       a[i] = ..
              = a[i-1]
     }

   In the vectorized version, each memory unit has to be loaded exactly once and
written exactly once.

   In contrast, in the scalar version, a[i-1] will be replaced with a register.
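
A rough C sketch of what scalar replacement does to eg2. The names are made up for illustration: produce() stands in for the elided right-hand side "..", and the use[] array stands in for whatever consumes "= a[i-1]". It assumes a[-1] is valid and that a and use do not overlap.

```c
#include <assert.h>

/* Hypothetical stand-in for the elided right-hand side of "a[i] = ..". */
static double produce(int i) { return 10.0 * (i + 1); }

/* Loop as written: a[i-1] is loaded from memory on every iteration. */
static void loop_from_memory(double *a, double *use, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = produce(i);
        use[i] = a[i - 1];
    }
}

/* After scalar replacement: a[i-1] is carried in a local (a register),
   so the loop body only stores to a and never loads from it. */
static void loop_scalar_replaced(double *a, double *use, int n) {
    double prev = a[-1];          /* single load before the loop */
    for (int i = 0; i < n; i++) {
        double cur = produce(i);
        a[i] = cur;               /* each element is stored once */
        use[i] = prev;            /* "= a[i-1]" served from the register */
        prev = cur;
    }
}
```

Both functions produce the same results under the stated assumptions; the second form is what the scalar loop can become, while the widened loop has to go through memory for a[i-1:i].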

On 01/07/2013 09:29 PM, Chris Lattner wrote:
> On Jan 7, 2013, at 9:21 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>> IMHO, it is not always possible to statically determine if it's beneficial to vectorize a
>> loop with small(tiny?) trip count. Here are two examples:
> Here's another way of trying to say the same thing: if we don't need a scalar cleanup loop (e.g. because the vectorization factor of a loop is known to subdivide the constant tripcount), isn't it always beneficial to do the vectorization, even if the new tripcount is low?
>
> -Chris
>
>> e.g1 :  suppose HW has 16-byte SIMD support.
>>
>>     double a[];
>>     for (i = 0; i < 3; i++)
>>         a[i] = ....
>>
>>    We have 2 ways to vectorize this loop:
>>   vect1:
>>     a[0:1] = ...
>>     a[2] = ...
>>
>>   vect2:
>>     a[0] = ...
>>     a[1:2] = ...
>>
>>   Unless we know the alignment of the array <a> w.r.t. the 16-byte boundary, we are not
>> able to determine which one works better. If we unfortunately pick the
>> one with unaligned access, the performance may be worse than the
>> un-vectorized version.
>>
>> e.g2.
>>     for (i = 0; i < very-small-num; i++)  {
>>        a[i] = ..
>>               = a[i-1]
>>      }
>>
>>     If it is vectorized, we have
>>       for (...) {
>>          a[i:i+1] =
>>                      = a[i-1:i]
>>       }
>>       [ remainder scalar loop]
>>
>>      In the vectorized version, the load and store cannot be scalar-replaced;
>> therefore, each memory unit needs to be accessed twice, including one access
>> which is bound to be unaligned.
>>
>>      In contrast, in the un-vectorized version, "a[i]" and "a[i-1]" can be scalar replaced,
>> therefore each memory unit is accessed only once.
>>
>>     It is very difficult to tell if SIMD wins. It depends on the neighboring code, the
>> humidity, the outdoor temperature etc etc etc.
>>
>>    In my humble experience with another compiler, if I set the trip-count threshold
>> below 4, the performance starts to fluctuate slightly. But I think a threshold of
>> "trip-count=16" is a bit conservative.
>>
>>
>> On 01/07/2013 05:02 PM, Chris Lattner wrote:
>>> On Jan 7, 2013, at 1:54 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>
>>>> Author: nadav
>>>> Date: Mon Jan  7 15:54:51 2013
>>>> New Revision: 171798
>>>>
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=171798&view=rev
>>>> Log:
>>>> LoopVectorizer: When we vectorize and widen loops we process many elements at once. This is a good thing, except for
>>>> small loops. On small loops the post-loop that handles scalars (and runs slower) can take more time to execute than the
>>>> rest of the loop. This patch disables widening of loops with a small static trip count.
>>> Isn't it still (extremely) valuable to vectorize loops that are a multiple of the vectorization threshold?  Turning a loop that adds 4 element arrays into a single SIMD add is a pretty nice win and requires no cleanup loop.
>>>
>>> -Chris
>>>
>>>