[llvm] r200213 - [vectorizer] Teach the loop vectorizer's unroller to only unroll by

Mon Jan 27 03:12:24 PST 2014

Author: chandlerc
Date: Mon Jan 27 05:12:24 2014
New Revision: 200213

URL: http://llvm.org/viewvc/llvm-project?rev=200213&view=rev
Log:
[vectorizer] Teach the loop vectorizer's unroller to only unroll by
powers of two. This is essentially always the correct thing given the
impact on alignment, scaling factors that can be used in addressing
modes, etc. Also, fix the management of the unroll vs. small loop cost
to more accurately model things with this world.

Enhance a test case to actually exercise more of the unroll machinery if
using synthetic constants rather than a specific target model. Before
this change, with the added flags this test will unroll 3 times instead
of either 2 or 4 (the two sensible answers).

While I don't expect this to make a huge difference, if there are lots
of loops sitting right on the edge of hitting the 'small unroll' factor,
they might change behavior. However, I've benchmarked moving the small
loop cost up and down in many various ways and by a huge factor (2x)
without seeing more than 0.2% code size growth. Small adjustments such
as the series that led up here have led to about 1% improvement on some
benchmarks, but it is very close to the noise floor so I mostly checked
that nothing regressed. Let me know if you see bad behavior on other
targets but I don't expect this to be a sufficiently dramatic change to
trigger anything.

Modified:
    llvm/trunk/include/llvm/Support/MathExtras.h
    llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
    llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll

Modified: llvm/trunk/include/llvm/Support/MathExtras.h
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Support/MathExtras.h?rev=200213&r1=200212&r2=200213&view=diff
==============================================================================

--- llvm/trunk/include/llvm/Support/MathExtras.h (original)
+++ llvm/trunk/include/llvm/Support/MathExtras.h Mon Jan 27 05:12:24 2014
@@ -552,6 +552,13 @@ inline uint64_t NextPowerOf2(uint64_t A)
   return A + 1;
 }
 
+/// Returns the power of two which is less than or equal to the given value.
+/// Essentially, it is a floor operation across the domain of powers of two.
+inline uint64_t PowerOf2Floor(uint64_t A) {
+  if (!A) return 0;
+  return 1ull << (63 - countLeadingZeros(A, ZB_Undefined));
+}
+
 /// Returns the next integer (mod 2**64) that is greater than or equal to
 /// \p Value and is a multiple of \p Align. \p Align must be non-zero.
 ///

Modified: llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp?rev=200213&r1=200212&r2=200213&view=diff
==============================================================================
--- llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp (original)
+++ llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp Mon Jan 27 05:12:24 2014
@@ -5004,8 +5004,11 @@ LoopVectorizationCostModel::selectUnroll
   // registers. These registers are used by all of the unrolled instances.
   // Next, divide the remaining registers by the number of registers that is
   // required by the loop, in order to estimate how many parallel instances
-  // fit without causing spills.
-  unsigned UF = (TargetNumRegisters - R.LoopInvariantRegs) / R.MaxLocalUsers;
+  // fit without causing spills. All of this is rounded down if necessary to be
+  // a power of two. We want power of two unroll factors to simplify any
+  // addressing operations or alignment considerations.
+  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
+                              R.MaxLocalUsers);
 
   // Clamp the unroll factor ranges to reasonable factors.
   unsigned MaxUnrollSize = TTI.getMaximumUnrollFactor();
@@ -5045,7 +5048,7 @@ LoopVectorizationCostModel::selectUnroll
   DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n');
   if (LoopCost < SmallLoopCost) {
     DEBUG(dbgs() << "LV: Unrolling to reduce branch cost.\n");
-    unsigned NewUF = SmallLoopCost / (LoopCost + 1);
+    unsigned NewUF = PowerOf2Floor(SmallLoopCost / LoopCost);
     return std::min(NewUF, UF);
   }
 

Modified: llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll?rev=200213&r1=200212&r2=200213&view=diff
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll (original)
+++ llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll Mon Jan 27 05:12:24 2014
@@ -1,4 +1,4 @@
-; RUN: opt < %s  -loop-vectorize -force-vector-width=1 -force-vector-unroll=2 -dce -instcombine -S | FileCheck %s
+; RUN: opt < %s  -loop-vectorize -force-vector-width=1 -force-target-num-scalar-regs=16 -force-target-max-scalar-unroll=8 -small-loop-cost=20 -dce -instcombine -S | FileCheck %s
 
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 target triple = "x86_64-apple-macosx10.8.0"
@@ -12,10 +12,20 @@ target triple = "x86_64-apple-macosx10.8
 ;CHECK-LABEL: @inc(
 ;CHECK: load i32*
 ;CHECK: load i32*
+;CHECK: load i32*
+;CHECK: load i32*
+;CHECK-NOT: load i32*
+;CHECK: add nsw i32
 ;CHECK: add nsw i32
 ;CHECK: add nsw i32
+;CHECK: add nsw i32
+;CHECK-NOT: add nsw i32
+;CHECK: store i32
+;CHECK: store i32
 ;CHECK: store i32
 ;CHECK: store i32
+;CHECK-NOT: store i32
+;CHECK: add i64 %{{.*}}, 4
 ;CHECK: ret void
 define void @inc(i32 %n) nounwind uwtable noinline ssp {
   %1 = icmp sgt i32 %n, 0