[llvm] r200213 - [vectorizer] Teach the loop vectorizer's unroller to only unroll by

Mon Jan 27 06:36:02 PST 2014

----- Original Message -----
> From: "Chandler Carruth" <chandlerc at gmail.com>
> To: llvm-commits at cs.uiuc.edu
> Sent: Monday, January 27, 2014 5:12:24 AM
> Subject: [llvm] r200213 - [vectorizer] Teach the loop vectorizer's unroller to only unroll by
> 
> Author: chandlerc
> Date: Mon Jan 27 05:12:24 2014
> New Revision: 200213
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=200213&view=rev
> Log:
> [vectorizer] Teach the loop vectorizer's unroller to only unroll by
> powers of two. This is essentially always the correct thing given the
> impact on alignment, scaling factors that can be used in addressing
> modes, etc.

Chandler, please add a TTI callback to control this. On the PPC A2, it really is a good thing, sometimes, to unroll by 3 or 5. PPC does not have scaled addressing modes, and the important thing there is instruction latency hiding.
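
To make the request concrete, here is a rough sketch of what such a switch could look like. Everything in it is hypothetical: the PreferPowerOf2 flag stands in for whatever a new TTI callback would return, powerOf2FloorSketch is a portable stand-in for the PowerOf2Floor added in this commit, and the register counts in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the PowerOf2Floor helper from the commit:
// largest power of two <= A, or 0 when A is 0.
inline uint64_t powerOf2FloorSketch(uint64_t A) {
  uint64_t P = 0;
  for (uint64_t Bit = 1; Bit != 0 && Bit <= A; Bit <<= 1)
    P = Bit;
  return P;
}

// Hypothetical: PreferPowerOf2 models the answer of a target callback that
// does not exist in TTI as of this commit. A target without scaled
// addressing modes could return false to keep factors such as 3 or 5.
inline unsigned pickUnrollFactor(bool PreferPowerOf2, unsigned NumRegs,
                                 unsigned LoopInvariantRegs,
                                 unsigned MaxLocalUsers) {
  assert(MaxLocalUsers > 0 && "caller is assumed to clamp this to >= 1");
  assert(NumRegs >= LoopInvariantRegs && "assumed: invariants fit in registers");
  unsigned UF = (NumRegs - LoopInvariantRegs) / MaxLocalUsers;
  if (PreferPowerOf2)
    UF = static_cast<unsigned>(powerOf2FloorSketch(UF));
  return UF;
}
```

With 32 registers, 2 loop-invariant values and 10 local users (illustrative numbers only), the raw estimate is 3; the power-of-two path rounds it down to 2, while a target that opts out keeps 3.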

 -Hal

> Also, fix the management of the unroll vs. small loop cost to more
> accurately model things with this world.
> 
> Enhance a test case to actually exercise more of the unroll machinery if
> using synthetic constants rather than a specific target model. Before
> this change, with the added flags this test will unroll 3 times instead
> of either 2 or 4 (the two sensible answers).
> 
> While I don't expect this to make a huge difference, if there are lots
> of loops sitting right on the edge of hitting the 'small unroll' factor,
> they might change behavior. However, I've benchmarked moving the small
> loop cost up and down in many various ways and by a huge factor (2x)
> without seeing more than 0.2% code size growth. Small adjustments such
> as the series that led up here have led to about 1% improvement on some
> benchmarks, but it is very close to the noise floor so I mostly checked
> that nothing regressed. Let me know if you see bad behavior on other
> targets but I don't expect this to be a sufficiently dramatic change to
> trigger anything.
> 
> Modified:
>     llvm/trunk/include/llvm/Support/MathExtras.h
>     llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
>     llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll
> 
> Modified: llvm/trunk/include/llvm/Support/MathExtras.h
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Support/MathExtras.h?rev=200213&r1=200212&r2=200213&view=diff
> ==============================================================================
> --- llvm/trunk/include/llvm/Support/MathExtras.h (original)
> +++ llvm/trunk/include/llvm/Support/MathExtras.h Mon Jan 27 05:12:24 2014
> @@ -552,6 +552,13 @@ inline uint64_t NextPowerOf2(uint64_t A)
>    return A + 1;
>  }
>  
> +/// Returns the power of two which is less than or equal to the given value.
> +/// Essentially, it is a floor operation across the domain of powers of two.
> +inline uint64_t PowerOf2Floor(uint64_t A) {
> +  if (!A) return 0;
> +  return 1ull << (63 - countLeadingZeros(A, ZB_Undefined));
> +}
> +
>  /// Returns the next integer (mod 2**64) that is greater than or equal to
>  /// \p Value and is a multiple of \p Align. \p Align must be non-zero.
>  ///
> 
> Modified: llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp?rev=200213&r1=200212&r2=200213&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp (original)
> +++ llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp Mon Jan 27 05:12:24 2014
> @@ -5004,8 +5004,11 @@ LoopVectorizationCostModel::selectUnroll
>    // registers. These registers are used by all of the unrolled instances.
>    // Next, divide the remaining registers by the number of registers that is
>    // required by the loop, in order to estimate how many parallel instances
> -  // fit without causing spills.
> -  unsigned UF = (TargetNumRegisters - R.LoopInvariantRegs) / R.MaxLocalUsers;
> +  // fit without causing spills. All of this is rounded down if necessary to be
> +  // a power of two. We want power of two unroll factors to simplify any
> +  // addressing operations or alignment considerations.
> +  unsigned UF = PowerOf2Floor((TargetNumRegisters - R.LoopInvariantRegs) /
> +                              R.MaxLocalUsers);
>  
>    // Clamp the unroll factor ranges to reasonable factors.
>    unsigned MaxUnrollSize = TTI.getMaximumUnrollFactor();
> @@ -5045,7 +5048,7 @@ LoopVectorizationCostModel::selectUnroll
>    DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n');
>    if (LoopCost < SmallLoopCost) {
>      DEBUG(dbgs() << "LV: Unrolling to reduce branch cost.\n");
> -    unsigned NewUF = SmallLoopCost / (LoopCost + 1);
> +    unsigned NewUF = PowerOf2Floor(SmallLoopCost / LoopCost);
>      return std::min(NewUF, UF);
>    }
>  
> 
> Modified: llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll?rev=200213&r1=200212&r2=200213&view=diff
> ==============================================================================
> --- llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll (original)
> +++ llvm/trunk/test/Transforms/LoopVectorize/unroll_novec.ll Mon Jan 27 05:12:24 2014
> @@ -1,4 +1,4 @@
> -; RUN: opt < %s  -loop-vectorize -force-vector-width=1 -force-vector-unroll=2 -dce -instcombine -S | FileCheck %s
> +; RUN: opt < %s  -loop-vectorize -force-vector-width=1 -force-target-num-scalar-regs=16 -force-target-max-scalar-unroll=8 -small-loop-cost=20 -dce -instcombine -S | FileCheck %s
>  
>  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
>  target triple = "x86_64-apple-macosx10.8.0"
> @@ -12,10 +12,20 @@ target triple = "x86_64-apple-macosx10.8
>  ;CHECK-LABEL: @inc(
>  ;CHECK: load i32*
>  ;CHECK: load i32*
> +;CHECK: load i32*
> +;CHECK: load i32*
> +;CHECK-NOT: load i32*
> +;CHECK: add nsw i32
>  ;CHECK: add nsw i32
>  ;CHECK: add nsw i32
> +;CHECK: add nsw i32
> +;CHECK-NOT: add nsw i32
> +;CHECK: store i32
> +;CHECK: store i32
>  ;CHECK: store i32
>  ;CHECK: store i32
> +;CHECK-NOT: store i32
> +;CHECK: add i64 %{{.*}}, 4
>  ;CHECK: ret void
>  define void @inc(i32 %n) nounwind uwtable noinline ssp {
>    %1 = icmp sgt i32 %n, 0
> 
> 
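
For reference, the small-loop hunk above changes two things at once: the divisor goes from LoopCost + 1 to LoopCost, and the quotient is floored to a power of two. The sketch below puts the old and new formulas side by side. It is only an illustration: powerOf2FloorSketch is a portable stand-in for the committed helper, the function names are made up, and the LoopCost of 5 used in the usage note is a hypothetical value, not the cost the test actually produces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Portable stand-in for PowerOf2Floor: largest power of two <= A, 0 for 0.
inline uint64_t powerOf2FloorSketch(uint64_t A) {
  uint64_t P = 0;
  for (uint64_t Bit = 1; Bit != 0 && Bit <= A; Bit <<= 1)
    P = Bit;
  return P;
}

// Formula on the '-' side of the hunk (before r200213).
inline unsigned smallLoopUFOld(unsigned SmallLoopCost, unsigned LoopCost,
                               unsigned UF) {
  unsigned NewUF = SmallLoopCost / (LoopCost + 1);
  return std::min(NewUF, UF);
}

// Formula on the '+' side of the hunk (after r200213).
// Assumption in this sketch: LoopCost is non-zero, since it is now the divisor.
inline unsigned smallLoopUFNew(unsigned SmallLoopCost, unsigned LoopCost,
                               unsigned UF) {
  assert(LoopCost > 0 && "sketch assumes a non-zero loop cost");
  unsigned NewUF =
      static_cast<unsigned>(powerOf2FloorSketch(SmallLoopCost / LoopCost));
  return std::min(NewUF, UF);
}
```

With -small-loop-cost=20, a register-based UF of 8 and a hypothetical LoopCost of 5, the old formula gives 20 / 6 = 3 and the new one gives PowerOf2Floor(20 / 5) = 4, which is the kind of 3-versus-4 difference the log message describes.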

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory