[llvm] r200219 - [vectorize] Initial version of respecting PGO in the vectorizer: treat cold loops as-if they were being optimized for size.

Mon Jan 27 13:06:34 PST 2014

Hi Chandler,

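This change makes the unroller/vectorizer more conservative when used with static BFI. Using the static heuristic, we will not unroll (after handling conditional stores) the hottest loop in libquantum, “quantum_toffoli” (50% or so). In the block frequencies below, the preheader for.body.lr.ph gets 0.08789, which is under the 20% cold threshold relative to entry (1.0), so the loop is treated as optimize-for-size even though for.body itself is at 2.8125. The loop source: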

  http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00082

Best,
Arnold

---- Block Freqs ----
 entry = 1.0
  entry -> if.else = 0.375
  entry -> if.then = 0.625
 if.then = 0.625
  if.then -> if.end22 = 0.625
 if.else = 0.375
  if.else -> for.cond.preheader = 0.1406
  if.else -> if.end22 = 0.23437
 for.cond.preheader = 0.1406
  for.cond.preheader -> for.body.lr.ph = 0.08789
  for.cond.preheader -> for.end = 0.05273
 for.body.lr.ph = 0.08789                   ### Preheader in question
  for.body.lr.ph -> for.body = 0.08789
 for.body = 2.8125                          ### Loop in question
  for.body -> if.then13 = 1.40625
  for.body -> for.inc = 1.40625
 if.then13 = 1.40625
  if.then13 -> for.inc = 1.40625
 for.inc = 2.8125
  for.inc -> for.body = 2.7246
  for.inc -> for.end.loopexit = 0.08789
 for.end.loopexit = 0.08789
  for.end.loopexit -> for.end = 0.08789
 for.end = 0.1406
  for.end -> if.end22 = 0.1406
 if.end22 = 1.0

On Jan 27, 2014, at 5:11 AM, Chandler Carruth <chandlerc at gmail.com> wrote:

> Author: chandlerc
> Date: Mon Jan 27 07:11:50 2014
> New Revision: 200219
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=200219&view=rev
> Log:
> [vectorize] Initial version of respecting PGO in the vectorizer: treat
> cold loops as-if they were being optimized for size.
> 
> Nothing fancy here. Simple test case included. The nice thing is that we
> can now incrementally build on top of this to drive other heuristics.
> All of the infrastructure work is done to get the profile information
> into this layer.
> 
> The remaining work necessary to make this a fully general purpose loop
> unroller for very hot loops is to make it a fully general purpose loop
> unroller. Things I know of but am not going to have time to benchmark
> and fix in the immediate future:
> 
> 1) Don't disable the entire pass when the target is lacking vector
>   registers. This really doesn't make any sense any more.
> 2) Teach the unroller at least and the vectorizer potentially to handle
>   non-if-converted loops. This is trivial for the unroller but hard for
>   the vectorizer.
> 3) Compute the relative hotness of the loop and thread that down to the
>   various places that make cost tradeoffs (very likely only the
>   unroller makes sense here, and then only when dealing with loops that
>   are small enough for unrolling to not completely blow out the LSD).
> 
> I'm still dubious how useful hotness information will be. So far, my
> experiments show that if we can get the correct logic for determining
> when unrolling actually helps performance, the code size impact is
> completely unimportant and we can unroll in all cases. But at least
> we'll no longer burn code size on cold code.
> 
> One somewhat unrelated idea that I've had forever but not had time to
> implement: mark all functions which are only reachable via the global
> constructors rigging in the module as optsize. This would also decrease
> the impact of any more aggressive heuristics here on code size.
> 
> Modified:
>    llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
>    llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll
> 
> Modified: llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp?rev=200219&r1=200218&r2=200219&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp (original)
> +++ llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp Mon Jan 27 07:11:50 2014
> @@ -56,6 +56,7 @@
> #include "llvm/ADT/SmallVector.h"
> #include "llvm/ADT/StringExtras.h"
> #include "llvm/Analysis/AliasAnalysis.h"
> +#include "llvm/Analysis/BlockFrequencyInfo.h"
> #include "llvm/Analysis/LoopInfo.h"
> #include "llvm/Analysis/LoopIterator.h"
> #include "llvm/Analysis/LoopPass.h"
> @@ -78,6 +79,7 @@
> #include "llvm/IR/Value.h"
> #include "llvm/IR/Verifier.h"
> #include "llvm/Pass.h"
> +#include "llvm/Support/BranchProbability.h"
> #include "llvm/Support/CommandLine.h"
> #include "llvm/Support/Debug.h"
> #include "llvm/Support/PatternMatch.h"
> @@ -980,18 +982,27 @@ struct LoopVectorize : public FunctionPa
>   LoopInfo *LI;
>   TargetTransformInfo *TTI;
>   DominatorTree *DT;
> +  BlockFrequencyInfo *BFI;
>   TargetLibraryInfo *TLI;
>   bool DisableUnrolling;
>   bool AlwaysVectorize;
> 
> +  BlockFrequency ColdEntryFreq;
> +
>   virtual bool runOnFunction(Function &F) {
>     SE = &getAnalysis<ScalarEvolution>();
>     DL = getAnalysisIfAvailable<DataLayout>();
>     LI = &getAnalysis<LoopInfo>();
>     TTI = &getAnalysis<TargetTransformInfo>();
>     DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
> +    BFI = &getAnalysis<BlockFrequencyInfo>();
>     TLI = getAnalysisIfAvailable<TargetLibraryInfo>();
> 
> +    // Compute some weights outside of the loop over the loops. Compute this
> +    // using a BranchProbability to re-use its scaling math.
> +    const BranchProbability ColdProb(1, 5); // 20%
> +    ColdEntryFreq = BlockFrequency(BFI->getEntryFreq()) * ColdProb;
> +
>     // If the target claims to have no vector registers don't attempt
>     // vectorization.
>     if (!TTI->getNumberOfRegisters(true))
> @@ -1064,6 +1075,13 @@ struct LoopVectorize : public FunctionPa
>     bool OptForSize =
>         Hints.Force != 1 && F->hasFnAttribute(Attribute::OptimizeForSize);
> 
> +    // Compute the weighted frequency of this loop being executed and see if it
> +    // is less than 20% of the function entry baseline frequency. Note that we
> +    // always have a canonical loop here because we think we *can* vectorize.
> +    BlockFrequency LoopEntryFreq = BFI->getBlockFreq(L->getLoopPreheader());
> +    if (Hints.Force != 1 && LoopEntryFreq < ColdEntryFreq)
> +      OptForSize = true;
> +
>     // Check the function attributes to see if implicit floats are allowed.a
>     // FIXME: This check doesn't seem possibly correct -- what if the loop is
>     // an integer loop and the vector instructions selected are purely integer
> @@ -1109,6 +1127,7 @@ struct LoopVectorize : public FunctionPa
>   virtual void getAnalysisUsage(AnalysisUsage &AU) const {
>     AU.addRequiredID(LoopSimplifyID);
>     AU.addRequiredID(LCSSAID);
> +    AU.addRequired<BlockFrequencyInfo>();
>     AU.addRequired<DominatorTreeWrapperPass>();
>     AU.addRequired<LoopInfo>();
>     AU.addRequired<ScalarEvolution>();
> @@ -5469,6 +5488,7 @@ char LoopVectorize::ID = 0;
> static const char lv_name[] = "Loop Vectorization";
> INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
> INITIALIZE_AG_DEPENDENCY(TargetTransformInfo)
> +INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfo)
> INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
> INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)
> INITIALIZE_PASS_DEPENDENCY(LCSSA)
> 
> Modified: llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll?rev=200219&r1=200218&r2=200219&view=diff
> ==============================================================================
> --- llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll (original)
> +++ llvm/trunk/test/Transforms/LoopVectorize/X86/small-size.ll Mon Jan 27 07:11:50 2014
> @@ -115,6 +115,31 @@ define void @example3(i32 %n, i32* noali
>   ret void
> }
> 
> +; N is unknown, we need a tail. Can't vectorize because the loop is cold.
> +;CHECK-LABEL: @example4(
> +;CHECK-NOT: <4 x i32>
> +;CHECK: ret void
> +define void @example4(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) {
> +  %1 = icmp eq i32 %n, 0
> +  br i1 %1, label %._crit_edge, label %.lr.ph, !prof !0
> +
> +.lr.ph:                                           ; preds = %0, %.lr.ph
> +  %.05 = phi i32 [ %2, %.lr.ph ], [ %n, %0 ]
> +  %.014 = phi i32* [ %5, %.lr.ph ], [ %p, %0 ]
> +  %.023 = phi i32* [ %3, %.lr.ph ], [ %q, %0 ]
> +  %2 = add nsw i32 %.05, -1
> +  %3 = getelementptr inbounds i32* %.023, i64 1
> +  %4 = load i32* %.023, align 16
> +  %5 = getelementptr inbounds i32* %.014, i64 1
> +  store i32 %4, i32* %.014, align 16
> +  %6 = icmp eq i32 %2, 0
> +  br i1 %6, label %._crit_edge, label %.lr.ph
> +
> +._crit_edge:                                      ; preds = %.lr.ph, %0
> +  ret void
> +}
> +
> +!0 = metadata !{metadata !"branch_weights", i32 64, i32 4}
> 
> ; We can't vectorize this one because we need a runtime ptr check.
> ;CHECK-LABEL: @example23(
> 
> 