[llvm] r208255 - [X86TTI] Remove the unrolling branch limits

Wed May 7 16:31:02 PDT 2014

On May 7, 2014, at 3:25 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> Author: hfinkel
> Date: Wed May  7 17:25:18 2014
> New Revision: 208255
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=208255&view=rev
> Log:
> [X86TTI] Remove the unrolling branch limits
> 
> The loop stream detector (LSD) on modern Intel cores, which optimizes the
> execution of small loops, has limits on the number of taken branches in
> addition to uop-count limits (modern AMD cores have similar limits).
> Unfortunately, at the IR level, estimating the number of branches that will be
> taken is difficult. For one thing, it strongly depends on later passes (block
> placement, etc.). The original implementation took a conservative approach and
> limited the maximal BB DFS depth of the loop.  However, fairly-extensive
> benchmarking by several of us has revealed that this is the wrong approach. In
> fact, there are zero known cases where the branch limit prevents a detrimental
> unrolling (but plenty of cases where it does prevent beneficial unrolling).

Hi Hal,

I think that removing the branch count limit makes sense. Do you mind sharing the performance data? Were there any regressions or performance wins?

I am asking about the performance data because I am guessing that some benchmarks benefited from this heuristic; otherwise it wouldn’t have made it in.

Thanks,
Nadav

> 
> While we could improve the current branch counting logic by incorporating
> branch probabilities, this further complication seems unjustified without a
> motivating regression. Instead, unless and until a regression appears, the
> branch counting will be removed.
> 
> Modified:
>    llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> 
> Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=208255&r1=208254&r2=208255&view=diff
> ==============================================================================
> --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed May  7 17:25:18 2014
> @@ -41,10 +41,6 @@ UsePartialUnrolling("x86-use-partial-unr
> static cl::opt<unsigned>
> PartialUnrollingThreshold("x86-partial-unrolling-threshold", cl::init(0),
>   cl::desc("Threshold for X86 partial unrolling"), cl::Hidden);
> -static cl::opt<unsigned>
> -PartialUnrollingMaxBranches("x86-partial-max-branches", cl::init(2),
> -  cl::desc("Threshold for taken branches in X86 partial unrolling"),
> -  cl::Hidden);
> 
> namespace {
> 
> @@ -172,49 +168,38 @@ void X86TTI::getUnrollingPreferences(Loo
>   //  - The loop must have fewer than 16 branches
>   //  - The loop must have less than 40 uops in all executed loop branches
> 
> -  unsigned MaxBranches, MaxOps;
> +  // The number of taken branches in a loop is hard to estimate here, and
> +  // benchmarking has revealed that it is better not to be conservative when
> +  // estimating the branch count. As a result, we'll ignore the branch limits
> +  // until someone finds a case where it matters in practice.
> +
> +  unsigned MaxOps;
>   if (PartialUnrollingThreshold.getNumOccurrences() > 0) {
> -    MaxBranches = PartialUnrollingMaxBranches;
>     MaxOps = PartialUnrollingThreshold;
>   } else if (ST->isAtom()) {
>     // On the Atom, the throughput for taken branches is 2 cycles. For small
>     // simple loops, expand by a small factor to hide the backedge cost.
> -    MaxBranches = 2;
>     MaxOps = 10;
>   } else if (ST->hasFSGSBase() && ST->hasXOP() /* Steamroller and later */) {
> -    MaxBranches = 16;
>     MaxOps = 40;
>   } else if (ST->hasFMA4() /* Any other recent AMD */) {
>     return;
>   } else if (ST->hasAVX() || ST->hasSSE42() /* Nehalem and later */) {
> -    MaxBranches = 8;
>     MaxOps = 28;
>   } else if (ST->hasSSSE3() /* Intel Core */) {
> -    MaxBranches = 4;
>     MaxOps = 18;
>   } else {
>     return;
>   }
> 
> -  // Scan the loop: don't unroll loops with calls, and count the potential
> -  // number of taken branches (this is somewhat conservative because we're
> -  // counting all block transitions as potential branches while in reality some
> -  // of these will become implicit via block placement).
> -  unsigned MaxDepth = 0;
> -  for (df_iterator<BasicBlock*> DI = df_begin(L->getHeader()),
> -       DE = df_end(L->getHeader()); DI != DE;) {
> -    if (!L->contains(*DI)) {
> -      DI.skipChildren();
> -      continue;
> -    }
> -
> -    MaxDepth = std::max(MaxDepth, DI.getPathLength());
> -    if (MaxDepth > MaxBranches)
> -      return;
> -
> -    for (BasicBlock::iterator I = DI->begin(), IE = DI->end(); I != IE; ++I)
> -      if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
> -        ImmutableCallSite CS(I);
> +  // Scan the loop: don't unroll loops with calls.
> +  for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
> +       I != E; ++I) {
> +    BasicBlock *BB = *I;
> +
> +    for (BasicBlock::iterator J = BB->begin(), JE = BB->end(); J != JE; ++J)
> +      if (isa<CallInst>(J) || isa<InvokeInst>(J)) {
> +        ImmutableCallSite CS(J);
>         if (const Function *F = CS.getCalledFunction()) {
>           if (!isLoweredToCall(F))
>             continue;
> @@ -222,23 +207,11 @@ void X86TTI::getUnrollingPreferences(Loo
> 
>         return;
>       }
> -
> -    ++DI;
>   }
> 
>   // Enable runtime and partial unrolling up to the specified size.
>   UP.Partial = UP.Runtime = true;
>   UP.PartialThreshold = UP.PartialOptSizeThreshold = MaxOps;
> -
> -  // Set the maximum count based on the loop depth. The maximum number of
> -  // branches taken in a loop (including the backedge) is equal to the maximum
> -  // loop depth (the DFS path length from the loop header to any block in the
> -  // loop). When the loop is unrolled, this depth (except for the backedge
> -  // itself) is multiplied by the unrolling factor. This new unrolled depth
> -  // must be less than the target-specific maximum branch count (which limits
> -  // the number of taken branches in the uop buffer).
> -  if (MaxDepth > 1)
> -    UP.MaxCount = (MaxBranches-1)/(MaxDepth-1);
> }
> 
> unsigned X86TTI::getNumberOfRegisters(bool Vector) const {
> 
> 
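For reference, the effect of the removed cap can be sketched in isolation. This is a minimal standalone model, not code from the patch or from LLVM: the struct `BranchLimitResult` and the function `oldBranchLimit` are hypothetical names introduced only for illustration, and a `MaxCount` of 0 is an assumed convention here for "no cap was set". It mirrors the two removed pieces: the early return when the DFS depth exceeds `MaxBranches`, and the expression `(MaxBranches-1)/(MaxDepth-1)`.

```cpp
#include <cassert>

// Hypothetical stand-in for the outcome of the removed branch-limit logic.
struct BranchLimitResult {
  bool AllowUnroll;   // false models the removed early `return`
  unsigned MaxCount;  // 0 models "UP.MaxCount left untouched" (assumption)
};

// Models the removed heuristic for a loop whose deepest DFS path from the
// header has length MaxDepth (the header alone counts as depth 1).
BranchLimitResult oldBranchLimit(unsigned MaxBranches, unsigned MaxDepth) {
  assert(MaxBranches >= 1 && MaxDepth >= 1);

  // Removed check: a loop deeper than the branch budget was not unrolled.
  if (MaxDepth > MaxBranches)
    return {false, 0};

  // Removed cap: every unrolled copy contributes (MaxDepth - 1) branches,
  // plus one backedge for the whole loop.
  if (MaxDepth > 1)
    return {true, (MaxBranches - 1) / (MaxDepth - 1)};

  // Single-block loop: no cap was applied.
  return {true, 0};
}
```

With the old Nehalem-and-later budget of 8 branches, a loop of depth 3 would have been capped at an unroll count of (8-1)/(3-1) = 3, while the same loop under the old Atom budget of 2 would not have been unrolled at all. A loop whose depth equals the budget gets a cap of 1, which amounts to no unrolling; that is the kind of conservative outcome the commit message describes.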