[llvm] r208255 - [X86TTI] Remove the unrolling branch limits
Hal Finkel
hfinkel at anl.gov
Wed May 7 16:35:18 PDT 2014
----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Wednesday, May 7, 2014 6:31:02 PM
> Subject: Re: [llvm] r208255 - [X86TTI] Remove the unrolling branch limits
>
>
> On May 7, 2014, at 3:25 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> > Author: hfinkel
> > Date: Wed May 7 17:25:18 2014
> > New Revision: 208255
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=208255&view=rev
> > Log:
> > [X86TTI] Remove the unrolling branch limits
> >
> > The loop stream detector (LSD) on modern Intel cores, which optimizes the
> > execution of small loops, has limits on the number of taken branches in
> > addition to uop-count limits (modern AMD cores have similar limits).
> > Unfortunately, at the IR level, estimating the number of branches that will
> > be taken is difficult. For one thing, it strongly depends on later passes
> > (block placement, etc.). The original implementation took a conservative
> > approach and limited the maximal BB DFS depth of the loop. However,
> > fairly-extensive benchmarking by several of us has revealed that
>
> Hi Hal,
>
> I think that removing the branch count limit makes sense. Do you mind
> sharing the performance data? Were there any regressions or performance
> wins?
>
> I am asking about the performance data because I am guessing that some
> benchmarks benefited from this heuristic; otherwise it wouldn’t have made
> it in.
Chandler and Louis did most of the benchmarking (I only confirmed that I saw no regressions); I'll let them comment (some of Louis's results are also discussed in another thread re: r207940).
-Hal
>
> Thanks,
> Nadav
>
>
> > this is the wrong approach. In fact, there are zero known cases where the
> > branch limit prevents a detrimental unrolling (but plenty of cases where it
> > does prevent beneficial unrolling).
> >
> > While we could improve the current branch counting logic by incorporating
> > branch probabilities, this further complication seems unjustified without a
> > motivating regression. Instead, unless and until a regression appears, the
> > branch counting will be removed.
> >
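
[For anyone curious what "incorporating branch probabilities" might look like
if a motivating regression ever turns up: here is a minimal sketch, not part of
r208255. The function name, the header paths, and the choice to weight every
in-loop CFG edge by its BranchProbabilityInfo probability are my own
assumptions. It only estimates the expected number of in-loop edges traversed
per iteration; whether those edges become taken branches still depends on later
block placement, which is exactly the difficulty the log describes.]

  // Hypothetical sketch only -- not part of r208255. Header paths and the
  // exact weighting are assumptions and may not match a given tree.
  #include "llvm/Analysis/BranchProbabilityInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/CFG.h"

  using namespace llvm;

  // Estimate the expected number of in-loop CFG edges traversed per iteration
  // by weighting each in-loop edge with its BranchProbabilityInfo probability.
  // This is still only an estimate of taken branches: block placement later
  // turns many of these edges into fall-throughs.
  static double expectedBranchesPerIteration(const Loop *L,
                                             const BranchProbabilityInfo &BPI) {
    double Expected = 0.0;
    for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
         I != E; ++I) {
      const BasicBlock *BB = *I;
      for (succ_const_iterator SI = succ_begin(BB), SE = succ_end(BB);
           SI != SE; ++SI) {
        if (!L->contains(*SI))
          continue; // Loop exits don't count against the in-loop budget.
        BranchProbability P = BPI.getEdgeProbability(BB, *SI);
        Expected += double(P.getNumerator()) / double(P.getDenominator());
      }
    }
    return Expected;
  }
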
> > Modified:
> > llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> >
> > Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=208255&r1=208254&r2=208255&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> > +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed May 7 17:25:18 2014
> > @@ -41,10 +41,6 @@ UsePartialUnrolling("x86-use-partial-unr
> >  static cl::opt<unsigned>
> >  PartialUnrollingThreshold("x86-partial-unrolling-threshold", cl::init(0),
> >    cl::desc("Threshold for X86 partial unrolling"), cl::Hidden);
> > -static cl::opt<unsigned>
> > -PartialUnrollingMaxBranches("x86-partial-max-branches", cl::init(2),
> > -  cl::desc("Threshold for taken branches in X86 partial unrolling"),
> > -  cl::Hidden);
> >
> >  namespace {
> >
> > @@ -172,49 +168,38 @@ void X86TTI::getUnrollingPreferences(Loo
> >    // - The loop must have fewer than 16 branches
> >    // - The loop must have less than 40 uops in all executed loop branches
> >
> > -  unsigned MaxBranches, MaxOps;
> > +  // The number of taken branches in a loop is hard to estimate here, and
> > +  // benchmarking has revealed that it is better not to be conservative when
> > +  // estimating the branch count. As a result, we'll ignore the branch limits
> > +  // until someone finds a case where it matters in practice.
> > +
> > +  unsigned MaxOps;
> >    if (PartialUnrollingThreshold.getNumOccurrences() > 0) {
> > -    MaxBranches = PartialUnrollingMaxBranches;
> >      MaxOps = PartialUnrollingThreshold;
> >    } else if (ST->isAtom()) {
> >      // On the Atom, the throughput for taken branches is 2 cycles. For small
> >      // simple loops, expand by a small factor to hide the backedge cost.
> > -    MaxBranches = 2;
> >      MaxOps = 10;
> >    } else if (ST->hasFSGSBase() && ST->hasXOP() /* Steamroller and later */) {
> > -    MaxBranches = 16;
> >      MaxOps = 40;
> >    } else if (ST->hasFMA4() /* Any other recent AMD */) {
> >      return;
> >    } else if (ST->hasAVX() || ST->hasSSE42() /* Nehalem and later */) {
> > -    MaxBranches = 8;
> >      MaxOps = 28;
> >    } else if (ST->hasSSSE3() /* Intel Core */) {
> > -    MaxBranches = 4;
> >      MaxOps = 18;
> >    } else {
> >      return;
> >    }
> >
> > -  // Scan the loop: don't unroll loops with calls, and count the potential
> > -  // number of taken branches (this is somewhat conservative because we're
> > -  // counting all block transitions as potential branches while in reality some
> > -  // of these will become implicit via block placement).
> > -  unsigned MaxDepth = 0;
> > -  for (df_iterator<BasicBlock*> DI = df_begin(L->getHeader()),
> > -       DE = df_end(L->getHeader()); DI != DE;) {
> > -    if (!L->contains(*DI)) {
> > -      DI.skipChildren();
> > -      continue;
> > -    }
> > -
> > -    MaxDepth = std::max(MaxDepth, DI.getPathLength());
> > -    if (MaxDepth > MaxBranches)
> > -      return;
> > -
> > -    for (BasicBlock::iterator I = DI->begin(), IE = DI->end(); I != IE; ++I)
> > -      if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
> > -        ImmutableCallSite CS(I);
> > +  // Scan the loop: don't unroll loops with calls.
> > +  for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
> > +       I != E; ++I) {
> > +    BasicBlock *BB = *I;
> > +
> > +    for (BasicBlock::iterator J = BB->begin(), JE = BB->end(); J != JE; ++J)
> > +      if (isa<CallInst>(J) || isa<InvokeInst>(J)) {
> > +        ImmutableCallSite CS(J);
> >          if (const Function *F = CS.getCalledFunction()) {
> >            if (!isLoweredToCall(F))
> >              continue;
> > @@ -222,23 +207,11 @@ void X86TTI::getUnrollingPreferences(Loo
> >
> >          return;
> >        }
> > -
> > -    ++DI;
> >    }
> >
> >    // Enable runtime and partial unrolling up to the specified size.
> >    UP.Partial = UP.Runtime = true;
> >    UP.PartialThreshold = UP.PartialOptSizeThreshold = MaxOps;
> > -
> > -  // Set the maximum count based on the loop depth. The maximum number of
> > -  // branches taken in a loop (including the backedge) is equal to the maximum
> > -  // loop depth (the DFS path length from the loop header to any block in the
> > -  // loop). When the loop is unrolled, this depth (except for the backedge
> > -  // itself) is multiplied by the unrolling factor. This new unrolled depth
> > -  // must be less than the target-specific maximum branch count (which limits
> > -  // the number of taken branches in the uop buffer).
> > -  if (MaxDepth > 1)
> > -    UP.MaxCount = (MaxBranches-1)/(MaxDepth-1);
> >  }
> >
> >  unsigned X86TTI::getNumberOfRegisters(bool Vector) const {
> >
> >
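
[To make the conservatism of the removed cap concrete: the old code set
UP.MaxCount = (MaxBranches-1)/(MaxDepth-1). The standalone snippet below is an
illustration only, not code from the tree; it just replays that integer
arithmetic for a few subtarget-limit/loop-shape combinations. On a Core-class
target (MaxBranches=4) a three-block loop body was already pinned to an unroll
count of 1, which matches the "prevents beneficial unrolling" observation in
the log.]

  #include <cstdio>

  // Illustration only: replay the removed cap
  //   UP.MaxCount = (MaxBranches - 1) / (MaxDepth - 1)
  // for a few (subtarget limit, loop shape) combinations.
  int main() {
    struct Case { const char *Desc; unsigned MaxBranches, MaxDepth; };
    const Case Cases[] = {
      {"Nehalem+ (MaxBranches=8), 2-block body", 8, 2},
      {"Nehalem+ (MaxBranches=8), 4-block body", 8, 4},
      {"Core/SSSE3 (MaxBranches=4), 3-block body", 4, 3},
      {"Atom (MaxBranches=2), 2-block body", 2, 2},
    };
    for (unsigned i = 0; i < sizeof(Cases) / sizeof(Cases[0]); ++i) {
      unsigned MaxCount = (Cases[i].MaxBranches - 1) / (Cases[i].MaxDepth - 1);
      std::printf("%-42s -> UP.MaxCount = %u\n", Cases[i].Desc, MaxCount);
    }
    return 0;
  }
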
>
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory