[llvm] r208255 - [X86TTI] Remove the unrolling branch limits
Hal Finkel
hfinkel at anl.gov
Wed May 7 16:35:18 PDT 2014
----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at cs.uiuc.edu
> Sent: Wednesday, May 7, 2014 6:31:02 PM
> Subject: Re: [llvm] r208255 - [X86TTI] Remove the unrolling branch limits
>
>
> On May 7, 2014, at 3:25 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> > Author: hfinkel
> > Date: Wed May 7 17:25:18 2014
> > New Revision: 208255
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=208255&view=rev
> > Log:
> > [X86TTI] Remove the unrolling branch limits
> >
> > The loop stream detector (LSD) on modern Intel cores, which optimizes the
> > execution of small loops, has limits on the number of taken branches in
> > addition to uop-count limits (modern AMD cores have similar limits).
> > Unfortunately, at the IR level, estimating the number of branches that will
> > be taken is difficult. For one thing, it strongly depends on later passes
> > (block placement, etc.). The original implementation took a conservative
> > approach and limited the maximal BB DFS depth of the loop. However,
> > fairly-extensive benchmarking by several of us has revealed that
>
> Hi Hal,
>
> I think that removing the branch count limit makes sense. Do you mind
> sharing the performance data? Were there any regressions or performance
> wins?
>
> I am asking about the performance data because I am guessing that some
> benchmarks benefited from this heuristic; otherwise it wouldn’t have made
> it in.
Chandler and Louis did most of the benchmarking (I only confirmed that I saw no regressions); I'll let them comment (some of Louis's results are also discussed in another thread re: r207940).
-Hal
>
> Thanks,
> Nadav
>
>
> > this is the wrong approach. In fact, there are zero known cases where the
> > branch limit prevents a detrimental unrolling (but plenty of cases where it
> > does prevent beneficial unrolling).
> >
> > While we could improve the current branch counting logic by incorporating
> > branch probabilities, this further complication seems unjustified without a
> > motivating regression. Instead, unless and until a regression appears, the
> > branch counting will be removed.
> >
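
[For anyone curious what "incorporating branch probabilities" might look like
if a motivating regression ever turns up: here is a minimal sketch, not part of
r208255. The function name, the header paths, and the choice to weight every
in-loop CFG edge by its BranchProbabilityInfo probability are my own
assumptions. It only estimates the expected number of in-loop edges traversed
per iteration; whether those edges become taken branches still depends on later
block placement, which is exactly the difficulty the log describes.]

  // Hypothetical sketch only -- not part of r208255. Header paths and the
  // exact weighting are assumptions and may not match a given tree.
  #include "llvm/Analysis/BranchProbabilityInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/CFG.h"

  using namespace llvm;

  // Estimate the expected number of in-loop CFG edges traversed per iteration
  // by weighting each in-loop edge with its BranchProbabilityInfo probability.
  // This is still only an estimate of taken branches: block placement later
  // turns many of these edges into fall-throughs.
  static double expectedBranchesPerIteration(const Loop *L,
                                             const BranchProbabilityInfo &BPI) {
    double Expected = 0.0;
    for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
         I != E; ++I) {
      const BasicBlock *BB = *I;
      for (succ_const_iterator SI = succ_begin(BB), SE = succ_end(BB);
           SI != SE; ++SI) {
        if (!L->contains(*SI))
          continue; // Loop exits don't count against the in-loop budget.
        BranchProbability P = BPI.getEdgeProbability(BB, *SI);
        Expected += double(P.getNumerator()) / double(P.getDenominator());
      }
    }
    return Expected;
  }
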
> > Modified:
> > llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> >
> > Modified: llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp?rev=208255&r1=208254&r2=208255&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp (original)
> > +++ llvm/trunk/lib/Target/X86/X86TargetTransformInfo.cpp Wed May 7 17:25:18 2014
> > @@ -41,10 +41,6 @@ UsePartialUnrolling("x86-use-partial-unr
> >  static cl::opt<unsigned>
> >  PartialUnrollingThreshold("x86-partial-unrolling-threshold", cl::init(0),
> >    cl::desc("Threshold for X86 partial unrolling"), cl::Hidden);
> > -static cl::opt<unsigned>
> > -PartialUnrollingMaxBranches("x86-partial-max-branches", cl::init(2),
> > -  cl::desc("Threshold for taken branches in X86 partial unrolling"),
> > -  cl::Hidden);
> >
> >  namespace {
> >
> > @@ -172,49 +168,38 @@ void X86TTI::getUnrollingPreferences(Loo
> >    // - The loop must have fewer than 16 branches
> >    // - The loop must have less than 40 uops in all executed loop branches
> >
> > -  unsigned MaxBranches, MaxOps;
> > +  // The number of taken branches in a loop is hard to estimate here, and
> > +  // benchmarking has revealed that it is better not to be conservative when
> > +  // estimating the branch count. As a result, we'll ignore the branch limits
> > +  // until someone finds a case where it matters in practice.
> > +
> > +  unsigned MaxOps;
> >    if (PartialUnrollingThreshold.getNumOccurrences() > 0) {
> > -    MaxBranches = PartialUnrollingMaxBranches;
> >      MaxOps = PartialUnrollingThreshold;
> >    } else if (ST->isAtom()) {
> >      // On the Atom, the throughput for taken branches is 2 cycles. For small
> >      // simple loops, expand by a small factor to hide the backedge cost.
> > -    MaxBranches = 2;
> >      MaxOps = 10;
> >    } else if (ST->hasFSGSBase() && ST->hasXOP() /* Steamroller and later */) {
> > -    MaxBranches = 16;
> >      MaxOps = 40;
> >    } else if (ST->hasFMA4() /* Any other recent AMD */) {
> >      return;
> >    } else if (ST->hasAVX() || ST->hasSSE42() /* Nehalem and later */) {
> > -    MaxBranches = 8;
> >      MaxOps = 28;
> >    } else if (ST->hasSSSE3() /* Intel Core */) {
> > -    MaxBranches = 4;
> >      MaxOps = 18;
> >    } else {
> >      return;
> >    }
> >
> > -  // Scan the loop: don't unroll loops with calls, and count the potential
> > -  // number of taken branches (this is somewhat conservative because we're
> > -  // counting all block transitions as potential branches while in reality some
> > -  // of these will become implicit via block placement).
> > -  unsigned MaxDepth = 0;
> > -  for (df_iterator<BasicBlock*> DI = df_begin(L->getHeader()),
> > -       DE = df_end(L->getHeader()); DI != DE;) {
> > -    if (!L->contains(*DI)) {
> > -      DI.skipChildren();
> > -      continue;
> > -    }
> > -
> > -    MaxDepth = std::max(MaxDepth, DI.getPathLength());
> > -    if (MaxDepth > MaxBranches)
> > -      return;
> > -
> > -    for (BasicBlock::iterator I = DI->begin(), IE = DI->end(); I != IE; ++I)
> > -      if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
> > -        ImmutableCallSite CS(I);
> > +  // Scan the loop: don't unroll loops with calls.
> > +  for (Loop::block_iterator I = L->block_begin(), E = L->block_end();
> > +       I != E; ++I) {
> > +    BasicBlock *BB = *I;
> > +
> > +    for (BasicBlock::iterator J = BB->begin(), JE = BB->end(); J != JE; ++J)
> > +      if (isa<CallInst>(J) || isa<InvokeInst>(J)) {
> > +        ImmutableCallSite CS(J);
> >          if (const Function *F = CS.getCalledFunction()) {
> >            if (!isLoweredToCall(F))
> >              continue;
> > @@ -222,23 +207,11 @@ void X86TTI::getUnrollingPreferences(Loo
> >
> >          return;
> >        }
> > -
> > -    ++DI;
> >    }
> >
> >    // Enable runtime and partial unrolling up to the specified size.
> >    UP.Partial = UP.Runtime = true;
> >    UP.PartialThreshold = UP.PartialOptSizeThreshold = MaxOps;
> > -
> > -  // Set the maximum count based on the loop depth. The maximum number of
> > -  // branches taken in a loop (including the backedge) is equal to the maximum
> > -  // loop depth (the DFS path length from the loop header to any block in the
> > -  // loop). When the loop is unrolled, this depth (except for the backedge
> > -  // itself) is multiplied by the unrolling factor. This new unrolled depth
> > -  // must be less than the target-specific maximum branch count (which limits
> > -  // the number of taken branches in the uop buffer).
> > -  if (MaxDepth > 1)
> > -    UP.MaxCount = (MaxBranches-1)/(MaxDepth-1);
> >  }
> >
> >  unsigned X86TTI::getNumberOfRegisters(bool Vector) const {
> >
> >
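
[To make the conservatism of the removed cap concrete: the old code set
UP.MaxCount = (MaxBranches-1)/(MaxDepth-1). The standalone snippet below is an
illustration only, not code from the tree; it just replays that integer
arithmetic for a few subtarget-limit/loop-shape combinations. On a Core-class
target (MaxBranches=4) a three-block loop body was already pinned to an unroll
count of 1, which matches the "prevents beneficial unrolling" observation in
the log.]

  #include <cstdio>

  // Illustration only: replay the removed cap
  //   UP.MaxCount = (MaxBranches - 1) / (MaxDepth - 1)
  // for a few (subtarget limit, loop shape) combinations.
  int main() {
    struct Case { const char *Desc; unsigned MaxBranches, MaxDepth; };
    const Case Cases[] = {
      {"Nehalem+ (MaxBranches=8), 2-block body", 8, 2},
      {"Nehalem+ (MaxBranches=8), 4-block body", 8, 4},
      {"Core/SSSE3 (MaxBranches=4), 3-block body", 4, 3},
      {"Atom (MaxBranches=2), 2-block body", 2, 2},
    };
    for (unsigned i = 0; i < sizeof(Cases) / sizeof(Cases[0]); ++i) {
      unsigned MaxCount = (Cases[i].MaxBranches - 1) / (Cases[i].MaxDepth - 1);
      std::printf("%-42s -> UP.MaxCount = %u\n", Cases[i].Desc, MaxCount);
    }
    return 0;
  }
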
>
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory