[PATCH] Fix PR 23525 - Separate header mass propagation in irregular loops.
Xinliang David Li
xinliangli at gmail.com
Wed Jun 10 09:28:06 PDT 2015
On Wed, Jun 10, 2015 at 8:43 AM, Diego Novillo <dnovillo at google.com> wrote:
> On Tue, Jun 9, 2015 at 9:34 PM, Duncan P. N. Exon Smith <
> dexonsmith at apple.com> wrote:
>
> > Firstly, you didn't update the documentation, which currently has these
> > two "TODOs" (in a list of four):
> > - Backedge frequencies are assumed to be evenly split between the
> > headers of a given irreducible SCC. Instead, we could track the
> > backedge mass separately for each header, and adjust their relative
> > frequencies.
> > - Entry frequencies are assumed to be evenly split between the headers
> > of a given irreducible SCC, which is the only option if we need to
> > compute mass in the SCC before its parent loop. Instead, we could
> > partially compute mass in the parent loop, and stop when we get to
> > the SCC. Here, we have the correct ratio of entry masses, which we
> > can use to adjust their relative frequencies. Compute mass in the
> > SCC, and then continue propagation in the parent.
> >
> > Since you've implemented the backedge frequency point, please remove it,
> > and update the rest of the doxygen to describe the current state.
>
> Done.
>
> As a follow-up to this patch, I want to address the incoming mass
> distribution into an irregular loop. Divvying up a fixed mass loses
> precision.
>
> > Secondly, I wonder whether this will regress cases we care about if you
> > don't incorporate the entry frequency point. E.g., your patch would
> > make the frequencies *way* off for the following contrived example:
> >
> >   assert(n > 0);
> >   int i = n - 2;
> >   if (n % 2 == 0)
> >     goto SeventyFive;
> >
> >   for (; i < n; ++i) {
> >     if (i % n == 75) {
> >     SeventyFive:
> >       foo(i);
> >     } else {
> >       bar(i);
> >     }
> >   }
> >
> > whereas (I think) the current in-tree code would get it roughly correct
> > (by fluke?).
>
> Actually, my patch addresses this problem in particular. This code is
> almost exactly the test case in PR 23525. The in-tree code gets it wrong
> the same way it gets PR 23525 wrong. The call to foo(i) is almost never
> taken vs the call to bar(i).
>
> We get the following frequencies for this code snippet (I wrapped it into
> a function that I called 100 times). The left column is current trunk;
> the right column is trunk plus my patch. I truncated the floating point
> figures to make them more readable:
>
>
Do you have a complete, runnable version of the example above?
David
> block-frequency-info: foo                       | block-frequency-info: foo
> - entry: float = 1.0, int = 8388608             | - entry: float = 1.0, int = 8388608
> - cond.true: float = 0.99, int = 8388600        | - cond.true: float = 0.99, int = 8388600
> - cond.false: float = 0.00, int = 8             | - cond.false: float = 0.00, int = 8
> - : float = 0.0, int = 0                        | - : float = 0.0, int = 0
> - cond.end: float = 0.99, int = 8388600         | - cond.end: float = 0.99, int = 8388600
> - if.then: float = 0.49, int = 4152772          | - if.then: float = 0.49, int = 4152772
> - if.end: float = 0.50, int = 4235827           | - if.end: float = 0.50, int = 4235827
> => - for.cond: float = 2.49, int = 20971499     | - for.cond: float = 2.88, int = 24197884
> - for.body: float = 1.49, int = 12582899        | - for.body: float = 1.49, int = 12582899
> => - if.then.5: float = 0.02, int = 249991      | - if.then.5: float = 0.02, int = 249991
> => - SeventyFive: float = 2.49, int = 20971499  | - SeventyFive: float = 0.03, int = 288451
> - if.else: float = 1.47, int = 12332908         | - if.else: float = 1.47, int = 12332908
> - if.end.7: float = 2.49, int = 20971499        | - if.end.7: float = 4.58, int = 38428163
> => - for.inc: float = 2.49, int = 20971499      | - for.inc: float = 2.49, int = 20971499
> - for.end: float = 0.99, int = 8388600          | - for.end: float = 0.99, int = 8388600
>
> I highlighted the key blocks (marked with "=>"), in particular if.then.5
> and SeventyFive. In current trunk, if.then.5 is correctly given a very low
> frequency (i % n == 75 is almost never true), but block SeventyFive is
> given the same frequency as for.cond and for.inc. That's wrong.
>
> With this patch, we correctly compute SeventyFive's frequency to a fairly
> low value.
>
> > Has your patch improved the workload where the previous approximation
> > was causing you problems? Has it regressed any others? What has it
> > done for the LNT test-suite?
>
> I'm in the process of running various benchmarks, and I'll run LNT too.
> The patch fixed the problem in our internal code, but I have yet to run
> the full benchmark suite.
>
>
> > Thirdly, what's your intuition for the kind of approximation we really
> > need? Is this the right step forward?
>
> Short of a full re-implementation, I think the current model is fine with
> a couple of additional tweaks. I don't think we need to go all out in
> trying to model irreducible flow. The tweaks I'm planning address your
> TODO notes on iterating a couple of times.
>
> > As an historical note, irreducible edges were ignored entirely before I
> > rewrote BFI, so the bar for improving/maintaining the irreducible
> > control flow approximation was pretty low. But I wonder whether
> > shoe-horning irreducible control flow into loop data structures (as I've
> > done, and this patch would continue) is the right approach. Maybe they
> > should be more first-class somehow?
>
> Perhaps there is some code we could coalesce by treating them like
> regular loops that happen to have more than one header node.
>
> > For example, another approach for an irreducible subgraph: distribute
> > mass from each "header" separately (in its own subgraph), thereby
> > calculating the probability that we transition from each header to each
> > other, giving us a probability matrix for a Markov Chain. Then,
> > *waves-hands-furiously*, out pop accurate block masses for the headers
> > (and thusly the other nodes).
>
> If we take the initial mass as inherited from the mass flowing into each
> header, we could add a fixed number of iterations and call it a day. I
> *think* that should be enough.
>
> > Given that you've already done the work, I'm happy for this to go in
> > basically as is, provided that it doesn't cause compile-time or run-time
> > regressions. But I'd like to get a sense of where this is headed, and
> > whether incrementally improving the approximation for irreducible
> > control flow will address the use cases you care about.
>
> Sure. I think that the current scheme will serve us well for now. As we
> find more cases, we can re-evaluate. I don't really want to over-engineer
> an approach to irreducible flow since it's not all that common (FLW ;)
>
> >> @@ -411,8 +414,10 @@
> >>
> >>    // Check for a backedge.
> >>    if (W.Type == Weight::Backedge) {
> >> -    OuterLoop->BackedgeMass += Taken;
> >> -    DEBUG(debugAssign(BlockNode(), Taken, "back"));
> >> +    if (W.TargetNode.Index >= OuterLoop->BackedgeMass.size())
> >> +      OuterLoop->BackedgeMass.resize(W.TargetNode.Index + 1);
> >
> > This doesn't make sense to me. IIUC, you're using the index of the loop
> > header in the RPOT of the function. Even without irreducible control
> > flow, this adds quadratic memory usage (it scales with the number of
> > blocks times the number of loops, and there can be as many loops as
> > there are blocks in the worst case).
>
> Ah, yes.
>
> > I think you have two options:
> >
> > - Force the size of the `BackedgeMass` vector to be exactly
> > `NumHeaders`. In this code, do a binary search through the range
> > between `NodeList.begin()` and `NodeList.begin() + NumHeaders` to
> > find the index of the header in `NodeList`, and use the same index
> > for `BackedgeMass`. This will have logarithmic runtime cost in the
> > number of loop headers (locally), and linear memory requirements
> > overall.
> > - Use a `SmallDenseMap<BlockNode, BlockMass, 2>`.
> >
> > I prefer the former approach, unless you have some reason against it?
>
> No particular preference. The map approach seems a tad more convenient to
> implement, though. Seems to me that it would also make the code more
> readable?
>
>
> >> +#ifndef NDEBUG
> >> +  auto debugAssign = [&](const BlockNode &T, const BlockMass &M,
> >> +                         const char *Desc) {
> >> +    dbgs() << " => assign " << M << " (" << D.RemMass << ")";
> >> +    if (Desc)
> >> +      dbgs() << " [" << Desc << "]";
> >> +    if (T.isValid())
> >> +      dbgs() << " to " << getBlockName(T);
> >> +    dbgs() << "\n";
> >> +  };
> >> +  (void)debugAssign;
> >
> > Looks like this has been duplicated. Can it be moved to a `static`
> > function?
>
> Yeah, I was going to do this but then forgot.
>
> >
> >> +#endif
> >> +
> >> +  DEBUG(dbgs() << " Distribute loop mass " << LoopMass
> >> +               << " to headers using above weights\n");
> >> +  for (const Weight &W : Dist.Weights) {
> >> +    BlockMass Taken = D.takeMass(W.Amount);
> >> +    assert(W.Type == Weight::Local && "all weights should be local");
> >> +    Working[W.TargetNode.Index].getMass() = Taken;
> >> +    DEBUG(debugAssign(W.TargetNode, Taken, nullptr));
> >> +  }
> >
> > The way `Distribution` and `DitheringDistributer` work together is
> > pretty heavy for what you're doing here (e.g., with the local vs.
> > backedge vs. exit distinction). Moreover, I think the simple behaviour
> > needed here would be useful outside of BFI. Do you have any ideas for
> > how to restructure those classes and/or factor out the common logic?
>
> Well, all I really wanted originally was to compute ratios that I should
> use to alter the initial mass for each header, but there is no easy way to
> alter a block mass by a floating point fraction. The mass distributor
> seemed like the best approach.
>
> In a followup patch I could move the mass distributor out. But where else
> are you thinking that this would be useful?
>
>
> Thanks. Diego.
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>