[PATCH] Fix PR 23525 - Separate header mass propagation in irregular loops.
Xinliang David Li
xinliangli at gmail.com
Wed Jun 10 09:28:06 PDT 2015
On Wed, Jun 10, 2015 at 8:43 AM, Diego Novillo <dnovillo at google.com> wrote:
> On Tue, Jun 9, 2015 at 9:34 PM, Duncan P. N. Exon Smith <
> dexonsmith at apple.com> wrote:
>
> > Firstly, you didn't update the documentation, which currently has these
> > two "TODOs" (in a list of four):
> > - Backedge frequencies are assumed to be evenly split between the
> > headers of a given irreducible SCC. Instead, we could track the
> > backedge mass separately for each header, and adjust their relative
> > frequencies.
> > - Entry frequencies are assumed to be evenly split between the headers
> > of a given irreducible SCC, which is the only option if we need to
> > compute mass in the SCC before its parent loop. Instead, we could
> > partially compute mass in the parent loop, and stop when we get to
> > the SCC. Here, we have the correct ratio of entry masses, which we
> > can use to adjust their relative frequencies. Compute mass in the
> > SCC, and then continue propagation in the parent.
> >
> > Since you've implemented the backedge frequency point, please remove it,
> > and update the rest of the doxygen to describe the current state.
>
> Done.
>
> As a follow-up to this patch, I want to address the incoming mass
> distribution into an irregular loop. Divvying up a fixed mass loses
> precision.
>
> > Secondly, I wonder whether this will regress cases we care about if you
> > don't incorporate the entry frequency point. E.g., your patch would
> > make the frequencies *way* off for the following contrived example:
> >
> >   assert(n > 0);
> >   int i = n - 2;
> >   if (n % 2 == 0)
> >     goto SeventyFive;
> >
> >   for (; i < n; ++i) {
> >     if (i % n == 75) {
> >     SeventyFive:
> >       foo(i);
> >     } else {
> >       bar(i);
> >     }
> >   }
> >
> > whereas (I think) the current in-tree code would get it roughly correct
> > (by fluke?).
>
> Actually, my patch addresses this problem in particular. This code is
> almost exactly the test case in PR 23525. The in-tree code gets it wrong
> the same way it gets PR 23525 wrong. The call to foo(i) is almost never
> taken vs the call to bar(i).
>
> We get the following frequencies for this code snippet (I wrapped it into
> a function that I called 100 times). The left column is current trunk;
> the right column is trunk plus my patch. I truncated the floating point
> figures to make them more readable:
>
>
Do you have a complete, runnable version of the example above?
David
> block-frequency-info: foo                       | block-frequency-info: foo
> - entry: float = 1.0, int = 8388608             | - entry: float = 1.0, int = 8388608
> - cond.true: float = 0.99, int = 8388600        | - cond.true: float = 0.99, int = 8388600
> - cond.false: float = 0.00, int = 8             | - cond.false: float = 0.00, int = 8
> - : float = 0.0, int = 0                        | - : float = 0.0, int = 0
> - cond.end: float = 0.99, int = 8388600         | - cond.end: float = 0.99, int = 8388600
> - if.then: float = 0.49, int = 4152772          | - if.then: float = 0.49, int = 4152772
> - if.end: float = 0.50, int = 4235827           | - if.end: float = 0.50, int = 4235827
> => - for.cond: float = 2.49, int = 20971499     | - for.cond: float = 2.88, int = 24197884
> - for.body: float = 1.49, int = 12582899        | - for.body: float = 1.49, int = 12582899
> => - if.then.5: float = 0.02, int = 249991      | - if.then.5: float = 0.02, int = 249991
> => - SeventyFive: float = 2.49, int = 20971499  | - SeventyFive: float = 0.03, int = 288451
> - if.else: float = 1.47, int = 12332908         | - if.else: float = 1.47, int = 12332908
> - if.end.7: float = 2.49, int = 20971499        | - if.end.7: float = 4.58, int = 38428163
> => - for.inc: float = 2.49, int = 20971499      | - for.inc: float = 2.49, int = 20971499
> - for.end: float = 0.99, int = 8388600          | - for.end: float = 0.99, int = 8388600
>
> I highlighted the key blocks (marked with "=>"), in particular if.then.5
> and SeventyFive. In current trunk, if.then.5 is correctly given a very low
> frequency (i % n == 75 is almost never true), but block SeventyFive is
> given the same frequency as for.cond and for.inc. That's wrong.
>
> With this patch, we correctly compute SeventyFive's frequency to a fairly
> low value.
>
> > Has your patch improved the workload where the previous approximation
> > was causing you problems? Has it regressed any others? What has it
> > done for the LNT test-suite?
>
> I'm in the process of running various benchmarks, and I'll run LNT too.
> The patch fixed the problem in our internal code, but I have yet to run
> the full benchmark suite.
>
>
> > Thirdly, what's your intuition for the kind of approximation we really
> > need? Is this the right step forward?
>
> Short of a full re-implementation, I think the current model is fine with
> a couple of additional tweaks. I don't think we need to go all out in
> trying to model irreducible flow. The tweaks I'm planning address your
> TODO notes on iterating a couple of times.
>
> > As an historical note, irreducible edges were ignored entirely before I
> > rewrote BFI, so the bar for improving/maintaining the irreducible
> > control flow approximation was pretty low. But I wonder whether
> > shoe-horning irreducible control flow into loop data structures (as I've
> > done, and this patch would continue) is the right approach. Maybe they
> > should be more first-class somehow?
>
> Perhaps there is some code we could coalesce by treating them like
> regular loops that happen to have more than one header node.
>
> > For example, another approach for an irreducible subgraph: distribute
> > mass from each "header" separately (in its own subgraph), thereby
> > calculating the probability that we transition from each header to each
> > other, giving us a probability matrix for a Markov Chain. Then,
> > *waves-hands-furiously*, out pop accurate block masses for the headers
> > (and thusly the other nodes).
>
> If we take the initial mass as inherited from the mass flowing into each
> header, we could add a fixed number of iterations and call it a day. I
> *think* that should be enough.
>
> > Given that you've already done the work, I'm happy for this to go in
> > basically as is, provided that it doesn't cause compile-time or run-time
> > regressions. But I'd like to get a sense of where this is headed, and
> > whether incrementally improving the approximation for irreducible
> > control flow will address the use cases you care about.
>
> Sure. I think that the current scheme will serve us well for now. As we
> find more cases, we can re-evaluate. I don't really want to over-engineer
> an approach to irreducible flow since it's not all that common (FLW ;)
>
> >> @@ -411,8 +414,10 @@
> >>
> >>    // Check for a backedge.
> >>    if (W.Type == Weight::Backedge) {
> >> -    OuterLoop->BackedgeMass += Taken;
> >> -    DEBUG(debugAssign(BlockNode(), Taken, "back"));
> >> +    if (W.TargetNode.Index >= OuterLoop->BackedgeMass.size())
> >> +      OuterLoop->BackedgeMass.resize(W.TargetNode.Index + 1);
> >
> > This doesn't make sense to me. IIUC, you're using the index of the loop
> > header in the RPOT of the function. Even without irreducible control
> > flow, this adds quadratic memory usage (it scales with the number of
> > blocks times the number of loops, and there can be as many loops as
> > there are blocks in the worst case).
>
> Ah, yes.
>
> > I think you have two options:
> >
> > - Force the size of the `BackedgeMass` vector to be exactly
> > `NumHeaders`. In this code, do a binary search through the range
> > between `NodeList.begin()` and `NodeList.begin() + NumHeaders` to
> > find the index of the header in `NodeList`, and use the same index
> > for `BackedgeMass`. This will have logarithmic runtime cost in the
> > number of loop headers (locally), and linear memory requirements
> > overall.
> > - Use a `SmallDenseMap<BlockNode, BlockMass, 2>`.
> >
> > I prefer the former approach, unless you have some reason against it?
>
> No particular preference. The map approach seems a tad more convenient to
> implement, though. Seems to me that it would also make the code more
> readable?
>
>
> >> +#ifndef NDEBUG
> >> +  auto debugAssign = [&](const BlockNode &T, const BlockMass &M,
> >> +                         const char *Desc) {
> >> +    dbgs() << " => assign " << M << " (" << D.RemMass << ")";
> >> +    if (Desc)
> >> +      dbgs() << " [" << Desc << "]";
> >> +    if (T.isValid())
> >> +      dbgs() << " to " << getBlockName(T);
> >> +    dbgs() << "\n";
> >> +  };
> >> +  (void)debugAssign;
> >
> > Looks like this has been duplicated. Can it be moved to a `static`
> > function?
>
> Yeah, I was going to do this but then forgot.
>
> >
> >> +#endif
> >> +
> >> +  DEBUG(dbgs() << " Distribute loop mass " << LoopMass
> >> +               << " to headers using above weights\n");
> >> +  for (const Weight &W : Dist.Weights) {
> >> +    BlockMass Taken = D.takeMass(W.Amount);
> >> +    assert(W.Type == Weight::Local && "all weights should be local");
> >> +    Working[W.TargetNode.Index].getMass() = Taken;
> >> +    DEBUG(debugAssign(W.TargetNode, Taken, nullptr));
> >> +  }
> >
> > The way `Distribution` and `DitheringDistributer` work together is
> > pretty heavy for what you're doing here (e.g., with the local vs.
> > backedge vs. exit distinction). Moreover, I think the simple behaviour
> > needed here would be useful outside of BFI. Do you have any ideas for
> > how to restructure those classes and/or factor out the common logic?
>
> Well, all I really wanted originally was to compute ratios that I should
> use to alter the initial mass for each header, but there is no easy way to
> alter a block mass by a floating point fraction. The mass distributor
> seemed like the best approach.
>
> In a followup patch I could move the mass distributor out. But where else
> are you thinking that this would be useful?
>
>
> Thanks. Diego.
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>