[PATCH] Fix PR 23525 - Separate header mass propagation in irregular loops.

Diego Novillo dnovillo at google.com
Wed Jun 10 08:43:33 PDT 2015


On Tue, Jun 9, 2015 at 9:34 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:

> Firstly, you didn't update the documentation, which currently has these
> two "TODOs" (in a list of four):
>   - Backedge frequencies are assumed to be evenly split between the
>     headers of a given irreducible SCC.  Instead, we could track the
>     backedge mass separately for each header, and adjust their relative
>     frequencies.
>   - Entry frequencies are assumed to be evenly split between the headers
>     of a given irreducible SCC, which is the only option if we need to
>     compute mass in the SCC before its parent loop.  Instead, we could
>     partially compute mass in the parent loop, and stop when we get to
>     the SCC.  Here, we have the correct ratio of entry masses, which we
>     can use to adjust their relative frequencies.  Compute mass in the
>     SCC, and then continue propagation in the parent.
>
> Since you've implemented the backedge frequency point, please remove it,
> and update the rest of the doxygen to describe the current state.

Done.

As a follow-up to this patch, I want to address the incoming mass
distribution into an irreducible loop. Divvying up a fixed mass evenly
loses precision.

> Secondly, I wonder whether this will regress cases we care about if you
> don't incorporate the entry frequency point.  E.g., your patch would
> make the frequencies *way* off for the following contrived example:
>
>     assert(n > 0);
>     int i = n - 2;
>     if (n % 2 == 0)
>       goto SeventyFive;
>
>     for (; i < n; ++i) {
>       if (i % n == 75) {
>     SeventyFive:
>         foo(i);
>       } else {
>         bar(i);
>       }
>     }
>
> whereas (I think) the current in-tree code would get it roughly correct
> (by fluke?).

Actually, my patch addresses this problem in particular.  This code is
almost exactly the test case in PR 23525, and the in-tree code gets it
wrong the same way it gets PR 23525 wrong: the call to foo(i) is almost
never taken, whereas the call to bar(i) usually is.

We get the following frequencies for this code snippet (I wrapped it in a
function that I called 100 times).  The left column is current trunk; the
right column is trunk plus my patch.  I truncated the floating-point
figures to make them more readable:

   block-frequency-info: foo                   | block-frequency-info: foo
   - entry: float = 1.0, int = 8388608         | - entry: float = 1.0, int = 8388608
   - cond.true: float = 0.99, int = 8388600    | - cond.true: float = 0.99, int = 8388600
   - cond.false: float = 0.00, int = 8         | - cond.false: float = 0.00, int = 8
   - : float = 0.0, int = 0                    | - : float = 0.0, int = 0
   - cond.end: float = 0.99, int = 8388600     | - cond.end: float = 0.99, int = 8388600
   - if.then: float = 0.49, int = 4152772      | - if.then: float = 0.49, int = 4152772
   - if.end: float = 0.50, int = 4235827       | - if.end: float = 0.50, int = 4235827
=> - for.cond: float = 2.49, int = 20971499    | - for.cond: float = 2.88, int = 24197884
   - for.body: float = 1.49, int = 12582899    | - for.body: float = 1.49, int = 12582899
=> - if.then.5: float = 0.02, int = 249991     | - if.then.5: float = 0.02, int = 249991
=> - SeventyFive: float = 2.49, int = 20971499 | - SeventyFive: float = 0.03, int = 288451
   - if.else: float = 1.47, int = 12332908     | - if.else: float = 1.47, int = 12332908
   - if.end.7: float = 2.49, int = 20971499    | - if.end.7: float = 4.58, int = 38428163
=> - for.inc: float = 2.49, int = 20971499     | - for.inc: float = 2.49, int = 20971499
   - for.end: float = 0.99, int = 8388600      | - for.end: float = 0.99, int = 8388600

I highlighted the key blocks.  In particular, blocks if.then.5 and
SeventyFive. In current trunk, if.then.5 is correctly given a very low
frequency (i % n is almost never 75), but block SeventyFive is given the
same frequency as for.cond and for.inc. That's wrong.

With this patch, we correctly compute SeventyFive's frequency to a fairly
low value.

> Has your patch improved the workload where the previous approximation
> was causing you problems?  Has it regressed any others?  What has it
> done for the LNT test-suite?

I'm in the process of running various benchmarks, and I'll run LNT too.
The patch fixes the problem in our internal code, but I have yet to run
the full benchmarks.


> Thirdly, what's your intuition for the kind of approximation we really
> need?  Is this the right step forward?

Without thinking of a full re-implementation, I think the current model is
fine with a couple of additional tweaks.  I don't think we need to go all
out in trying to model irreducible flow.  The additional tweaks I'm
planning are to address your TODO notes on iterating a couple of times.

> As an historical note, irreducible edges were ignored entirely before I
> rewrote BFI, so the bar for improving/maintaining the irreducible
> control flow approximation was pretty low.  But I wonder whether
> shoe-horning irreducible control flow into loop data structures (as I've
> done, and this patch would continue) is the right approach.  Maybe they
> should be more first-class somehow?

Perhaps there is some coalescing of code we can do by treating them like
regular loops that happen to have more than one header node.

> For example, another approach for an irreducible subgraph: distribute
> mass from each "header" separately (in its own subgraph), thereby
> calculating the probability that we transition from each header to each
> other, giving us a probability matrix for a Markov Chain.  Then,
> *waves-hands-furiously*, out pop accurate block masses for the headers
> (and thusly the other nodes).

If we take the initial mass as inherited from the mass flowing into each
header, we could add a fixed number of iterations and call it a day.  I
*think* that should be enough.

> Given that you've already done the work, I'm happy for this to go in
> basically as is, provided that it doesn't cause compile-time or run-time
> regressions.  But I'd like to get a sense of where this is headed, and
> whether incrementally improving the approximation for irreducible
> control flow will address the use cases you care about.

Sure. I think that the current scheme will serve us well for now. As we
find more cases, we can re-evaluate. I don't really want to over-engineer
an approach to irreducible flow since it's not all that common (FLW ;)

>> @@ -411,8 +414,10 @@
>>
>>      // Check for a backedge.
>>      if (W.Type == Weight::Backedge) {
>> -      OuterLoop->BackedgeMass += Taken;
>> -      DEBUG(debugAssign(BlockNode(), Taken, "back"));
>> +      if (W.TargetNode.Index >= OuterLoop->BackedgeMass.size())
>> +        OuterLoop->BackedgeMass.resize(W.TargetNode.Index + 1);
>
> This doesn't make sense to me.  IIUC, you're using the index of the loop
> header in the RPOT of the function.  Even without irreducible control
> flow, this adds quadratic memory usage (scales with the number of blocks
> times the number of loops, and there can be as many loops as there are
> blocks in the worst case).

Ah, yes.

> I think you have two options:
>
>   - Force the size of the `BackedgeMass` vector to be exactly
>     `NumHeaders`.  In this code, do a binary search through the range
>     between `NodeList.begin()` and `NodeList.begin() + NumHeaders` to
>     find the index of the header in `NodeList`, and use the same index
>     for `BackedgeMass`.  This will have logarithmic runtime cost in the
>     number of loop headers (locally), and linear memory requirements
>     overall.
>   - Use a `SmallDenseMap<BlockNode, BlockMass, 2>`.
>
> I prefer the former approach, unless you have some reason against it?

No particular preference. The map approach seems a tad more convenient to
implement, though. Seems to me that it would also make the code more
readable?


>> +#ifndef NDEBUG
>> +  auto debugAssign = [&](const BlockNode &T, const BlockMass &M,
>> +                         const char *Desc) {
>> +    dbgs() << "  => assign " << M << " (" << D.RemMass << ")";
>> +    if (Desc)
>> +      dbgs() << " [" << Desc << "]";
>> +    if (T.isValid())
>> +      dbgs() << " to " << getBlockName(T);
>> +    dbgs() << "\n";
>> +  };
>> +  (void)debugAssign;
>
> Looks like this has been duplicated.  Can it be moved to a `static`
> function?

Yeah, I was going to do this and then forgot.

>
>> +#endif
>> +
>> +  DEBUG(dbgs() << " Distribute loop mass " << LoopMass
>> +               << " to headers using above weights\n");
>> +  for (const Weight &W : Dist.Weights) {
>> +    BlockMass Taken = D.takeMass(W.Amount);
>> +    assert(W.Type == Weight::Local && "all weights should be local");
>> +    Working[W.TargetNode.Index].getMass() = Taken;
>> +    DEBUG(debugAssign(W.TargetNode, Taken, nullptr));
>> +  }
>
> The way `Distribution` and `DitheringDistributer` work together is
> pretty heavy for what you're doing here (e.g., with the local vs.
> backedge vs. exit distinction).  Moreover, I think the simple behaviour
> needed here would be useful outside of BFI.  Do you have any ideas for
> how to restructure those classes and/or factor out the common logic?

Well, all I really wanted originally was to compute ratios that I should
use to alter the initial mass for each header, but there is no easy way to
alter a block mass by a floating point fraction. The mass distributor
seemed like the best approach.

In a follow-up patch I could move the mass distributor out. But where else
are you thinking that this would be useful?


Thanks.  Diego.