[PATCH] Fix PR 23525 - Separate header mass propagation in irregular loops.

Duncan P. N. Exon Smith dexonsmith at apple.com
Tue Jun 9 18:34:17 PDT 2015


> On 2015-Jun-09, at 15:07, Diego Novillo <dnovillo at google.com> wrote:
> 
> Hi dexonsmith,
> 
> When propagating mass through irregular loops, the mass flowing through
> each loop header may not be equal. This was causing wrong frequencies
> to be computed for irregular loop headers.
> 
> Fixed by keeping track of masses flowing through each of the headers in
> an irregular loop. To do this, we now keep track of per-header backedge
> weights. After the loop mass is distributed through the loop, the
> backedge weights are used to re-distribute the loop mass to the loop
> headers.
> 
> Since each backedge will have a mass proportional to its branch
> weight, the loop headers will end up with a more accurate weight
> distribution (as opposed to the current distribution, which assumes
> that every loop header is the same).
> 
> http://reviews.llvm.org/D10348
> 
> Files:
>  include/llvm/Analysis/BlockFrequencyInfoImpl.h
>  lib/Analysis/BlockFrequencyInfoImpl.cpp
>  test/Analysis/BlockFrequencyInfo/PR23525.ll
>  test/Analysis/BlockFrequencyInfo/irreducible.ll
> 

Starting with a few high-level comments/questions.

Firstly, you didn't update the documentation, which currently has these
two "TODOs" (in a list of four):
  - Backedge frequencies are assumed to be evenly split between the
    headers of a given irreducible SCC.  Instead, we could track the
    backedge mass separately for each header, and adjust their relative
    frequencies.
  - Entry frequencies are assumed to be evenly split between the headers
    of a given irreducible SCC, which is the only option if we need to
    compute mass in the SCC before its parent loop.  Instead, we could
    partially compute mass in the parent loop, and stop when we get to
    the SCC.  Here, we have the correct ratio of entry masses, which we
    can use to adjust their relative frequencies.  Compute mass in the
    SCC, and then continue propagation in the parent.

Since you've implemented the backedge frequency point, please remove it,
and update the rest of the doxygen to describe the current state.

Secondly, I wonder whether this will regress cases we care about if you
don't incorporate the entry frequency point.  E.g., your patch would
make the frequencies *way* off for the following contrived example:

    assert(n > 0);
    int i = n - 2;
    if (n % 2 == 0)
      goto SeventyFive;

    for (; i < n; ++i) {
      if (i % n == 75) {
    SeventyFive:
        foo(i);
      } else {
        bar(i);
      }
    }

whereas (I think) the current in-tree code would get it roughly correct
(by fluke?).  The patch redistributes the headers' mass using backedge
weights alone, ignoring that roughly half of the entry mass arrives
directly at `SeventyFive` via the `goto`; the even split the in-tree
code assumes happens to match that 50/50 entry.

Has your patch improved the workload where the previous approximation
was causing you problems?  Has it regressed any others?  What has it
done for the LNT test-suite?

Thirdly, what's your intuition for the kind of approximation we really
need?  Is this the right step forward?

As an historical note, irreducible edges were ignored entirely before I
rewrote BFI, so the bar for improving/maintaining the irreducible
control flow approximation was pretty low.  But I wonder whether
shoe-horning irreducible control flow into loop data structures (as I've
done, and this patch would continue) is the right approach.  Maybe they
should be more first-class somehow?

For example, another approach for an irreducible subgraph: distribute
mass from each "header" separately (in its own subgraph), thereby
calculating the probability that we transition from each header to each
of the others, giving us a probability matrix for a Markov chain.  Then,
*waves-hands-furiously*, out pop accurate block masses for the headers
(and thus for the other nodes).
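
To make the hand-waving slightly less furious, here's the kind of
iteration I have in mind, in toy form (a hypothetical
`findHeaderMasses` over a made-up 3-header matrix, using doubles
instead of `BlockMass`):

    #include <array>
    #include <cmath>

    // Toy sketch: P[I][J] is the probability that, after leaving header I,
    // control next reaches header J (computed by distributing mass from
    // each header through its own copy of the subgraph).  Power iteration
    // converges to the stationary distribution -- the relative masses of
    // the three headers -- assuming the chain is irreducible and aperiodic.
    static std::array<double, 3> findHeaderMasses(const double P[3][3]) {
      std::array<double, 3> Mass = {1.0 / 3, 1.0 / 3, 1.0 / 3};
      for (int Step = 0; Step != 100; ++Step) {
        std::array<double, 3> Next = {0, 0, 0};
        for (int I = 0; I != 3; ++I)
          for (int J = 0; J != 3; ++J)
            Next[J] += Mass[I] * P[I][J];
        double Delta = 0;
        for (int J = 0; J != 3; ++J)
          Delta += std::fabs(Next[J] - Mass[J]);
        Mass = Next;
        if (Delta < 1e-9) // Good enough for an approximation.
          break;
      }
      return Mass;
    }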

For example, a third approach: highlight a couple of types of
irreducible control flow structures that we actually care about,
recognize which type we have, and use an approximation that works
reasonably well for it.

Given that you've already done the work, I'm happy for this to go in
basically as is, provided that it doesn't cause compile-time or run-time
regressions.  But I'd like to get a sense of where this is headed, and
whether incrementally improving the approximation for irreducible
control flow will address the use cases you care about.

I also have a few comments inline below.

> Index: include/llvm/Analysis/BlockFrequencyInfoImpl.h
> ===================================================================
> --- include/llvm/Analysis/BlockFrequencyInfoImpl.h
> +++ include/llvm/Analysis/BlockFrequencyInfoImpl.h
> @@ -196,12 +196,13 @@
>    struct LoopData {
>      typedef SmallVector<std::pair<BlockNode, BlockMass>, 4> ExitMap;
>      typedef SmallVector<BlockNode, 4> NodeList;
> -    LoopData *Parent;       ///< The parent loop.
> -    bool IsPackaged;        ///< Whether this has been packaged.
> -    uint32_t NumHeaders;    ///< Number of headers.
> -    ExitMap Exits;          ///< Successor edges (and weights).
> -    NodeList Nodes;         ///< Header and the members of the loop.
> -    BlockMass BackedgeMass; ///< Mass returned to loop header.
> +    typedef SmallVector<BlockMass, 1> HeaderMassList;

This data type makes sense to me, but not the way you've indexed it.
See my comments below in `BlockFrequencyInfoImplBase::distributeMass()`.

> +    LoopData *Parent;            ///< The parent loop.
> +    bool IsPackaged;             ///< Whether this has been packaged.
> +    uint32_t NumHeaders;         ///< Number of headers.
> +    ExitMap Exits;               ///< Successor edges (and weights).
> +    NodeList Nodes;              ///< Header and the members of the loop.
> +    HeaderMassList BackedgeMass; ///< Mass returned to each loop header.
>      BlockMass Mass;
>      Scaled64 Scale;
>  
> @@ -431,6 +432,16 @@
>    /// \brief Compute the loop scale for a loop.
>    void computeLoopScale(LoopData &Loop);
>  
> +  /// Adjust the mass of all headers in an irreducible loop.
> +  ///
> +  /// Initially, irreducible loops are assumed to distribute their mass
> +  /// equally among their headers. This can lead to wrong frequency
> +  /// estimates, since some headers may execute more frequently than others.
> +  ///
> +  /// This adjusts header mass distribution so it matches the weights of
> +  /// the backedges going into each of the loop headers.
> +  void adjustLoopHeaderMass(LoopData &Loop);
> +
>    /// \brief Package up a loop.
>    void packageLoop(LoopData &Loop);
>  
> @@ -1042,6 +1053,8 @@
>      for (const BlockNode &M : Loop.Nodes)
>        if (!propagateMassToSuccessors(&Loop, M))
>          llvm_unreachable("unhandled irreducible control flow");
> +    adjustLoopHeaderMass(Loop);
>    } else {
>      Working[Loop.getHeader().Index].getMass() = BlockMass::getFull();
>      if (!propagateMassToSuccessors(&Loop, Loop.getHeader()))
> Index: lib/Analysis/BlockFrequencyInfoImpl.cpp
> ===================================================================
> --- lib/Analysis/BlockFrequencyInfoImpl.cpp
> +++ lib/Analysis/BlockFrequencyInfoImpl.cpp
> @@ -411,8 +414,10 @@
>  
>      // Check for a backedge.
>      if (W.Type == Weight::Backedge) {
> -      OuterLoop->BackedgeMass += Taken;
> -      DEBUG(debugAssign(BlockNode(), Taken, "back"));
> +      if (W.TargetNode.Index >= OuterLoop->BackedgeMass.size())
> +        OuterLoop->BackedgeMass.resize(W.TargetNode.Index + 1);

This doesn't make sense to me.  IIUC, you're using the index of the loop
header in the RPOT of the function.  Even without irreducible control
flow, this adds quadratic memory usage (scales with the number of blocks times
the number of loops, and there can be as many loops as there are blocks
in the worst case).

I think you have two options:

  - Force the size of the `BackedgeMass` vector to be exactly
    `NumHeaders`.  In this code, do a binary search through the range
    between `Nodes.begin()` and `Nodes.begin() + NumHeaders` to find
    the index of the header in `Nodes`, and use the same index for
    `BackedgeMass`.  This will have logarithmic runtime cost in the
    number of loop headers (locally), and linear memory requirements
    overall.
  - Use a `SmallDenseMap<BlockNode, BlockMass, 2>`.

I prefer the former approach, unless you have some reason against it?

Either way, you should encapsulate this logic (including the line that
follows) in a call to `OuterLoop`, e.g.
`LoopData::addBackedgeMass(BlockNode,BlockMass)`.
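
Roughly, for the former (a sketch only, untested; this assumes
`BlockNode` compares by `Index` and that the headers stay sorted at the
front of `Nodes`):

    #include <algorithm>
    #include <cassert>

    /// Find a header's slot in BackedgeMass, which is sized to NumHeaders.
    ///
    /// The headers sit (sorted) at the front of Nodes, so a binary search
    /// over [Nodes.begin(), Nodes.begin() + NumHeaders) finds the index.
    size_t LoopData::getHeaderIndex(const BlockNode &B) const {
      auto HeadersBegin = Nodes.begin();
      auto HeadersEnd = Nodes.begin() + NumHeaders;
      auto I = std::lower_bound(HeadersBegin, HeadersEnd, B);
      assert(I != HeadersEnd && *I == B && "expected a loop header");
      return I - HeadersBegin;
    }

    /// Add mass to the slot for B's backedge.
    void LoopData::addBackedgeMass(const BlockNode &B, BlockMass Mass) {
      BackedgeMass[getHeaderIndex(B)] += Mass;
    }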

> +      OuterLoop->BackedgeMass[W.TargetNode.Index] += Taken;
> +      DEBUG(debugAssign(W.TargetNode, Taken, "back"));
>        continue;
>      }
>  
> @@ -713,10 +718,56 @@
>  void
>  BlockFrequencyInfoImplBase::updateLoopWithIrreducible(LoopData &OuterLoop) {
>    OuterLoop.Exits.clear();
> -  OuterLoop.BackedgeMass = BlockMass::getEmpty();
> +  OuterLoop.BackedgeMass.clear();
>    auto O = OuterLoop.Nodes.begin() + 1;
>    for (auto I = O, E = OuterLoop.Nodes.end(); I != E; ++I)
>      if (!Working[I->Index].isPackaged())
>        *O++ = *I;
>    OuterLoop.Nodes.erase(O, OuterLoop.Nodes.end());
>  }
> +
> +void BlockFrequencyInfoImplBase::adjustLoopHeaderMass(LoopData &Loop) {
> +  assert(Loop.isIrreducible() && "this only makes sense on irreducible loops");
> +
> +  // Since the loop has more than one header block, the mass flowing back into
> +  // each header will be different. Adjust the mass in each loop header to
> +  // reflect the masses flowing through back edges.
> +  //
> +  // To do this, we distribute the initial mass using the backedge masses
> +  // as weights for the distribution.
> +  BlockMass LoopMass = BlockMass::getFull();
> +  Distribution Dist;
> +
> +  DEBUG(dbgs() << "adjust-loop-header-mass:\n");
> +  for (uint32_t H = 0; H < Loop.NumHeaders; ++H) {
> +    auto &HeaderNode = Loop.Nodes[H];
> +    auto &BackedgeMass = Loop.BackedgeMass[HeaderNode.Index];
> +    DEBUG(dbgs() << " - Add back edge mass for node "
> +                 << getBlockName(HeaderNode) << ": " << BackedgeMass << "\n");
> +    Dist.addLocal(HeaderNode, BackedgeMass.getMass());
> +  }
> +
> +  DitheringDistributer D(Dist, LoopMass);
> +
> +#ifndef NDEBUG
> +  auto debugAssign = [&](const BlockNode &T, const BlockMass &M,
> +                         const char *Desc) {
> +    dbgs() << "  => assign " << M << " (" << D.RemMass << ")";
> +    if (Desc)
> +      dbgs() << " [" << Desc << "]";
> +    if (T.isValid())
> +      dbgs() << " to " << getBlockName(T);
> +    dbgs() << "\n";
> +  };
> +  (void)debugAssign;

Looks like this has been duplicated.  Can it be moved to a `static`
function?
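
I.e., something like this (a sketch; parameter list to taste, assuming
the free function can reach `getBlockName()` through a `BFI` reference):

    #ifndef NDEBUG
    static void debugAssign(const BlockFrequencyInfoImplBase &BFI,
                            const DitheringDistributer &D, const BlockNode &T,
                            const BlockMass &M, const char *Desc) {
      dbgs() << "  => assign " << M << " (" << D.RemMass << ")";
      if (Desc)
        dbgs() << " [" << Desc << "]";
      if (T.isValid())
        dbgs() << " to " << BFI.getBlockName(T);
      dbgs() << "\n";
    }
    #endif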

> +#endif
> +
> +  DEBUG(dbgs() << " Distribute loop mass " << LoopMass
> +               << " to headers using above weights\n");
> +  for (const Weight &W : Dist.Weights) {
> +    BlockMass Taken = D.takeMass(W.Amount);
> +    assert(W.Type == Weight::Local && "all weights should be local");
> +    Working[W.TargetNode.Index].getMass() = Taken;
> +    DEBUG(debugAssign(W.TargetNode, Taken, nullptr));
> +  }

The way `Distribution` and `DitheringDistributer` work together is
pretty heavy for what you're doing here (e.g., with the local vs.
backedge vs. exit distinction).  Moreover, I think the simple behaviour
needed here would be useful outside of BFI.  Do you have any ideas for
how to restructure those classes and/or factor out the common logic?
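
For reference, the simple behaviour needed here boils down to a
weighted split with dithering; in toy form (plain integers rather than
`BlockMass`, and a 128-bit intermediate -- a GCC/Clang extension -- to
dodge overflow):

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Toy sketch: split Total among Weights proportionally.  Each target
    // takes its share of the *remaining* mass, so rounding error dithers
    // forward and the shares always sum exactly to Total.
    static std::vector<uint64_t>
    splitWithDithering(uint64_t Total, const std::vector<uint64_t> &Weights) {
      uint64_t RemWeight = 0;
      for (uint64_t W : Weights)
        RemWeight += W; // Assumes the sum fits in 64 bits.
      assert(RemWeight && "expected a non-empty distribution");

      std::vector<uint64_t> Shares;
      uint64_t RemTotal = Total;
      for (uint64_t W : Weights) {
        uint64_t Share = (unsigned __int128)RemTotal * W / RemWeight;
        Shares.push_back(Share);
        RemTotal -= Share;
        RemWeight -= W;
      }
      return Shares;
    }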

> +}




