[PATCH] D50433: A New Divergence Analysis for LLVM

Fri Aug 10 05:34:38 PDT 2018

simoll added a comment.

In https://reviews.llvm.org/D50433#1195085, @alex-t wrote:

> Yes, my approach requires pre-computation of post-dominance sets. Nevertheless, no predecessor walk is necessary. I compute the PDT sets using Cooper's "2 fingers" algorithm that uses linear walk of the post-dominator tree.
>  Given that the post-dominator tree is already used by the previous passes and has been constructed before the overhead is relatively small.

Ok. You still materialize the PDT sets per join point even before you know if any branch is divergent. I agree that the pre-processing cost should be negligible in the big picture. Btw, I suspect your technique could be made lazy as well.

> In fact I don't use non-filtered PDT as well. I should have describe the algorithm in more details.
>  I really use the set difference between the value source block and PHIs parent block.
> Namely:
> 
> additional "operands" set is computed as the join
> 
> for each PHI incoming value we consider the value source block post-dominance frontier. We take from it only blocks that are NOT in the post-dominance frontier of the join (PHIs) block.
>  That is the way to exclude those divergent branches that are common between the value source block and the join block.
> 
> The result set is the JOIN of the all input values difference sets.

How do you deal with terminators that have more than two successors? Example:

  switch (divInt) {
    case 0:
     v = 1.0; break;
    case 1:
     v = 20.0; break;
    default:
    // do stuff (block D)
    return;
  }
  // do other stuff, using 'v' (J)

  +----A----+
  |    |    |
  C0  C1    D
  \   |    [..]
   \  |
     J

J is control-dependent on A. Therefore, you will erase A from the PDT sets of C0 and C1. However, there exist two disjoint paths from A to J through C0 and C1, which make PHI nodes in J divergent.
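
To make the example concrete, here is a minimal label-propagation sketch on a toy CFG. It is not the code from the patch: `ToyCFG` and `findJoinBlocks` are hypothetical names, and the sketch assumes an acyclic CFG whose block indices are already in topological order. Each successor of the divergent terminator starts with its own label, and a block that receives two different labels is reported as a join point.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy CFG (hypothetical, not an LLVM type): Succs[b] lists the successors of
// block b. Assumption: the graph is acyclic and indices are topologically
// ordered, so every edge goes from a lower to a higher index.
struct ToyCFG {
  std::vector<std::vector<std::size_t>> Succs;
};

// Returns the blocks that are reachable from `Branch` along two paths that
// start at different successors and first meet at that block.
std::set<std::size_t> findJoinBlocks(const ToyCFG &G, std::size_t Branch) {
  const std::size_t NoLabel = G.Succs.size();
  assert(Branch < NoLabel && "branch block out of range");
  std::vector<std::size_t> Label(G.Succs.size(), NoLabel);
  std::set<std::size_t> Joins;

  // "Push" mode: a label is pushed from a predecessor into its successor.
  auto Push = [&](std::size_t S, std::size_t L) {
    if (Label[S] == NoLabel) {
      Label[S] = L; // first label to arrive
    } else if (Label[S] != L) {
      Joins.insert(S); // two different labels meet here
      Label[S] = S;    // from here on, S acts as its own definition
    }
  };

  // Every successor of the divergent terminator gets its own label.
  for (std::size_t S : G.Succs[Branch])
    Push(S, S);

  // One pass in topological order is enough on an acyclic graph.
  for (std::size_t B = Branch + 1; B < G.Succs.size(); ++B) {
    if (Label[B] == NoLabel)
      continue; // not reachable from the branch
    for (std::size_t S : G.Succs[B])
      Push(S, Label[B]);
  }
  return Joins;
}
```

With A=0, C0=1, C1=2, D=3 and J=4, the labels of C0 and C1 meet in J, so J is reported as a join point even though D never reaches it.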

> Unfortunately, I cannot paste here the formal definition that would look more comprehensive just because I have no idea how to insert TEX or another mean of writing equations here :)

Funnily enough, there is support for stuff like "[ctrl] + [alt] + [del]" but I couldn't find any for math ;)

> Okay. Let's say on iteration N we compute the branch divergence as 1 (divergent). We mark all the joins as divergent.
>  The question is which iteration the PHIs users are updated? For the PHIs that are dominated by the branch, I guess, users will be updated in the same iteration because by the moment PHIs are processed they already have the "divergent" flag.

After `DA::UpdatePHINode` has computed a new divergence value for the PHI node, it puts all users of the PHI node on the worklist.
That mechanism is purely worklist based and independent of dominance.
The DA side of divergence propagation is implemented in `DA::propagateX(..)` methods where X=`Loop|Branch|Join`. The `Loop` and `Branch` variants are basically the same.
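
As a rough illustration of the worklist idea (a sketch under simplifying assumptions, not the implementation in the patch): the toy below treats every user of a divergent value as divergent and keeps pushing users until nothing changes. `ToyGraph` and `propagateDivergence` are hypothetical names. The real analysis re-evaluates each user (PHI nodes in particular) instead of marking it unconditionally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy def-use graph (hypothetical, not an LLVM type): Users[v] lists the
// values that use value v as an operand.
struct ToyGraph {
  std::vector<std::vector<std::size_t>> Users;
};

// Returns one flag per value; true means the value is divergent.
// Simplifying assumption: any use of a divergent value is divergent.
std::vector<bool> propagateDivergence(const ToyGraph &G,
                                      const std::vector<std::size_t> &Seeds) {
  std::vector<bool> Divergent(G.Users.size(), false);
  std::vector<std::size_t> Worklist;

  // Seed the worklist with the values known to be divergent.
  for (std::size_t S : Seeds) {
    assert(S < G.Users.size() && "seed out of range");
    if (!Divergent[S]) {
      Divergent[S] = true;
      Worklist.push_back(S);
    }
  }

  // Each value enters the worklist at most once, so this terminates.
  // No dominance information is consulted anywhere.
  while (!Worklist.empty()) {
    std::size_t V = Worklist.back();
    Worklist.pop_back();
    for (std::size_t U : G.Users[V]) {
      if (!Divergent[U]) {
        Divergent[U] = true;
        Worklist.push_back(U);
      }
    }
  }
  return Divergent;
}
```

The order in which values are visited depends only on the worklist, which is why the iteration in which a PHI's users are updated does not depend on whether the branch dominates the PHI.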

> If the branch is the back edge source we have to iterate once more to propagate the sync divergence to the loop header and body. Is this correct?

Let's talk about loops :)

join points inside loops
------------------------

In reducible loops all live threads re-converge at the loop header. That means we can compute the sync dependences from a branch inside the loop in a single pass: if we kept iterating, we wouldn't detect any new join points (inside the loop).
One detail: we don't need to revisit the header because the SDA works in a "push" mode (i.e. the header "tag" is updated whenever a predecessor is processed).

join points outside the loop of the branch
------------------------------------------

In our nomenclature, that kind of branch (a divergent branch that is the source of a back edge, provided its other successor is not post-dominated by the header) triggers a divergent loop exit.
Branches cause loop divergence if there is a disjoint path from the branch to a block outside the loop and another one back to the header.
So, a loop can well become divergent even though all immediate loop exiting branches are uniform (shown in the `hidden_loop_diverge` test of `test/Analysis/DivergenceAnalysis/AMDGPU/hidden_loopdiverge.ll`).
In that case, the entire loop becomes divergent (as far as the DA in this patch is concerned), meaning it can spew out divergent threads through all its loop exits.
That in turn means that there could be additional join points of divergent control outside the loop (as happens in the test case).
To that end, if a loop is first marked as divergent, we pretend it was a single node in the CFG with a divergent terminator (see `SDA::join_blocks(const Loop&)`, which does pretty much the same as its twin for `TerminatorInst`s).
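
A minimal sketch of the "loop as a single node" idea, again on a toy CFG and not taken from the patch: `ToyCFG` and `joinBlocksOfLoop` are hypothetical names. The sketch assumes the blocks outside the loop are indexed in topological order and never branch back into the loop. Every loop exit block gets its own label, just like the successors of a divergent terminator would.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy CFG (hypothetical, not an LLVM type): Succs[b] lists the successors of
// block b.
struct ToyCFG {
  std::vector<std::vector<std::size_t>> Succs;
};

// Treats the whole loop as one node with a divergent terminator whose
// successors are the loop exit blocks, and returns the join points outside
// the loop. Assumption: non-loop blocks are in topological index order.
std::set<std::size_t> joinBlocksOfLoop(const ToyCFG &G,
                                       const std::set<std::size_t> &LoopBlocks) {
  const std::size_t N = G.Succs.size();
  const std::size_t NoLabel = N;
  std::vector<std::size_t> Label(N, NoLabel);
  std::set<std::size_t> Joins;

  auto Push = [&](std::size_t S, std::size_t L) {
    if (Label[S] == NoLabel) {
      Label[S] = L; // first label to arrive
    } else if (Label[S] != L) {
      Joins.insert(S); // labels from two different exits meet here
      Label[S] = S;
    }
  };

  // Seed: every block reached by a loop exit edge gets its own label.
  for (std::size_t B : LoopBlocks) {
    assert(B < N && "loop block out of range");
    for (std::size_t S : G.Succs[B])
      if (!LoopBlocks.count(S))
        Push(S, S);
  }

  // Propagate through the blocks outside the loop only.
  for (std::size_t B = 0; B < N; ++B) {
    if (LoopBlocks.count(B) || Label[B] == NoLabel)
      continue;
    for (std::size_t S : G.Succs[B])
      if (!LoopBlocks.count(S))
        Push(S, Label[B]);
  }
  return Joins;
}
```

For a loop with two distinct exit blocks that both branch to a common block, that common block is reported as a join point outside the loop.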

Repository:
  rL LLVM

https://reviews.llvm.org/D50433