[PATCH] D50433: A New Divergence Analysis for LLVM

Alexander via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 24 09:28:47 PDT 2018


alex-t added a comment.



>> For given branch all the blocks where PHI nodes must be inserted belongs to the branch parent block dominance frontier.
> 
> The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
>  So, we would get imprecise results.

I'm not sure I understand you correctly...
I still don't see what you mean by "imprecise results". The assumption in the paper you referred to looks reasonable.
Let's say we have two disjoint paths from Entry to block X, and a use of some value V in X.

        Entry
       /   \
      A     E
    /  \     \
  B     C     F  [v0 = 1]
    \  /      |
     D       /
      \ _   /     
          X   v1 = PHI (1, F, undef, Entry)
  
  
        Entry__
       /       \
    A [v1 = 2]  E
   /  \         |
  B     C       F  [v0 = 1]
   \  /      __/
    D       /
      \ _  /     
        X v2 = PHI (1, F, 2, A)

Regardless of the definition in Entry, the divergent branch in Entry makes the PHI in X divergent.

Each of these two paths should contain a definition of V; it does not matter whether it is in the entry block or in one of its successors.
You either get a normal PHI or a PHI with an undef incoming value. How to handle undef is up to you:
you may conservatively consider any undef divergent, which makes your PHI divergent by data dependency.
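
A minimal sketch of that conservative policy (a hypothetical helper, not code from this patch):

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // A PHI with any undef incoming value is conservatively treated as
  // divergent by data dependency.
  static bool hasUndefIncoming(const PHINode &Phi) {
    for (const Value *In : Phi.incoming_values())
      if (isa<UndefValue>(In))
        return true;
    return false;
  }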

> Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Oh.. :)   Please forget about the set differences I mentioned; that was a mistake.
You were concerned about using non-filtered PDFs, since they could produce over-conservative results.
You probably meant not the plain PDF but the iterated PDF (PDF+):
PDF+(X) = PDF(X) JOIN ( PDF+(Y) for each Y in PRED(X) )
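
Spelled out as a sketch (my reading of that closure, not code from the patch; it assumes a precomputed per-block PDF map like the one filled in by the loop further below):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/CFG.h"
  using namespace llvm;

  using BlockSet = SmallPtrSet<const BasicBlock *, 4>;

  // PDF+(X): union of PDF(X) and PDF(Y) for every transitive predecessor Y.
  static BlockSet iteratedPDF(const BasicBlock *X,
                              const DenseMap<const BasicBlock *, BlockSet> &PDF) {
    BlockSet Result;
    SmallPtrSet<const BasicBlock *, 16> Visited;
    SmallVector<const BasicBlock *, 16> Worklist;
    Worklist.push_back(X);
    while (!Worklist.empty()) {
      const BasicBlock *Y = Worklist.pop_back_val();
      if (!Visited.insert(Y).second)
        continue;
      auto It = PDF.find(Y);
      if (It != PDF.end())
        Result.insert(It->second.begin(), It->second.end());
      for (const BasicBlock *Pred : predecessors(Y))
        Worklist.push_back(Pred);
    }
    return Result;
  }

For B in the last CFG below this yields {A, Entry}, matching the PDF+ values listed there.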

To illustrate further, the algorithm to compute the per-block PDF looks as follows
(the code is just a sketch to illustrate):

  // Map from each block X to its post-dominance frontier PDF(X): the
  // branch blocks X is control-dependent on.
  DenseMap<const BasicBlock *, SmallPtrSet<const BasicBlock *, 4>> PDF;
  for (const BasicBlock &B : F) {
    const TerminatorInst *T = B.getTerminator();
    if (T->getNumSuccessors() > 1) {
      for (const BasicBlock *Succ : successors(&B)) {
        // Walk up the post-dominator tree from each successor until we
        // reach B's immediate post-dominator; every node on the way is
        // control-dependent on B, so B belongs to its PDF.
        DomTreeNode *Runner =
            PDT->getPostDomTree().getNode(const_cast<BasicBlock *>(Succ));
        DomTreeNode *Sentinel =
            PDT->getPostDomTree().getNode(const_cast<BasicBlock *>(&B))
                ->getIDom();
        while (Runner && Runner != Sentinel) {
          PDF[Runner->getBlock()].insert(&B);
          Runner = Runner->getIDom();
        }
      }
    }
  }

I meant just the plain PDF - without the closure over all predecessors.

       Entry
     /       \
    A_____    E
   /       \   \
  B[v1 = 2] C  F  [v0 = 1]
   \  _____/ _/
    D       /
      \ _  /     
        X v2 = PHI (1, F, 2, A)

PDF(B) = {A},  PDF+(B) = {A, Entry}
PDF(F) = PDF+(F) = {Entry}

For the PHI in X we have two source blocks, B and F, so we only have to examine the branches in A and Entry.
If the second definition of V were in C instead of F, we would only look at the branch in A.
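
As a sketch (hypothetical helper, reusing the per-block PDF map from the code above): the branches to examine for a PHI are the PDF members of each incoming definition's defining block.

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  using BlockSet = SmallPtrSet<const BasicBlock *, 4>;

  // Collect the blocks whose terminators decide the PHI's incoming paths:
  // for each incoming definition, the PDF of its defining block.
  static void collectBranchBlocks(const PHINode &Phi,
                                  const DenseMap<const BasicBlock *, BlockSet> &PDF,
                                  BlockSet &Out) {
    for (const Value *In : Phi.incoming_values())
      if (const auto *Def = dyn_cast<Instruction>(In)) {
        auto It = PDF.find(Def->getParent());
        if (It != PDF.end())
          Out.insert(It->second.begin(), It->second.end());
      }
  }

For v2 in X above, with its incoming definitions in B and F, this collects {A, Entry}.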

For your example with a switch:

  +----A----+
  |    |    |
  C0   C1   D
   \   |   [..]
    \  |
     J

PDF(C0) = {A}
PDF(C1) = {A}

Let's say in J we have v2 = PHI(v0, A, v1, C0). We only need to examine A's terminator, because PDF(C0) = {A} and PDF(A) = {}.
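
With the hypothetical collectBranchBlocks helper sketched above, that comes out as:

  // Usage sketch (hypothetical: V2Phi is the PHI in J, v0 defined in A,
  // v1 defined in C0; PDF is the map from the sketch above):
  //   PDF(A)  = {}
  //   PDF(C0) = {A}
  BlockSet Branches;
  collectBranchBlocks(*V2Phi, PDF, Branches);
  // Branches == {A}: only the switch terminator in A needs a look.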

> Now, we could use the DF of DT as it's often done in SSA construction.
>  When propagating a definition at a block B in the SDA we could skip right to the fringes of the DF of block B; there won't be any joins within the DF of B.
>  That does not fundamentally alter the algorithm - we would iterate over the DF blocks of B instead of its immediate successors.. that's it.
>  In a way, we do this already when a definition reaches the loop header from outside the loop: in that case we skip right to the loop exits, knowing that the loop is reducible and so the loop header dominates at least all blocks inside the loop.
> 
>> I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.
> 
> Using DF could speed up the join point computation at the cost of pre-computing the DF for the region. It really comes down to the numbers. Are there any other users of DF in your pipeline that would offset its cost?
>  If not i would suggest planning lazy DF computation for a future revision of the DivergenceAnalysis in case SDA performance should ever become an issue.




Repository:
  rL LLVM

https://reviews.llvm.org/D50433




