<div dir="ltr">To give a concrete example, let's trace NewGVN. However, This applies to literally any rapid forward dataflow problem trying to evaluate the SSA graph in RPO order (IE from defs to uses).  I'm going to highly contrive an example to make it come out right :)<div><br></div><div>Code:<br><br></div><div><br></div><div><br></div><div>entry:</div><div><br>t1 = 50</div><div>br ? block1, block10000</div><div><br></div><div>block1:<br>t2 = phi [undef, entry], [aa, block10000]</div><div>aa = phi [undef, entry], [t1, block10000]</div><div><br></div><div>t3 = t2 + 5</div><div>t4 = t3 - 7</div><div>...</div><div><10000 more blocks></div><div>block10000:<br>jump block1<br></div><div><br></div><div>The proper RPO ordering, defs to uses, of SSA values is </div><div>t1</div><div>aa</div><div>t2</div><div>t3</div><div>t4</div><div>....</div><div><br></div><div>The ordering we will get if we just walk blocks + instructions, without sorting phi nodes explicitly above is</div><div><br></div><div>t1</div><div>t2</div><div>aa</div><div>t3</div><div>t4</div><div>...</div><div><br></div><div><br></div><div>Let's go with the latter ordering for the moment.  It is not a proper RPO ordering of the SSA graph (because the def-use sequence must include the postorder sequence t3 t2 aa, and so the RPO must include aa t2 t3)</div><div><br></div><div>start: everything is in class initial.</div><div><br></div><div>Process t1 - we set the value of t1 to 50, mark t2 for processing</div><div>Process jump, mark block 1 and 10000 reachable, mark aa, t2 for processing</div><div>process t2 - 2 incoming edges reachable, value numbers to aa. AA is in class initial, so stays in congruence class initial, mark aa, t3 for processing</div><div>process aa - 2 incoming edges reachable, value numbers to 50, congruent to t1, t2 marked for reprocessing</div><div>t3 value numbers to aa + 5, new congruence class created, mark t4 for processing</div><div><10000 other things processed><br>iteration 1 done</div><div>...</div><div>t2 reprocessed, 2 incoming edges reachable, value numbers to 50, congruent to t1, t3 marked for processing.</div><div>t3 reprocessed, value numbers to 55, new congruence class created, t4 marked for processing </div><div>....</div><div><br></div><div><br></div><div>You now performed 1 extra iteration, and 10000 useless evaluations.</div><div><br></div><div>In the former ordering, you have :<br><br></div><div><div>Process t1 - we set the value of t1 to 50, mark t2 for processing</div><div>Process jump, mark block 1 and 10000 reachable, mark aa, t2 for processing</div><div>process aa - 2 incoming edges reachable, value numbers to 50, congruent to t1,mark t2 for processing</div><div>process t2 - 2 incoming edges reachable, value numbers to 50, mark t3 for processing</div><div>t3 processed, value numbers to 55, new congruence class created, ....</div><div><br></div><div>....</div><div><br></div><div>You finish in one iteration, no useless evaluations</div><div><br></div></div><div>Obviously these are very very contrived examples, but hopefully it's visible what the issue is.<br></div><div>In practice, there are cases where this issue causes extra iterations, sometimes into the hundreds once you get through all the dependencies. But even where it takes only a few extra iterations, it can also cause many many thousands of useless evaluations.</div><div>Not just for NewGVN, but for example, for SCCP as well.<br></div><div><br></div><div>All it would take to fix is to require that, for phi nodes in the same block,either they must be dependent on each other, or the "defining" must appear before the "using" phi node.</div><div><br></div><div>That is, our current defs before uses verification includes phi nodes, and gives a pass to those phi nodes that are mutually dependent.</div><div><br></div><div>I could also make NewGVN, SCCP, etc just sort them this way.</div><div><br></div><div>But it is an issue in practice in larger testcases :)</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Dec 28, 2016 at 8:10 PM, Daniel Berlin <span dir="ltr"><<a href="mailto:dberlin@dberlin.org" target="_blank">dberlin@dberlin.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Wed, Dec 28, 2016 at 8:09 PM, Daniel Berlin <span dir="ltr"><<a href="mailto:dberlin@dberlin.org" target="_blank">dberlin@dberlin.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="m_2716168262506064709h5">On Wed, Dec 28, 2016 at 7:22 PM, Friedman, Eli <span dir="ltr"><<a href="mailto:efriedma@codeaurora.org" target="_blank">efriedma@codeaurora.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  

    

  

  <div bgcolor="#FFFFFF"><span class="m_2716168262506064709m_5578209831126250579gmail-">

    <div class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577moz-cite-prefix">On 12/28/2016 2:33 PM, Daniel Berlin

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr"><br>

        <div class="gmail_extra"><br>

          <div class="gmail_quote">On Wed, Dec 28, 2016 at 1:18 PM,

            Friedman, Eli <span dir="ltr"><<a href="mailto:efriedma@codeaurora.org" target="_blank">efriedma@codeaurora.org</a>></span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

              <div bgcolor="#FFFFFF"><span class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-">

                  <div class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-m_-4041176510637431261moz-cite-prefix">On

                    12/28/2016 1:03 PM, Daniel Berlin via llvm-commits

                    wrote:<br>

                  </div>

                  <blockquote type="cite">

                    <div dir="ltr"><br>

                      <div class="gmail_extra"><br>

                        <div class="gmail_quote">On Wed, Dec 28, 2016 at

                          7:04 AM, Davide Italiano via Phabricator <span dir="ltr"><<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>></span>

                          wrote:<br>

                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">davide

                            accepted this revision.<br>

                            davide added a comment.<br>

                            This revision is now accepted and ready to

                            land.<br>

                            <br>

                            Sorry for the slow response, I'm out('ish)

                            of the office these days. I took a close

                            look at your patch.<br>

                          </blockquote>

                          <div><br>

                          </div>

                          <div>No worries.</div>

                          <div> </div>

                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> I happen

                            to be lucky enough to hit a case in the wild

                            where this already matters.  The number of

                            iteration goes down from hundreds to ~10,

                            which makes compile time/me happier.<br>

                          </blockquote>

                          <div><br>

                          </div>

                          <div>yay.</div>

                          <div><br>

                          </div>

                          <div>The current code, excepting super-weird

                            cases, should operate in O(d+3) iterations,

                            where d is the loop connectedness of the SSA

                            graph (not the CFG), which is the number of

                            backedges in any path. This will change when

                            we move to equality propagation, but for

                            now, ...</div>

                          <div>We could  calculate this number and see

                            if we are screwing up :)<br>

                            <br>

                          </div>

                          <div>For most programs, the loop connectedness

                            of the SSA graph is the same or less than

                            the CFG.</div>

                          <div><br>

                          </div>

                          <div>However,  IIRC, we allow  dependent phis

                            in the same block (this is not strictly SSA,

                            since all phi nodes are supposed to be

                            evaluated simultaneously).</div>

                        </div>

                      </div>

                    </div>

                  </blockquote>

                  <br>

                </span> I'm not sure what you're trying to say here? 

                PHI nodes for a given basic block are evaluated

                simultaneously. From LangRef: "<span>For the purposes of

                  the SSA form, the use of each incoming value is deemed

                  to occur on the edge from the corresponding

                  predecessor block to the current block (but after any

                  definition of an ‘</span><code class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-m_-4041176510637431261docutils m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-m_-4041176510637431261literal"><span class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-m_-4041176510637431261pre">invoke</span></code><span>‘

                  instruction’s return value on the same edge)."</span><span class="m_2716168262506064709m_5578209831126250579gmail-m_-5492007076233974577gmail-HOEnZb"><font color="#888888"><br>

                    <br>

                  </font></span></div>

            </blockquote>

            <div><br>

            </div>

            <div>I'm saying we've had mailing list arguments about this,

              about whether there is any ordering among phi nodes in a

              given block.  The part you quote from the langref does not

              actually definitively answer that (again, there is no

              argument in theory.  In the abstract, the answer is "there

              is no ordering, it's undefined to have phis depend in the

              same block depend on each other")</div>

            <div><br>

            </div>

            <div>Given</div>

            <div>

              <div>b = phi(d, e)</div>

              <div>a = phi(b, c)</div>

            </div>

            <div><br>

            </div>

            <div>Saying "is deemed to occur on the edge of the

              corresponding predecessor block" does not help.</div>

          </div>

        </div>

      </div>

    </blockquote>

    <br></span>

    Consider the following function:<br>

    <br>

    void f(int a, int b, int g(int, int)) {<br>

      while (g(a, b)) { int x = a; a = b; b = x; }<br>

    }<br>

    <br>

    mem2reg produces this:<br>

    <br>

    define void @f(i32 %a, i32 %b, i32 (i32, i32)* %g) #0 {<br>

    entry:<br>

      br label %while.cond<br>

    <br>

    while.cond:                   <wbr>                    ; preds =

    %while.body, %entry<br>

      %a.addr.0 = phi i32 [ %a, %entry ], [ %b.addr.0, %while.body ]<br>

      %b.addr.0 = phi i32 [ %b, %entry ], [ %a.addr.0, %while.body ]<br>

      %call = call i32 %g(i32 %a.addr.0, i32 %b.addr.0)<br>

      %tobool = icmp ne i32 %call, 0<br>

      br i1 %tobool, label %while.body, label %while.end<br>

    <br>

    while.body:                   <wbr>                    ; preds =

    %while.cond<br>

      br label %while.cond<br>

    <br>

    while.end:                    <wbr>                    ; preds =

    %while.cond<br>

      ret void<br>

    }<br>

    <br>

    A "phi" works in the only way which allows this IR to match the

    semantics of the C code.<br></div></blockquote><div><br></div></div></div><div>I'm not sure i believe this, but i actually don't care enough to argue about it further :)<br>That said, i'm curious:<br><br class="m_2716168262506064709m_5578209831126250579gmail-Apple-interchange-newline">What are the expected semantics of the second case i presented?<br> </div><span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF">

    <br>

    If you think LangRef isn't clear, suggestions are welcome.</div></blockquote></span><div>I would be explicit that phi nodes may depend on each other, and what the expected evaluation order actually is (if it's "as-they-appear-in-IR", say that.)</div><div>It looks something like "dependencies are allowed, any order is allowed despites dependencies".</div><span><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><span class="m_2716168262506064709m_5578209831126250579gmail-"><br>

    <br>

    <blockquote type="cite">

      <div>Popping back up, regardless of resolution, this causes the

        issue i mentioned above - it may require more iterations to

        resolve because of the second case passing verification.  If we

        really want phi nodes to be executable , and want it to take the

        minimum number of iterations to converge NewGVN, we need to

        process aa before t1.</div>

      <div><br>

      </div>

      <div>Otherwise, we will process t1, get some value, *then* process

        aa, and immediately mark t1 as needing to be reprocessed since

        it is a use of aa.  We effectively waste an iteration because

        all of t1's uses have are going to have the wrong value.</div>

    </blockquote>

    <br></span>

    It looks like NewGVN creates one less congruence class if you

    process them in the "right" order. </div></blockquote><div><br></div></span><div>Well, no. It's not about creating the congruence class, it's about making it a reverse post-order evaluation .  In your example above, there is no single RPO order due to the two cycles.</div><div>In my example, any valid  RPO order of the SSA graph must visit aa before t1.</div><div>This is *normally* taken care of by evaluating instructions in block order, and ordering the CFG in RPO.   But in the case i gave, it is not enough. </div><div>You would have to explicitly sort the phi nodes separately, since we allow both orderings of the phi nodes in LLVM,despite only one being RPO (and only one having defs dominate uses).</div><span><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"> I'm not sure there's any way to

    usefully generalize that heuristic, though; you're only saving time

    based on discovering the cycle one step faster.<span class="m_2716168262506064709m_5578209831126250579gmail-"><br>

    <br></span></div></blockquote></span><div>I'm not sure what you are trying to say here.</div><div>You are saving time by not performing iterations that are unnecessary.  The generalized heuristic is exactly "perform this problem by evaluating the SSA graph/CFG in RPO order".  This is provable, it's the notion of a "rapid" problem.</div></div></div></div></blockquote><div><br></div></div></div><div>s/notion/number of iterations it takes to compute a/ </div><span class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>If you placed my aa/t1 example in the first block of a 10000 block function, you will process 10000 blocks (1 iteration) uselessly.</div></div></div></div></blockquote><div><br></div></span><div>(assuming all values are dependent on previous values, ...)</div><span class=""><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>If you sort it into a valid RPO order of the SSA graph, you will not.</div><div><br></div><div>This generalizes to *any* "rapid" problem.</div><div><br></div></div></div></div>

</blockquote></span></div><br></div></div>

</blockquote></div><br></div>