<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 25, 2017 at 11:22 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000"><div><div class="m_-4421022136755295261h5">
    <p><br>
    </p>
    <div class="m_-4421022136755295261m_-6247637909381991151moz-cite-prefix">On 03/25/2017 10:57 AM, Daniel Berlin
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr"><br>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On Sat, Mar 25, 2017 at 5:28 AM, Hal
            Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF"><span class="m_-4421022136755295261m_-6247637909381991151gmail-">
                  <p><br>
                  </p>
                  <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482moz-cite-prefix">On
                    03/24/2017 06:00 PM, Daniel Berlin wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr"><br>
                      <div class="gmail_extra"><br>
                        <div class="gmail_quote">On Fri, Mar 24, 2017 at
                          3:44 PM, Daniel Berlin <span dir="ltr"><<a href="mailto:dberlin@dberlin.org" target="_blank">dberlin@dberlin.org</a>></span>
                          wrote:<br>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div dir="ltr">In which i repeat my claim
                              that trying to keep caches up to date is
                              significantly harder to get right and
                              verify than incremental recomputation :)</div>
                          </blockquote>
                          <div><br>
                          </div>
                          <div>and to be super clear here, the practical
                            difference between the two is that
                            incremental computation is really "build an
                            analysis that can be run at any point, but
                            one that is a full analysis computing all
                            answers, not answering on demand. The speed
                            at which the analysis runs depends on how
                            much data has changed since last time.".</div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> How exactly do you propose this works in this
                case (given that the recomputation cost is proportional
                to the amount the IR has changed)? </div>
            </blockquote>
            <div><br>
            </div>
            <div>As i mentioned, first i would actually measure whether
              we really need to be recomputing that often *at all*.</div>
            <div>There is no point in either incrementally recomputing
              *or* computing on demand if it is not actually buying us
              significant generated code performance.</div>
            <div>The simplest thing to verify correct is the one that
              does not need to be updated at all :)</div>
            <div><br>
            </div>
            <div>Second, yes, the recomputation cost is proportional to
              the amount of change in either case.</div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">It sounds like caching with the
                default behavior on RAUW flipped. </div>
            </blockquote>
            <div>In this case, it would be similar.</div>
            <div>Your caching, right now, has the following properties
              that are maybe easily changed:<br>
            </div>
            <div>It is not a full analysis, it must be handled
              explicitly</div>
            <div>It has no verifier (you would think you could write one
              by comparing every cached value against a freshly
              computed, uncached result, but i think you will discover
              that does not work either; see below)</div>
            <div>It has no printer.</div>
            <div>The above are all things you suggested might be fixed
              anyway, so i'm going to leave them out except
              verification.</div>
            <div><br>
            </div>
            <div>Past that, the other explicit difference here is the
              approach to computation.<br>
            </div>
            <div><br>
            </div>
            <div>On-demand cached computation here starts at "the thing
              i am missing data for" and goes walking backwards through
              the IR to try to find the answer.</div>
            <div>It does not pay attention to computation order, etc.</div>
            <div>You start at the leaves, you end at the root.</div>
            <div>That means it often does significant duplicated work
              (even if that work is only wasted cache lookups and
              revisits), or provably does not get a maximal fixpoint
              when making use of the cache.</div>
            <div><br>
            </div>
            <div>In this case, it is both ;)</div>
            <div><br>
            </div>
            <div>For example, for an operator, it computes the known
              bits for the LHS and RHS recursively.</div>
            <div>Without caching, it computes them repeatedly, visiting
              from use->def.</div>
            <div><br>
            </div>
            <div>Incremental recomputation would compute known-bits as a
              forwards propagation problem over the function. </div>
            <div>You start at the root, you end at the leaves.</div>
            <div>When you visit an operator, the LHS and RHS would
              already have been visited (if they are not part of a
              cycle), and thus you compute it *once*.</div>
            <div>This requires ordering the computation and propagation
              properly, mind you.</div>
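            <div>As a concrete (and heavily simplified) sketch of that
              forward direction, here is a toy 8-bit known-zero-bits
              propagation. All names here are invented for illustration;
              this is not LLVM's ValueTracking API. Values are visited
              in def-before-use (topological) order, so every operand's
              bits are final before any user looks at them, and each
              value is computed exactly once.</div>

```python
# Hypothetical sketch (invented names, not LLVM's API): known bits as a
# forward propagation problem over an acyclic def-use graph.  Visiting
# in def-before-use order means each value is computed exactly once.

from graphlib import TopologicalSorter

WIDTH = 8
ALL = (1 << WIDTH) - 1

def known_zero(op, args, operand_bits):
    # Returns a mask of bits known to be zero (toy 8-bit lattice).
    if op[0] == "const":
        return ~op[1] & ALL
    if op[0] == "and":
        # A bit of (a & b) is zero if it is zero in either operand.
        return operand_bits[args[0]] | operand_bits[args[1]]
    if op[0] == "shl":
        # Shifting left by k zeroes the low k bits and shifts the
        # operand's known-zero mask up by k.
        k = op[1]
        return ((operand_bits[args[0]] << k) | ((1 << k) - 1)) & ALL
    return 0  # unknown op: nothing known

# Roughly: %x unknown; %and = and i8 %x, 16; %shl = shl i8 %and, 3
ops  = {"x": ("arg",), "const16": ("const", 16),
        "and": ("and",), "shl": ("shl", 3)}
uses = {"x": [], "const16": [],
        "and": ["x", "const16"], "shl": ["and"]}

bits = {}
for v in TopologicalSorter(uses).static_order():  # defs before uses
    bits[v] = known_zero(ops[v], uses[v], bits)

print(bin(bits["shl"]))   # 0b1111111: only bit 7 can be nonzero
```

            <div>Each value was visited once, in order; no query ever
              had to recurse backwards or give up at a depth limit.</div>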
            <div><br>
            </div>
            <div>Second, the caching version does not reach a maximal
              fixpoint:</div>
            <div>
              <div>        // Recurse, but cap the recursion to one
                level, because we don't</div>
              <div>        // want to waste time spinning around in
                loops.</div>
            </div>
            <div>This is because it is working backwards, on demand, and
              so phi computation sometimes cannot reach a fixpoint in a
              sane amount of time, so it gets cut off.<br>
            </div>
            <div>Solving the forwards problem properly, in the correct
              order, will *always* give the same answer from the same
              starting point, given the same set of invalidation.<br>
            </div>
            <div><br>
            </div>
            <div>Note that this in turn affects verification.  Because
              the on-demand version has cut-offs that effectively depend
              on visitation depth, the cached version will get different
              answers depending on how much is cached, because it will
              need to go less deep to get a specific answer.</div>
            <div>This depends on visitation order.</div>
            <div>This, in turn, makes it really hard to write a verifier
              :)  You can't just compare cached vs non-cached.  The
              cached version may get better or worse answers depending
              on visitation order (incremental recomputation never does
              in this case, assuming you invalidate properly in both
              cases).<br>
            </div>
            <div><br>
            </div>
            <div>I'm pretty sure *you* get this, but for those following
              along:<br>
              Imagine you have a phi</div>
            <div>phi(a, b, c, d)</div>
            <div><br>
            </div>
            <div>Currently, while walking this phi, it will only recurse
              a certain depth to find an answer.  Let's pretend it's
              depth 1.</div>
            <div>Let's say that, given the initial order in which you
              call compute demanded bits (you call it on the phi before
              a, b, c, d), they are all too deep for it to find an
              answer; it can't visit deep enough to compute a, b, c and
              d</div>
            <div><br>
            </div>
            <div>Thus, if you start from *nothing*, the answer you get
              for the phi is:<br>
              "unknown".</div>
            <div>Now, computing the phi is separate from computing the
              answer for a, b, c, d.</div>
            <div>So next, you visit a, b, c, and d, and compute their
              answers.</div>
            <div>It can successfully do so (because the depth limit
              applies to the phi, and even if it didn't, the depth limit
              for these guys is one more than it was from the phi)</div>
            <div><br>
            </div>
            <div>Now, if you invalidate the phi in the cache (assume
              it's just an innocent victim and the value did not really
              change, it just "may have"), and ask it to compute the
              answer:<br>
              <br>
            </div>
            <div>lo and behold, now it has the answers for a, b, c and
              d, and can give you an answer for the phi.</div>
            <div><br>
            </div>
            <div>So the cached version gave us a better answer.<br>
            </div>
            <div><br>
            </div>
            <div>The real reason is that it got closer to the maximal
              fixpoint than the original version.</div>
            <div><br>
              But still, it is much harder to verify, because the answer
              really is allowed to vary depending on the order in which
              you ask it for answers.</div>
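            <div>The phi story above can be reproduced in a few lines.
              This is a deliberately tiny model (all of it invented,
              nothing like the real ValueTracking code): a depth-capped,
              memoized "is this value known?" query whose answer for the
              phi depends on whether its operands happened to be queried
              (and cached) first.</div>

```python
# Toy model (invented, not the real ValueTracking code) of why a cached,
# depth-limited, on-demand analysis is order-dependent: the same query
# on the same IR returns different answers depending on what happens to
# already be in the cache.

MAX_DEPTH = 1   # mirror the "cap the recursion to one level" cutoff

def known(value, graph, cache, depth=0):
    # Toy query: is `value` fully resolvable?  Only resolved ("known")
    # answers are cached; an "unknown" caused by the depth cutoff says
    # nothing about the value, only about how deep this query looked.
    if value in cache:
        return cache[value]            # a cache hit costs no depth
    if depth > MAX_DEPTH:
        return "unknown"               # recursion cap reached
    if not graph[value]:               # a leaf: trivially known
        cache[value] = "known"
        return "known"
    if all(known(op, graph, cache, depth + 1) == "known"
           for op in graph[value]):
        cache[value] = "known"
        return "known"
    return "unknown"

# phi(a, b, c, d), where each operand is one hop away from a leaf.
graph = {"phi": ["a", "b", "c", "d"],
         "a": ["la"], "b": ["lb"], "c": ["lc"], "d": ["ld"],
         "la": [], "lb": [], "lc": [], "ld": []}

cold = known("phi", graph, {})         # operands too deep: "unknown"

cache = {}
for v in ("a", "b", "c", "d"):         # visit the operands first...
    known(v, graph, cache)
warm = known("phi", graph, cache)      # ...and the same query resolves

print(cold, warm)   # unknown known
```

            <div>Re-asking about the phi once the operands have been
              visited is exactly the "lo and behold" step: the cache,
              not the IR, changed the answer, which is what makes a
              cached-vs-uncached verifier so hard to write.</div>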
            <div><br>
            </div>
            <div>The forwards incremental computation has exactly zero
              of these problems.</div>
            <div>It always computes the same answer.</div>
            <div>It only varies the time taken to do so.</div>
            <div><br>
            </div>
            <div> <br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">Instead of assuming by default that
                the replacement instruction has the same known bits as
                the original, we assume that it does not (and then
                recursively clear out all users).<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>If they were computed in the same direction, with the
              same propagation  strategy, yes, i would agree.</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br></div></div>
    In general, I agree with all of this. To be clear, I posted this
    patch so that we could make some measurements of InstCombine with
    the known-bits computation cost stabilized. </div></blockquote><div>Of course. I think the work is very valuable. I was more opposed to the immediate response i saw from some others of "let's just take this and push it in because it's our only real option".</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">Given that we compute
    known bits on essentially every relevant instruction, doing this
    ahead of time makes perfect sense (i.e. aside from all of the
    disadvantages, we likely gain very little from the laziness).<br></div></blockquote><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    <br>
    I do want to explicitly clarify the following point, however. In the
    analysis as you imagine it, what exactly do we do when we replace an
IR value (or update an instruction's flags)?<br></div></blockquote><div>In the current world, your only option is to have the optimization pass notify the analysis, explicitly, through a call, that a piece of IR "X" got invalidated.</div><div>We could build something else, but i'd start there anyway, because the exact mechanism through which the analysis learns of invalidation doesn't change what it does internally.</div><div><br></div><div>Once it gets the notification, it has two options:</div><div>1. It throws it on a list, and the next time you ask it for a result, it does the recomputation (which presents the same interface as on-demand, but isn't really the same internally).</div><div>2. It does the recomputation immediately.</div><div><br></div><div>Both are valid options.</div><div><br></div><div>It's trickier if you *also* invalidate the CFG, but that's handleable too.</div><div>i'm going to assume we aren't invalidating the CFG here for a second.</div><div><br></div><div>In practice, there is not a ton of difference between the two internally.</div><div>In *either* case, you shove the names of the pieces of IR that got invalidated onto a list, grow the list to the bounds of what needs to be recomputed, and do it.</div><div><br></div><div>For an SSA-based analysis that is single-pass, you can just invalidate the users.</div><div>For an SSA-based analysis that is optimistic/iterative, the truly easy solution is to convert it to use SCCs of the def-use graph, which should be amazingly trivial: if you could get a maximal result based on def-use propagation, you will get a maximal result by iterating SCCs.</div><div><br></div><div>1. Form SCCs of the SSA graph; Tarjan's algorithm will give you a topological ordering.</div><div>For each SCC in topo order, iterate your analysis on the SCC's values until it stops changing.</div><div>This is guaranteed to give a maximal result, and the condensed graph of SCCs is acyclic.</div><div><br></div><div>2. 
When something is invalidated, you invalidate the results for the SCC and reiterate the SCC.</div><div>If the result ends up the same as it was before, you are done.</div><div>If it changes, you invalidate the SCCs that use values from the current SCC (using the topo info), and recompute them.</div><div>Keep going along each path of using SCCs in that order until the result stops changing along each path.</div><div>When you run out of paths, you are done.</div><div><br></div><div><br></div><div>This is, of course, not the only way, and not even guaranteed to be the fastest; it just makes the analysis easy to do and verify, and is usually quite fast.</div><div><br></div><div>Note: if your value change modified the SCC structure, that too can be updated fairly easily.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    Thanks again,<br>
    Hal<div><div class="m_-4421022136755295261h5"><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div> <br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF"> <br>
                 -Hal<span class="m_-4421022136755295261m_-6247637909381991151gmail-"><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div class="gmail_extra">
                        <div class="gmail_quote">
                          <div>The compile time hit depends on how often
                            you want answers.  Usually people try to say
                            "i want them always up to date" (often
                            without any quantification of cost or other
                            trade-offs), which leads to trying to do it
                            on-demand</div>
                          <div><br>
                          </div>
                          <div>Caching is really  "build something that
                            is always computed on demand.  Try to make
                            it faster by storing the answers it gives
                            somewhere, and hoping that the rest of the
                            algorithm retains the ability to give
                            correct answers on demand when you remove
                            some of the stored answers".</div>
                          <div><br>
                          </div>
                          <div>One can obviously argue that incremental
                            recomputation is a form of caching, so i'm
                            trying to be concrete about my definitions.</div>
                          <div><br>
                          </div>
                          <div>Usually, caching additionally requires
                            that clients be able to correctly predict
                            and remove the set of answers necessary for
                            correctness (i.e., know enough about how the
                            algorithm works), but this isn't strictly
                            the case if designed well.</div>
                          <div><br>
                          </div>
                          <div>Caching also often ends up with degraded
                            answers over time, while most full analyses
                            can be made to not do so.  See, for example,
                            memdep, which has unresolvable cache
                            invalidation issues.  This means, for
                            example, opt -gvn -gvn gives different
                            answers than opt -gvn | opt -gvn.</div>
                          <div><br>
                          </div>
                          <div>Both approaches obviously require
                            information about what has changed in the
                            IR.</div>
                          <div><br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div dir="ltr">
                              <div> I additionally claim that we should
                                actually quantify the performance
                                benefits of doing either over
                                "recomputing once or twice during the
                                pipeline" before going down this route.
                                <div><br>
                                </div>
                                <div>We are *already* trading off
                                  compile time performance vs generated
                                  code performance. </div>
                                <div><br>
                                </div>
                                <div>All other things being equal, we
                                  should trade it off in a way that
                                  makes maintenance of code and
                                  verification of correctness as easy as
                                  possible.</div>
                                <div><br>
                                </div>
                                <div>Asserting that a good tradeoff to
                                  reduce compile time is "cache" instead
                                  of "stop doing it on demand" (or
                                  incrementally recompute), without any
                                  data other than compile time
                                  performance seems ... not particularly
                                  scientific.</div>
                                <div><br>
                                </div>
                                <div>It's certainly true that caching is
                                  often posited as "easier", but i think
                                  a trip through bugzilla would put this
                                  idea to rest.</div>
                                <div><br>
                                </div>
                                <div><br>
                                </div>
                              </div>
                            </div>
                            <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482HOEnZb">
                              <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482h5">
                                <div class="gmail_extra"><br>
                                  <div class="gmail_quote">On Fri, Mar
                                    24, 2017 at 3:33 PM, Craig Topper
                                    via Phabricator <span dir="ltr"><<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>></span>
                                    wrote:<br>
                                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">craig.topper
                                      added a comment.<br>
                                      <br>
                                      I'm seeing more problems than just
                                      nsw/nuw flags here.<br>
                                      <br>
                                      sext.ll test is failing because
                                      SimplifyDemandedInstructionBits
                                      determined that this<br>
                                      <br>
                                      %and = and i32 %x, 16<br>
                                      <br>
                                        shl i32 %and, 27<br>
                                      <br>
                                      Simplified to just the shl because
                                      we were only demanding the MSB of
                                      the shift. This occurred after we
                                      had cached the known bits for the
                                      shl as having 31 lsbs as 0. But
                                      without the "and" in there we can
                                      no longer guarantee the lower bits
                                      of the shift result are 0.<br>
                                      <br>
                                      I also got a failure on shift.ll
                                      not reported here. This was caused
                                      by getShiftedValue rewriting
                                      operands and changing constants of
                                      other shifts. This effectively
                                      shifts the known bits, and thus
                                      the cache needs to be
                                      invalidated.<br>
                                      <br>
                                      <br>
                                      <a href="https://reviews.llvm.org/D31239" rel="noreferrer" target="_blank">https://reviews.llvm.org/D31239</a><br>
                                      <br>
                                      <br>
                                      <br>
                                    </blockquote>
                                  </div>
                                  <br>
                                </div>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                        <br>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span><span class="m_-4421022136755295261m_-6247637909381991151gmail-HOEnZb"><font color="#888888">
                    <pre class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
                  </font></span></div>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="m_-4421022136755295261m_-6247637909381991151moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </div></div></div>

</blockquote></div><br></div></div>