<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 25, 2017 at 11:22 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000"><div><div class="m_-4421022136755295261h5">
    <p><br>
    </p>
    <div class="m_-4421022136755295261m_-6247637909381991151moz-cite-prefix">On 03/25/2017 10:57 AM, Daniel Berlin
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr"><br>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">On Sat, Mar 25, 2017 at 5:28 AM, Hal
            Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
            wrote:<br>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF"><span class="m_-4421022136755295261m_-6247637909381991151gmail-">
                  <p><br>
                  </p>
                  <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482moz-cite-prefix">On
                    03/24/2017 06:00 PM, Daniel Berlin wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr"><br>
                      <div class="gmail_extra"><br>
                        <div class="gmail_quote">On Fri, Mar 24, 2017 at
                          3:44 PM, Daniel Berlin <span dir="ltr"><<a href="mailto:dberlin@dberlin.org" target="_blank">dberlin@dberlin.org</a>></span>
                          wrote:<br>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div dir="ltr">In which i repeat my claim
                              that trying to keep caches up to date is
                              significantly harder to get right and
                              verify than incremental recomputation :)</div>
                          </blockquote>
                          <div><br>
                          </div>
                          <div>and to be super clear here, the practical
                            difference between the two is that
                            incremental computation is really "build an
                            analysis that can be run at any point, but
                            one that is a full analysis computing all
                            answers, not answering on demand. The speed
                            at which the analysis runs depends on how
                            much data has changed since last time.".</div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> How exactly do you propose this works in this
                case (given that the recomputation cost is proportional
                to the amount the IR has changed)? </div>
            </blockquote>
            <div><br>
            </div>
            <div>As i mentioned, first i would actually measure whether
              we really need to be recomputing that often *at all*.</div>
            <div>There is no point in either incrementally recomputing
              *or* computing on demand if it is not actually buying us
              significant generated code performance.</div>
            <div>The simplest thing to verify correct is the one that
              does not need to be updated at all :)</div>
            <div><br>
            </div>
            <div>Second, yes, the recomputation cost is proportional to
              the amount of change in either case.</div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">It sounds like caching with the
                default behavior on RAUW flipped. </div>
            </blockquote>
            <div>In this case, it would be similar.</div>
            <div>Your caching, right now, has the following properties
              that are maybe easily changed:<br>
            </div>
            <div>It is not a full analysis, it must be handled
              explicitly</div>
            <div>It has no verifier (you would think you could write one
              by comparing every cached value against a freshly
              computed, uncached result, but i think you will discover
              that does not work either; see below)</div>
            <div>It has no printer.</div>
            <div>The above are all things you suggested might be fixed
              anyway, so i'm going to leave them out except
              verification.</div>
            <div><br>
            </div>
            <div>Past that, the other explicit difference here is the
              approach to computation.<br>
            </div>
            <div><br>
            </div>
            <div>On-demand cached computation here starts at "the thing
              i am missing data for" and goes walking backwards through
              the IR to try to find the answer.</div>
            <div>It does not pay attention to computation order, etc.</div>
            <div>You start at the leaves, you end at the root.</div>
            <div>That means it often does significant duplicated work
              (even if that work is only wasted cache lookups and
              revisits), or provably does not get a maximal fixpoint
              when making use of the cache.</div>
            <div><br>
            </div>
            <div>In this case, it is both ;)</div>
            <div><br>
            </div>
            <div>For example, for an operator, it computes the known
              bits for the LHS and RHS recursively.</div>
            <div>Without caching, it computes them repeatedly, visiting
              from use->def.</div>
            <div><br>
            </div>
            <div>Incremental recomputation would compute known-bits as a
              forwards propagation problem over the function. </div>
            <div>You start at the root, you end at the leaves.</div>
            <div>When you visit an operator, the LHS and RHS would
              already have been visited (if they are not part of a
              cycle), and thus you compute it *once*.</div>
            <div>This requires ordering the computation and propagation
              properly, mind you.</div>
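            <div>As a concrete (and heavily simplified) sketch of that
              forward direction, here is a toy 8-bit known-zero-bits
              propagation. All names here are invented for illustration;
              this is not LLVM's ValueTracking API. Values are visited
              in def-before-use (topological) order, so every operand's
              bits are final before any user looks at them, and each
              value is computed exactly once.</div>

```python
# Hypothetical sketch (invented names, not LLVM's API): known bits as a
# forward propagation problem over an acyclic def-use graph.  Visiting
# in def-before-use order means each value is computed exactly once.

from graphlib import TopologicalSorter

WIDTH = 8
ALL = (1 << WIDTH) - 1

def known_zero(op, args, operand_bits):
    # Returns a mask of bits known to be zero (toy 8-bit lattice).
    if op[0] == "const":
        return ~op[1] & ALL
    if op[0] == "and":
        # A bit of (a & b) is zero if it is zero in either operand.
        return operand_bits[args[0]] | operand_bits[args[1]]
    if op[0] == "shl":
        # Shifting left by k zeroes the low k bits and shifts the
        # operand's known-zero mask up by k.
        k = op[1]
        return ((operand_bits[args[0]] << k) | ((1 << k) - 1)) & ALL
    return 0  # unknown op: nothing known

# Roughly: %x unknown; %and = and i8 %x, 16; %shl = shl i8 %and, 3
ops  = {"x": ("arg",), "const16": ("const", 16),
        "and": ("and",), "shl": ("shl", 3)}
uses = {"x": [], "const16": [],
        "and": ["x", "const16"], "shl": ["and"]}

bits = {}
for v in TopologicalSorter(uses).static_order():  # defs before uses
    bits[v] = known_zero(ops[v], uses[v], bits)

print(bin(bits["shl"]))   # 0b1111111: only bit 7 can be nonzero
```

            <div>Each value was visited once, in order; no query ever
              had to recurse backwards or give up at a depth limit.</div>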
            <div><br>
            </div>
            <div>Second, the caching version does not reach a maximal
              fixpoint:</div>
            <div>
              <div>        // Recurse, but cap the recursion to one
                level, because we don't</div>
              <div>        // want to waste time spinning around in
                loops.</div>
            </div>
            <div>This is because it is working backwards, on demand, and
              so phi computation sometimes cannot reach a fixpoint in a
              sane amount of time, so it gets cut off.<br>
            </div>
            <div>Solving the forwards problem properly, in the correct
              order, will *always* give the same answer from the same
              starting point, given the same set of invalidation.<br>
            </div>
            <div><br>
            </div>
            <div>Note that this in turn affects verification.  Because
              the on-demand version has cut-offs that effectively depend
              on visitation depth, the cached version will get different
              answers depending on how much is cached, because it will
              need to go less deep to get a specific answer.</div>
            <div>This depends on visitation order.</div>
            <div>This, in turn, makes it really hard to write a verifier
              :)  You can't just compare cached vs non-cached.  The
              cached version may get better or worse answers depending
              on visitation order (incremental recomputation never does
              in this case, assuming you invalidate properly in both
              cases).<br>
            </div>
            <div><br>
            </div>
            <div>I'm pretty sure *you* get this, but for those following
              along:<br>
              Imagine you have a phi</div>
            <div>phi(a, b, c, d)</div>
            <div><br>
            </div>
            <div>Currently, while walking this phi, it will only recurse
              a certain depth to find an answer.  Let's pretend it's
              depth 1.</div>
            <div>Let's say that, given the initial order in which you
              call compute demanded bits (you call it on the phi before
              a, b, c, d), they are all too deep for it to find an
              answer; it can't visit deep enough to compute a, b, c and
              d</div>
            <div><br>
            </div>
            <div>Thus, if you start from *nothing*, the answer you get
              for the phi is:<br>
              "unknown".</div>
            <div>Now, computing the phi is separate from computing the
              answer for a, b, c, d.</div>
            <div>So next, you visit a, b, c, and d, and compute their
              answers.</div>
            <div>It can successfully do so (because the depth limit
              applies to the phi, and even if it didn't, the depth limit
              for these guys is one more than it was from the phi)</div>
            <div><br>
            </div>
            <div>Now, if you invalidate the phi in the cache (assume
              it's just an innocent victim and the value did not really
              change, it just "may have"), and ask it to compute the
              answer:<br>
              <br>
            </div>
            <div>lo and behold, now it has the answers for a, b, c and
              d, and can give you an answer for the phi.</div>
            <div><br>
            </div>
            <div>So the cached version gave us a better answer.<br>
            </div>
            <div><br>
            </div>
            <div>The real reason is that it got closer to the maximal
              fixpoint than the original version.</div>
            <div><br>
              But still, it is much harder to verify, because the answer
              really is allowed to vary depending on the order in which
              you ask it for answers.</div>
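            <div>The phi story above can be reproduced in a few lines.
              This is a deliberately tiny model (all of it invented,
              nothing like the real ValueTracking code): a depth-capped,
              memoized "is this value known?" query whose answer for the
              phi depends on whether its operands happened to be queried
              (and cached) first.</div>

```python
# Toy model (invented, not the real ValueTracking code) of why a cached,
# depth-limited, on-demand analysis is order-dependent: the same query
# on the same IR returns different answers depending on what happens to
# already be in the cache.

MAX_DEPTH = 1   # mirror the "cap the recursion to one level" cutoff

def known(value, graph, cache, depth=0):
    # Toy query: is `value` fully resolvable?  Only resolved ("known")
    # answers are cached; an "unknown" caused by the depth cutoff says
    # nothing about the value, only about how deep this query looked.
    if value in cache:
        return cache[value]            # a cache hit costs no depth
    if depth > MAX_DEPTH:
        return "unknown"               # recursion cap reached
    if not graph[value]:               # a leaf: trivially known
        cache[value] = "known"
        return "known"
    if all(known(op, graph, cache, depth + 1) == "known"
           for op in graph[value]):
        cache[value] = "known"
        return "known"
    return "unknown"

# phi(a, b, c, d), where each operand is one hop away from a leaf.
graph = {"phi": ["a", "b", "c", "d"],
         "a": ["la"], "b": ["lb"], "c": ["lc"], "d": ["ld"],
         "la": [], "lb": [], "lc": [], "ld": []}

cold = known("phi", graph, {})         # operands too deep: "unknown"

cache = {}
for v in ("a", "b", "c", "d"):         # visit the operands first...
    known(v, graph, cache)
warm = known("phi", graph, cache)      # ...and the same query resolves

print(cold, warm)   # unknown known
```

            <div>Re-asking about the phi once the operands have been
              visited is exactly the "lo and behold" step: the cache,
              not the IR, changed the answer, which is what makes a
              cached-vs-uncached verifier so hard to write.</div>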
            <div><br>
            </div>
            <div>The forwards incremental computation has exactly zero
              of these problems.</div>
            <div>It always computes the same answer.</div>
            <div>It only varies the time taken to do so.</div>
            <div><br>
            </div>
            <div> <br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF">Instead of assuming by default that
                the replacement instruction has the same known bits as
                the original, we assume that it does not (and then
                recursively clear out all users).<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>If they were computed in the same direction, with the
              same propagation  strategy, yes, i would agree.</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br></div></div>
    In general, I agree with all of this. To be clear, I posted this
    patch so that we could make some measurements of InstCombine with
    the known-bits computation cost stabilized. </div></blockquote><div>Of course. I think the work is very valuable. I was more opposed to the immediate response i saw from some others of "let's just take this and push it in because it's our only real option".</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">Given that we compute
    known bits on essentially every relevant instruction, doing this
    ahead of time makes perfect sense (i.e. aside from all of the
    disadvantages, we likely gain very little from the laziness).<br></div></blockquote><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    <br>
    I do want to explicitly clarify the following point, however. In the
    analysis as you imagine it, what exactly do we do when we replace an
IR value (or update an instruction's flags)?<br></div></blockquote><div>In the current world, your only option is to have the optimization pass notify the analysis, explicitly, through a call, that a piece of IR "X" got invalidated.</div><div>We could build something else, but i'd start there anyway, because the exact mechanism through which the analysis learns of invalidation doesn't change what it does internally.</div><div><br></div><div>Once it gets the notification, it has two options:</div><div>1. It throws it on a list, and the next time you ask it for a result, it does the recomputation (which presents the same interface as on-demand, but isn't really the same internally).</div><div>2. It does the recomputation immediately.</div><div><br></div><div>Both are valid options.</div><div><br></div><div>It's trickier if you *also* invalidate the CFG, but that's handleable too.</div><div>i'm going to assume we aren't invalidating the CFG here for a second.</div><div><br></div><div>In practice, there is not a ton of difference between the two internally.</div><div>In *either* case, you shove the names of the pieces of IR that got invalidated onto a list, grow the list to the bounds of what needs to be recomputed, and do it.</div><div><br></div><div>For an SSA-based analysis that is single-pass, you can just invalidate the users.</div><div>For an SSA-based analysis that is optimistic/iterative, the truly easy solution is to convert it to use SCCs of the def-use graph, which should be amazingly trivial: if you could get a maximal result based on def-use propagation, you will get a maximal result by iterating SCCs.</div><div><br></div><div>1. Form SCCs of the SSA graph; Tarjan's algorithm will give you a topological ordering.</div><div>For each SCC in topo order, iterate your analysis on the SCC's values until it stops changing.</div><div>This is guaranteed to give a maximal result, and the condensed graph of SCCs is acyclic.</div><div><br></div><div>2. 
When something is invalidated, you invalidate the results for the SCC and reiterate the SCC.</div><div>If the result ends up the same as it was before, you are done.</div><div>If it changes, you invalidate the SCCs that use values from the current SCC (using the topo info), and recompute them.</div><div>Keep going along each path of using SCCs in that order until the result stops changing along each path.</div><div>When you run out of paths, you are done.</div><div><br></div><div><br></div><div>This is, of course, not the only way, and not even guaranteed to be the fastest; it just makes the analysis easy to do and verify, and is usually quite fast.</div><div><br></div><div>Note: if your value change modified the SCC structure, that too can be updated fairly easily.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">
    Thanks again,<br>
    Hal<div><div class="m_-4421022136755295261h5"><br>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div> <br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div bgcolor="#FFFFFF"> <br>
                 -Hal<span class="m_-4421022136755295261m_-6247637909381991151gmail-"><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div class="gmail_extra">
                        <div class="gmail_quote">
                          <div>The compile time hit depends on how often
                            you want answers.  Usually people try to say
                            "i want them always up to date" (often
                            without any quantification of cost or other
                            trade-offs), which leads to trying to do it
                            on-demand</div>
                          <div><br>
                          </div>
                          <div>Caching is really  "build something that
                            is always computed on demand.  Try to make
                            it faster by storing the answers it gives
                            somewhere, and hoping that the rest of the
                            algorithm retains the ability to give
                            correct answers on demand when you remove
                            some of the stored answers".</div>
                          <div><br>
                          </div>
                          <div>One can obviously argue that incremental
                            recomputation is a form of caching, so i'm
                            trying to be concrete about my definitions.</div>
                          <div><br>
                          </div>
                          <div>Usually, caching additionally requires
                            that clients be able to correctly predict
                            and remove the set of answers necessary for
                            correctness (i.e., know enough about how the
                            algorithm works), but this isn't strictly
                            the case if designed well.</div>
                          <div><br>
                          </div>
                          <div>Caching also often ends up with degraded
                            answers over time, while most full analyses
                            can be made to not do so.  See, for example,
                            memdep, which has unresolvable cache
                            invalidation issues.  This means, for
                            example, opt -gvn -gvn gives different
                            answers than opt -gvn | opt -gvn.</div>
                          <div><br>
                          </div>
                          <div>Both approaches obviously require
                            information about what has changed in the
                            IR.</div>
                          <div><br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div dir="ltr">
                              <div> I additionally claim that we should
                                actually quantify the performance
                                benefits of doing either over
                                "recomputing once or twice during the
                                pipeline" before going down this route.
                                <div><br>
                                </div>
                                <div>We are *already* trading off
                                  compile time performance vs generated
                                  code performance. </div>
                                <div><br>
                                </div>
                                <div>All other things being equal, we
                                  should trade it off in a way that
                                  makes maintenance of code and
                                  verification of correctness as easy as
                                  possible.</div>
                                <div><br>
                                </div>
                                <div>Asserting that a good tradeoff to
                                  reduce compile time is "cache" instead
                                  of "stop doing it on demand" (or
                                  incrementally recompute), without any
                                  data other than compile time
                                  performance seems ... not particularly
                                  scientific.</div>
                                <div><br>
                                </div>
                                <div>It's certainly true that caching is
                                  often posited as "easier", but i think
                                  a trip through bugzilla would put this
                                  idea to rest.</div>
                                <div><br>
                                </div>
                                <div><br>
                                </div>
                              </div>
                            </div>
                            <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482HOEnZb">
                              <div class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482h5">
                                <div class="gmail_extra"><br>
                                  <div class="gmail_quote">On Fri, Mar
                                    24, 2017 at 3:33 PM, Craig Topper
                                    via Phabricator <span dir="ltr"><<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>></span>
                                    wrote:<br>
                                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">craig.topper
                                      added a comment.<br>
                                      <br>
                                      I'm seeing more problems than just
                                      nsw/nuw flags here.<br>
                                      <br>
                                      sext.ll test is failing because
                                      SimplifyDemandedInstructionBits
                                      determined that this<br>
                                      <br>
                                      %and = and i32 %x, 16<br>
                                      <br>
                                        shl i32 %and, 27<br>
                                      <br>
                                      Simplified to just the shl because
                                      we were only demanding the MSB of
                                      the shift. This occurred after we
                                      had cached the known bits for the
                                      shl as having 31 lsbs as 0. But
                                      without the "and" in there we can
                                      no longer guarantee the lower bits
                                      of the shift result are 0.<br>
                                      <br>
                                      I also got a failure on shift.ll
                                      not reported here. This was caused
                                      by getShiftedValue rewriting
                                      operands and changing constants of
                                      other shifts. This effectively
                                      shifts the known bits, and thus
                                      the cache needs to be
                                      invalidated.<br>
                                      <br>
                                      <br>
                                      <a href="https://reviews.llvm.org/D31239" rel="noreferrer" target="_blank">https://reviews.llvm.org/D31239</a><br>
                                      <br>
                                      <br>
                                      <br>
                                    </blockquote>
                                  </div>
                                  <br>
                                </div>
                              </div>
                            </div>
                          </blockquote>
                        </div>
                        <br>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span><span class="m_-4421022136755295261m_-6247637909381991151gmail-HOEnZb"><font color="#888888">
                    <pre class="m_-4421022136755295261m_-6247637909381991151gmail-m_1179969366190122482moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
                  </font></span></div>
            </blockquote>
          </div>
          <br>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="m_-4421022136755295261m_-6247637909381991151moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </div></div></div>

</blockquote></div><br></div></div>