<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Sep 19, 2018 at 10:06 AM Philip Reames via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    <p><br>
    </p>
    <br>
    <div class="m_4704851736071475128moz-cite-prefix">On 09/18/2018 02:50 PM, Alina Sbirlea
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr"><br>
        <div class="gmail_quote">
          <div dir="ltr">On Fri, Sep 14, 2018 at 4:25 PM Philip Reames
            <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>This is going OT from the original thread, but, what
                the heck...</p>
            </div>
          </blockquote>
          <div>Sorry, not my intention, I was just giving another reason
            why getting promotion done in LICM differently would be
            helpful. </div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>Alina, can you explain the challenge with implementing
                promotion over MemorySSA?  On the surface, it seems like
                it should be fairly straight forward to provide an alias
                set like abstraction over MemorySSA.  What am I missing?</p>
            </div>
          </blockquote>
          <div>Yes, it is straight forward to have alias sets on top of
            MemorySSA, but it will likely lose most of MemorySSA's
            benefits. <br>
          </div>
        </div>
      </div>
    </blockquote>
    Ah, understood.  I think this was the core source of our confusion. 
    I was thinking "how to we migrate to using MemorySSA".  You were
    thinking "how do we exploit memory ssa".  :)<br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>Here's a sketch of how I'd think to approach this:</p>
              <p>Visit the MemoryPhi nodes in a loop.  Every interesting
                promotion case (mod in loop) must be involved in at
                least one MemoryPhi cycle.  <br>
              </p>
              <p>For each MemoryPhi in the header, create a set of all
                MemoryDef instructions involved.  (I'm saying this in
                terms of instructions, but sets of MemoryLocations
                should be equivalent.)<br>
              </p>
              <p>For each instruction in loop, identify it's (optimized)
                memory def.  If that def is outside loop, and we're
                looking at a load/store, then we can trivially
                hoist/sink (aliasing wise only).  If the def is in the
                loop, it must be involved in one of our cycles, add it
                to related alias set.</p>
              <p>When I last looked, we were debating whether MemoryPhis
                were optimized by default.  Did we end up deciding they
                weren't?  If so, I can definitely see the problem
                there.  Even then, constructing an AST for only the
                instructions involved in a loop MemoryPhi cycle should
                be pretty cheap. <br>
              </p>
            </div>
          </blockquote>
          <div>AFAIK there is no notion of MemoryPhis being optimized.
            Each block has a *single* MemoryPhi, which has the last Def
            from each incoming block. MemoryDefs and MemoryUses can be
            optimized through Phis. MemoryDefs are also not optimized
            from the start, only MemoryUses are.</div>
          <div><br>
          </div>
          <div>All MemoryDefs build a single def chain, regardless of
            whether they alias or not. Doing alias sets means checking
            alias() at least for all pairs of Defs, since alias relation
            is not transitive. That is, even if we optimize all
            MemoryDefs, we may still need additional alias() calls to
            build those sets. But the cheaper way here is not to
            optimize all Defs, but instead do alias/modref calls only
            for the Def we're processing currently to determine if
            *that* can be moved (+ cache those results in some new form
            of alias sets if a "leave this in the loop" answer is
            received).</div>
          <div><br>
          </div>
          <div>We may also need additional alias() checks for Uses,
            since, again, due to non-transitivity, a Use's clobbering
            access may be a Def with which it MayAlias (let's call it
            D1), but that D1 may be NoAlias with some D2, giving no info
            on the (Use, D2) alias relation. If NoAlias, D2 can be
            promoted, if MustAlias, the set (Use, D2) can be promoted,
            if MayAlias, we cannot promote anything.</div>
        </div>
      </div>
    </blockquote>
    Putting the above in my own words to confirm understanding:<br>
    <br>
    We can form alias sets by starting with all memory defs in the loop
    (which must be part of a single MemoryDef/MemoryPhi cycle). 
    Applying existing AS::add to these instructions gives us the set of
    possibly modifying alias sets (but without ref info yet).  We can
    then visit each MemoryUse and add it to the alias set of it's
    (optimized) MemoryDef if that def is in the loop, or introduce a new
    ref only AS if not.  <br></div></blockquote><div><br></div><div>AFAICT, that would be a correct implementation.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>The AliasSetTracker has a cap after which it gives up and
            merges all sets, and so will an implementation of sets with
            MemorySSA, since the cost is the same: lots of alias/modref
            calls. Granted, MemorySSA has fewer, especially for Uses,
            but we will still need a cap for pathological cases, while
            getting benefits for small loops.<br>
          </div>
        </div>
      </div>
    </blockquote>
    Agreed.<br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>Trying to get back to the original topic. The reason
            promotion is so difficult in the current form, is that it
            promotes a whole AliasSet. And that AliasSet is built based
            on conservative alias decisions. If we were to split the
            promotion problem into a PRE for loads, then sink for
            stores, the problem becomes more manageable as far as the
            number of alias calls; we'd process one memory-accessing
            instruction at a time, and, with the right processing order,
            MemorySSA should be a better fit.</div>
        </div>
      </div>
    </blockquote>
    Er, huh?  We've already hoisted/sunk all of the loads which would
    form AliasSets w/o Mod set.  All that's left by the point we're
    doing promotion are alias sets with Mods and (optionally) Refs. 
    Aren't we pretty much stuck with doing the precise aliasing to
    ensure MustAlias at that point?  Or is your point that we can delay
    the heavy weight work until after hoist/sink has been done?  If so,
    then yes, I agree.  <br></div></blockquote><div><br></div><div>Yes, we would ideally delay this until after hoist/sink is done. It's also why in the draft promotion I had, I was attempting to build alias sets just before promotion, not when LICM starts.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
    <br>
    p.s. I added store hoisting to LICM recently which does require
    mustalias mod information during hoisting.  This could be separate
    into a later transform easily though.<br></div></blockquote><div> </div><div>Yes, I noticed. It's the one thing MemorySSA cannot match right now for sinking due to needing alias sets :) (before this change, it was just promotion).</div><div>But it is a well justified change for the AST; if it spent all that time computing the sets already, might as well use that.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p> </p>
              <p><br>
              </p>
              <p>Separately, can you summarize what the overall status
                of MemorySSA is?  I've lost track.  It looks like it's
                enabled by default for EarlyCSE.  So, does that mean we
                believe the bugs have been worked out and we "just" need
                to plumb it through the pipeline? <br>
              </p>
            </div>
          </blockquote>
          <div>Sure!</div>
          <div><br>
          </div>
          <div>MemorySSA is fairly stable right now (i.e. no known
            bugs). I've been working on extending the APIs for updating
            it (e.g. <span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45299" style="text-decoration-line:none" target="_blank">D45299</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48396" style="text-decoration-line:none" target="_blank">D48396</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48897" style="text-decoration-line:none" target="_blank">D48897</a></span>),
            and added the plumbing to update it in the Utils (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D48790" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D48790</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D45300" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D45300</a></span>)
            and through the loop pass pipeline (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50906" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D50906</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50911" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D50911</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45301" style="text-decoration-line:none" target="_blank">D45301</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D47022" style="text-decoration-line:none" target="_blank">D47022</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51718" style="text-decoration-line:none" target="_blank">D51718</a></span>). </div>
          <div>LICM is the loop pass I've picked back up now, and the
            first that also uses MemorySSA vs just updating it. Patches
            in Phabricator are fairly old (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D40375" style="text-decoration-line:none" target="_blank">D40375</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D35741" style="text-decoration-line:none" target="_blank">D35741</a></span>).
            I'm going to upload the rebased and updated versions once
            they're ready for review.</div>
          <div>The actual use/update is currently disabled by default
            (flag to enable: EnableMSSALoopDependency).</div>
          <div><br>
          </div>
          <div>I also recently became aware of EarlyCSE using MemorySSA.
            There are no correctness issues with that but there may be
            missed optimizations in subsequent passes using MemorySSA (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51960" style="text-decoration-line:none" target="_blank">D51960</a></span> for
            details and to avoid going even more off-topic) .</div>
          <div><br>
          </div>
          <div>Thanks,</div>
          <div>Alina</div>
          <div><br>
            ><br>
            > Philip</div>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF"> <br>
              <div class="m_4704851736071475128m_4777266497825329610moz-cite-prefix">On
                09/13/2018 02:00 PM, Alina Sbirlea wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">For context, I've been looking into
                  replacing the use of AST (AliasSetTracker) with
                  MemorySSA in LICM. This works great for
                  sinking/hoisting but does not apply well for
                  promotion, and one of the solutions I considered is
                  precisely ripping out the promotion part and replacing
                  its functionality with a separate PRE pass + possibly
                  store sinking. FWIW I think that's the right way to
                  go.
                  <div>I did not get deep enough into working on the
                    solution but I would gladly have a detailed
                    discussion to move this forward.</div>
                  <div><br>
                  </div>
                  <div>Reading into detail now.</div>
                  <div><br>
                  </div>
                  <div>Thanks,</div>
                  <div>Alina</div>
                </div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr">On Thu, Sep 13, 2018 at 1:43 PM Philip
                    Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                    <div text="#000000" bgcolor="#FFFFFF"> (minor inline
                      additions)<br>
                      <br>
                      On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>
                      <blockquote type="cite">
                        <div dir="ltr">Haven't had time to dig into
                          this, but wanted to add <a class="m_4704851736071475128m_4777266497825329610m_-2310710337215009520GWVZpf
m_4777266497825329610m_-2310710337215009520gW" id="m_4704851736071475128m_4777266497825329610m_-2310710337215009520IloFPc-1" href="mailto:asbirlea@google.com" target="_blank">+Alina
                            Sbirlea</a> to the thread as she has been
                          working on promotion and other aspects of LICM
                          for a long time here.</div>
                      </blockquote>
                      Thanks!<br>
                      <blockquote type="cite">
                        <div class="gmail_quote">
                          <div dir="ltr">On Wed, Sep 12, 2018 at 11:41
                            PM Philip Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
                            wrote:<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p>I'm thinking about making some semi
                                radical changes to load store promotion
                                works in LICM, and figured it would be a
                                good idea to get buy in before I
                                actually started writing code.  :)</p>
                              <p>TLDR: legality of sinking stores to
                                exits is hard, can we separate load
                                handling into a super aggressive form of
                                PRE, and use predicated stores to avoid
                                solving legality question?<br>
                              </p>
                              <p><br>
                              </p>
                              <p>Background<br>
                              </p>
                              <p>We've been seeing an interesting class
                                of problems recently that looks roughly
                                like this:</p>
                              <p>for (int = 0; i < N; i++)<br>
                                  if (a[i] == 0) // some data dependent
                                check<br>
                                    g_count++; // conditional load and
                                store to shared location<br>
                              </p>
                              <p>The critical detail in this example is
                                that g_count is a global location which
                                may be accessed concurrently* by
                                multiple threads.  The challenge here is
                                that we don't know whether the store
                                ever executes, and we're not allowed to
                                insert a store along any path that
                                didn't originally contain them.  Said
                                differently, if all elements in "a" are
                                non-zero, we're not allowed to store to
                                g_count.  We do know that the g_count
                                location is dereferenceable though.  <br>
                              </p>
                              <p>(*Please, let's avoid the memory model
                                derailment here.  I'm simplifying and
                                C++ language rules aren't real useful
                                for my Java language frontend anyways. 
                                In practice, all the access are atomic,
                                but unordered, but we'll leave that out
                                of discussion otherwise.)</p>
                              <p>I have two approaches I'm considering
                                here.  These are orthogonal, but I
                                suspect we'll want to implement both.</p>
                              <p><br>
                              </p>
                              <p>Proposal A - Use predicated stores in
                                loop exits</p>
                              <p>The basic idea is that we don't worry
                                about solving the legality question
                                above, and just insert a store which is
                                predicated on a condition which is true
                                exactly when the original store ran.  In
                                pseudo code, this looks something like:</p>
                              <p>bool StoreLegal = false;<br>
                                int LocalCount = g_count;<br>
                                for (int = 0; i < N; i++)<br>
                                  if (a[i] == 0) {<br>
                                    LocalCount++;<br>
                                    StoreLegal = true;<br>
                                  }<br>
                                if (StoreLegal) g_count = LocalCount; <br>
                              </p>
                              <p>There are two obvious concerns here:</p>
                              <ol>
                                <li>The predicated store might be
                                  expensive in practice - true for most
                                  current architectures.<br>
                                </li>
                                <li>We''re introducing an extra boolean
                                  phi cycle around the loop.  <br>
                                </li>
                              </ol>
                              <p>Talking to a couple of folks offline at
                                the socials over the last few months,
                                the former seems to be the main
                                objection.  I think we can control this
                                by restricting this transform to when
                                the loop is believed to have a high trip
                                count and the conditional path is taken
                                with some frequency.   Getting feedback
                                on this particular point is one of my
                                main reasons for writing this mail.  <br>
                              </p>
                              <p>The second objection can frequently be
                                resolved by finding other expressions
                                which evaluate to the same boolean.  (In
                                this case, if LocalCount !=
                                LocalCountOrig assuming i doesn't
                                overflow.)  We already have a framework
                                with SCEV to do these replacements. 
                                Though, from some quick testing, it
                                would definitely need strengthening. 
                                However, SCEV can't remove the extra phi
                                in all cases, so we have to be okay with
                                the extra phi cycle in the general
                                case.  This seems worthwhile to me, but
                                thoughts?<br>
                              </p>
                              <p><br>
                              </p>
                              <p>Proposal B - Separate load and store
                                handling into distinct transforms</p>
                              <p>(For folks I've discussed this with
                                before, this part is all new.)</p>
                              <p>Thinking about the problem, it finally
                                occurred to me that we can decompose the
                                original example into two steps: getting
                                the loads out of the loop, and sinking
                                the stores out of the loop.  If we can
                                accomplish the former, but not the
                                later, we've still made meaningful
                                progress.  <br>
                              </p>
                              <p>So, what'd we'd essentially have is a
                                load only transformation which produces
                                this:<br>
                                int LocalCount = g_count;<br>
                                for (int = 0; i < N; i++)<br>
                                  if (a[i] == 0) {<br>
                                    LocalCount++;<br>
                                    g_count = LocalCount;<br>
                                  }</p>
                              <p>At this point, we've reduced memory
                                traffic by half, and opened up the
                                possibility that other parts of the
                                optimizer can exploit the simplified
                                form.  The last point is particular
                                interesting since we generally try to
                                canonicalize loads out of loops, and
                                much of the optimizer is tuned for a
                                form with as much as possible being loop
                                invariant.  As one example, completely
                                by accident, there's some work going on
                                in the LoopVectorizer right now to
                                handle such stores to loop invariant
                                addresses during vectorization.  Putting
                                the two pieces together would let us
                                fully vectorize this loop without
                                needing to sink the stores at all.   </p>
                              <p>In practice, the existing
                                implementation in LICM would cleanly
                                split along these lines with little
                                problem.  <br>
                              </p>
                              <p>One way of looking at this load
                                specific transform is as an extreme form
                                of PRE (partial redundancy
                                elimination).  Our current PRE
                                implementation doesn't handle cases this
                                complicated. <br>
                              </p>
                            </div>
                          </blockquote>
                        </div>
                      </blockquote>
                      It occurred to my later that simply framing the
                      new transform as a separate pass (LoopPRE) and
                      using the same AST + SSA construction approach
                      would be straight forward.  So, if folks think
                      that having an aggressive form of load PRE in LICM
                      is going a bit too far, it'd be easy to represent
                      as an optional separate pass.  I'd still prefer
                      having LICM contain the logic though.  <br>
                      <blockquote type="cite">
                        <div class="gmail_quote">
                          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p> </p>
                              <p>Thoughts?</p>
                            </div>
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p>Philip<br>
                              </p>
                            </div>
                          </blockquote>
                        </div>
                      </blockquote>
                      <br>
                    </div>
                  </blockquote>
                </div>
              </blockquote>
              <br>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
    <br>
  </div>

_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div></div>