<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p><br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 09/18/2018 02:50 PM, Alina Sbirlea
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CALV8tXMeJPGpxrFK_zQDgXyk2rPa1VNJipX5OR+YotEB-6gJZg@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=utf-8">
      <div dir="ltr"><br>
        <div class="gmail_quote">
          <div dir="ltr">On Fri, Sep 14, 2018 at 4:25 PM Philip Reames
            <<a href="mailto:listmail@philipreames.com"
              moz-do-not-send="true">listmail@philipreames.com</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>This is going OT from the original thread, but, what
                the heck...</p>
            </div>
          </blockquote>
          <div>Sorry, not my intention, I was just giving another reason
            why getting promotion done in LICM differently would be
            helpful. </div>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>Alina, can you explain the challenge with implementing
                promotion over MemorySSA?  On the surface, it seems like
                it should be fairly straightforward to provide an
                alias-set-like abstraction over MemorySSA.  What am I missing?</p>
            </div>
          </blockquote>
          <div>Yes, it is straightforward to have alias sets on top of
            MemorySSA, but doing so will likely lose most of MemorySSA's
            benefits. <br>
          </div>
        </div>
      </div>
    </blockquote>
    Ah, understood.  I think this was the core source of our confusion.
    I was thinking "how do we migrate to using MemorySSA".  You were
    thinking "how do we exploit MemorySSA".  :)<br>
    <blockquote type="cite"
cite="mid:CALV8tXMeJPGpxrFK_zQDgXyk2rPa1VNJipX5OR+YotEB-6gJZg@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p>Here's a sketch of how I'd think to approach this:</p>
              <p>Visit the MemoryPhi nodes in a loop.  Every interesting
                promotion case (mod in loop) must be involved in at
                least one MemoryPhi cycle.  <br>
              </p>
              <p>For each MemoryPhi in the header, create a set of all
                MemoryDef instructions involved.  (I'm saying this in
                terms of instructions, but sets of MemoryLocations
                should be equivalent.)<br>
              </p>
              <p>For each instruction in the loop, identify its
                (optimized) memory def.  If that def is outside the
                loop, and we're looking at a load/store, then we can
                trivially hoist/sink (aliasing-wise only).  If the def
                is in the loop, it must be involved in one of our
                cycles; add it to the related alias set.</p>
              <p>When I last looked, we were debating whether MemoryPhis
                were optimized by default.  Did we end up deciding they
                weren't?  If so, I can definitely see the problem
                there.  Even then, constructing an AST for only the
                instructions involved in a loop MemoryPhi cycle should
                be pretty cheap. <br>
              </p>
            </div>
          </blockquote>
          <div>AFAIK there is no notion of MemoryPhis being optimized.
            Each block has a *single* MemoryPhi, which has the last Def
            from each incoming block. MemoryDefs and MemoryUses can be
            optimized through Phis. MemoryDefs are also not optimized
            from the start, only MemoryUses are.</div>
          <div><br>
          </div>
          <div>All MemoryDefs build a single def chain, regardless of
            whether they alias or not. Doing alias sets means checking
            alias() at least for all pairs of Defs, since the alias
            relation is not transitive. That is, even if we optimize all
            MemoryDefs, we may still need additional alias() calls to
            build those sets. But the cheaper way here is not to
            optimize all Defs, but instead do alias/modref calls only
            for the Def we're processing currently to determine if
            *that* can be moved (+ cache those results in some new form
            of alias sets if a "leave this in the loop" answer is
            received).</div>
          <div><br>
          </div>
          <div>We may also need additional alias() checks for Uses,
            since, again, due to non-transitivity, a Use's clobbering
            access may be a Def with which it MayAlias (let's call it
            D1), but that D1 may be NoAlias with some D2, giving no info
            on the (Use, D2) alias relation. If NoAlias, D2 can be
            promoted, if MustAlias, the set (Use, D2) can be promoted,
            if MayAlias, we cannot promote anything.</div>
        </div>
      </div>
    </blockquote>
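A stand-alone illustration of the pairwise cost described above: because the alias relation is not transitive, grouping n defs into alias sets needs an alias() query for every pair, i.e. O(n^2) calls.  Everything below (the oracle, the names) is invented for the sketch; it is not the LLVM API.<br>

```c
enum AliasResult { NoAlias, MayAlias, MustAlias };

#define MAX_DEFS 8

/* Toy oracle standing in for an alias-analysis query: two defs alias
 * iff they touch the same underlying object.  alias_calls counts how
 * many queries set construction issues. */
static int object_of[MAX_DEFS];
static int alias_calls;

static enum AliasResult alias_query(int a, int b) {
  ++alias_calls;
  return object_of[a] == object_of[b] ? MustAlias : NoAlias;
}

/* Union-find over defs; merge the sets of every pair that may alias.
 * Because the relation is not transitive in general, no pair can be
 * skipped: that is the quadratic cost. */
static int parent[MAX_DEFS];

static int find_set(int x) {
  while (parent[x] != x)
    x = parent[x] = parent[parent[x]];
  return x;
}

static void build_alias_sets(int ndefs) {
  alias_calls = 0;
  for (int i = 0; i < ndefs; ++i)
    parent[i] = i;
  for (int i = 0; i < ndefs; ++i)
    for (int j = i + 1; j < ndefs; ++j)
      if (alias_query(i, j) != NoAlias)
        parent[find_set(i)] = find_set(j);
}
```

In this toy run, four defs over two underlying objects collapse into two sets only after all six pairwise queries, which is why a saturation cap (as in the AliasSetTracker) stays necessary for pathological inputs.<br>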
    Putting the above in my own words to confirm understanding:<br>
    <br>
    We can form alias sets by starting with all memory defs in the loop
    (which must be part of a single MemoryDef/MemoryPhi cycle). 
    Applying existing AS::add to these instructions gives us the set of
    possibly modifying alias sets (but without ref info yet).  We can
    then visit each MemoryUse and add it to the alias set of its
    (optimized) MemoryDef if that def is in the loop, or introduce a new
    ref-only AS if not.  <br>
    <br>
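The paraphrase above can be sketched as a toy, self-contained routine (all names invented; this is not the LLVM API): in-loop defs seed the mod sets, and each use either joins its clobbering def's set or opens a new ref-only set.<br>

```c
#define MAX_ACC 16

enum Kind { Def, Use };

/* clobber: index of the access's (optimized) clobbering def, or -1 if
 * the clobbering access lies outside the loop. */
struct Access { enum Kind kind; int clobber; };

static int set_of[MAX_ACC];   /* alias-set id per access */
static int ref_only[MAX_ACC]; /* per set id: 1 if the set has no mods */
static int nsets;

static void build_promotion_sets(const struct Access *acc, int n) {
  nsets = 0;
  /* Pass 1: every in-loop def seeds a mod set.  (The real AS::add
   * would additionally merge defs that may alias each other.) */
  for (int i = 0; i < n; ++i)
    if (acc[i].kind == Def) {
      set_of[i] = nsets;
      ref_only[nsets++] = 0;
    }
  /* Pass 2: a use joins its clobbering def's set when that def is in
   * the loop; otherwise it opens a fresh ref-only set. */
  for (int i = 0; i < n; ++i)
    if (acc[i].kind == Use) {
      if (acc[i].clobber >= 0)
        set_of[i] = set_of[acc[i].clobber];
      else {
        set_of[i] = nsets;
        ref_only[nsets++] = 1;
      }
    }
}
```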
    <blockquote type="cite"
cite="mid:CALV8tXMeJPGpxrFK_zQDgXyk2rPa1VNJipX5OR+YotEB-6gJZg@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>The AliasSetTracker has a cap after which it gives up and
            merges all sets, and so will an implementation of sets with
            MemorySSA, since the cost is the same: lots of alias/modref
            calls. Granted, MemorySSA has fewer, especially for Uses,
            but we will still need a cap for pathological cases, while
            getting benefits for small loops.<br>
          </div>
        </div>
      </div>
    </blockquote>
    Agreed.<br>
    <blockquote type="cite"
cite="mid:CALV8tXMeJPGpxrFK_zQDgXyk2rPa1VNJipX5OR+YotEB-6gJZg@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <div>Trying to get back to the original topic. The reason
            promotion is so difficult in its current form is that it
            promotes a whole AliasSet. And that AliasSet is built based
            on conservative alias decisions. If we were to split the
            promotion problem into a PRE for loads, then sink for
            stores, the problem becomes more manageable as far as the
            number of alias calls; we'd process one memory-accessing
            instruction at a time, and, with the right processing order,
            MemorySSA should be a better fit.</div>
        </div>
      </div>
    </blockquote>
    Er, huh?  We've already hoisted/sunk all of the loads which would
    form AliasSets without a Mod set.  All that's left by the time we're
    doing promotion are alias sets with Mods and (optionally) Refs.
    Aren't we pretty much stuck with doing the precise aliasing to
    ensure MustAlias at that point?  Or is your point that we can delay
    the heavyweight work until after hoist/sink has been done?  If so,
    then yes, I agree.  <br>
    <br>
    p.s. I added store hoisting to LICM recently, which does require
    MustAlias mod information during hoisting.  This could easily be
    separated into a later transform though.<br>
    <blockquote type="cite"
cite="mid:CALV8tXMeJPGpxrFK_zQDgXyk2rPa1VNJipX5OR+YotEB-6gJZg@mail.gmail.com">
      <div dir="ltr">
        <div class="gmail_quote">
          <div><br>
          </div>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF">
              <p> </p>
              <p><br>
              </p>
              <p>Separately, can you summarize what the overall status
                of MemorySSA is?  I've lost track.  It looks like it's
                enabled by default for EarlyCSE.  So, does that mean we
                believe the bugs have been worked out and we "just" need
                to plumb it through the pipeline? <br>
              </p>
            </div>
          </blockquote>
          <div>Sure!</div>
          <div><br>
          </div>
          <div>MemorySSA is fairly stable right now (i.e. no known
            bugs). I've been working on extending the APIs for updating
            it (e.g. <span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45299" style="text-decoration-line:none" moz-do-not-send="true">D45299</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48396" style="text-decoration-line:none" moz-do-not-send="true">D48396</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48897" style="text-decoration-line:none" moz-do-not-send="true">D48897</a></span>),
            and added the plumbing to update it in the Utils (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D48790" style="font-family:Arial;font-size:14.6667px;white-space:pre;text-decoration-line:none" moz-do-not-send="true">D48790</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D45300" style="font-family:Arial;font-size:14.6667px;white-space:pre;text-decoration-line:none" moz-do-not-send="true">D45300</a></span>)
            and through the loop pass pipeline (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50906" style="font-family:Arial;font-size:14.6667px;white-space:pre;text-decoration-line:none" moz-do-not-send="true">D50906</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50911" style="font-family:Arial;font-size:14.6667px;white-space:pre;text-decoration-line:none" moz-do-not-send="true">D50911</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45301" style="text-decoration-line:none" moz-do-not-send="true">D45301</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D47022" style="text-decoration-line:none" moz-do-not-send="true">D47022</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51718" style="text-decoration-line:none" moz-do-not-send="true">D51718</a></span>). </div>
          <div>LICM is the loop pass I've picked back up now, and the
            first that also uses MemorySSA vs just updating it. Patches
            in Phabricator are fairly old (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D40375" style="text-decoration-line:none" moz-do-not-send="true">D40375</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D35741" style="text-decoration-line:none" moz-do-not-send="true">D35741</a></span>).
            I'm going to upload the rebased and updated versions once
            they're ready for review.</div>
          <div>The actual use/update is currently disabled by default
            (flag to enable: EnableMSSALoopDependency).</div>
          <div><br>
          </div>
          <div>I also recently became aware of EarlyCSE using MemorySSA.
            There are no correctness issues with that but there may be
            missed optimizations in subsequent passes using MemorySSA (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51960" style="text-decoration-line:none" moz-do-not-send="true">D51960</a></span> for
            details and to avoid going even more off-topic) .</div>
          <div><br>
          </div>
          <div>Thanks,</div>
          <div>Alina</div>
          <div><br>
            ><br>
            > Philip</div>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF"> <br>
              <div class="m_4777266497825329610moz-cite-prefix">On
                09/13/2018 02:00 PM, Alina Sbirlea wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">For context, I've been looking into
                  replacing the use of AST (AliasSetTracker) with
                  MemorySSA in LICM. This works great for
                  sinking/hoisting but does not apply well for
                  promotion, and one of the solutions I considered is
                  precisely ripping out the promotion part and replacing
                  its functionality with a separate PRE pass + possibly
                  store sinking. FWIW I think that's the right way to
                  go.
                  <div>I did not get deep enough into working on the
                    solution but I would gladly have a detailed
                    discussion to move this forward.</div>
                  <div><br>
                  </div>
                  <div>Reading into detail now.</div>
                  <div><br>
                  </div>
                  <div>Thanks,</div>
                  <div>Alina</div>
                </div>
                <br>
                <div class="gmail_quote">
                  <div dir="ltr">On Thu, Sep 13, 2018 at 1:43 PM Philip
                    Reames <<a
                      href="mailto:listmail@philipreames.com"
                      target="_blank" moz-do-not-send="true">listmail@philipreames.com</a>>
                    wrote:<br>
                  </div>
                  <blockquote class="gmail_quote" style="margin:0 0 0
                    .8ex;border-left:1px #ccc solid;padding-left:1ex">
                    <div text="#000000" bgcolor="#FFFFFF"> (minor inline
                      additions)<br>
                      <br>
                      On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>
                      <blockquote type="cite">
                        <div dir="ltr">Haven't had time to dig into
                          this, but wanted to add <a
                            class="m_4777266497825329610m_-2310710337215009520GWVZpf
m_4777266497825329610m_-2310710337215009520gW"
                            id="m_4777266497825329610m_-2310710337215009520IloFPc-1"
                            href="mailto:asbirlea@google.com"
                            target="_blank" moz-do-not-send="true">+Alina
                            Sbirlea</a> to the thread as she has been
                          working on promotion and other aspects of LICM
                          for a long time here.</div>
                      </blockquote>
                      Thanks!<br>
                      <blockquote type="cite">
                        <div class="gmail_quote">
                          <div dir="ltr">On Wed, Sep 12, 2018 at 11:41
                            PM Philip Reames <<a
                              href="mailto:listmail@philipreames.com"
                              target="_blank" moz-do-not-send="true">listmail@philipreames.com</a>>
                            wrote:<br>
                          </div>
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex">
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p>I'm thinking about making some
                                semi-radical changes to how load/store
                                promotion works in LICM, and figured it
                                would be a good idea to get buy-in
                                before I actually started writing code.  :)</p>
                              <p>TLDR: legality of sinking stores to
                                exits is hard; can we separate load
                                handling into a super-aggressive form of
                                PRE, and use predicated stores to avoid
                                solving the legality question?<br>
                              </p>
                              <p><br>
                              </p>
                              <p>Background<br>
                              </p>
                              <p>We've been seeing an interesting class
                                of problems recently that looks roughly
                                like this:</p>
                              <p>for (int i = 0; i < N; i++)<br>
                                  if (a[i] == 0) // some data dependent
                                check<br>
                                    g_count++; // conditional load and
                                store to shared location<br>
                              </p>
                              <p>The critical detail in this example is
                                that g_count is a global location which
                                may be accessed concurrently* by
                                multiple threads.  The challenge here is
                                that we don't know whether the store
                                ever executes, and we're not allowed to
                                insert a store along any path that
                                didn't originally contain them.  Said
                                differently, if all elements in "a" are
                                non-zero, we're not allowed to store to
                                g_count.  We do know that the g_count
                                location is dereferenceable though.  <br>
                              </p>
                              <p>(*Please, let's avoid the memory model
                                derailment here.  I'm simplifying, and
                                C++ language rules aren't really useful
                                for my Java language frontend anyways.
                                In practice, all the accesses are atomic
                                but unordered, but we'll leave that out
                                of the discussion otherwise.)</p>
                              <p>I have two approaches I'm considering
                                here.  These are orthogonal, but I
                                suspect we'll want to implement both.</p>
                              <p><br>
                              </p>
                              <p>Proposal A - Use predicated stores in
                                loop exits</p>
                              <p>The basic idea is that we don't worry
                                about solving the legality question
                                above, and just insert a store which is
                                predicated on a condition which is true
                                exactly when the original store ran.  In
                                pseudo code, this looks something like:</p>
                              <p>bool StoreLegal = false;<br>
                                int LocalCount = g_count;<br>
                                 for (int i = 0; i < N; i++)<br>
                                  if (a[i] == 0) {<br>
                                    LocalCount++;<br>
                                    StoreLegal = true;<br>
                                  }<br>
                                if (StoreLegal) g_count = LocalCount; <br>
                              </p>
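                              <p>A stand-alone check (toy driver, invented names) that the rewrite preserves behavior: both versions must leave the same value in g_count, and the predicated version must not store at all when no iteration took the conditional path.</p>

```c
static int g_count;

/* Original form: conditional load and store of the shared location on
 * every matching iteration. */
static void count_original(const int *a, int n) {
  for (int i = 0; i < n; i++)
    if (a[i] == 0)
      g_count++;
}

/* Proposal A: the load is promoted, and the sunk store is predicated
 * on a flag which is true exactly when the original store ran. */
static void count_predicated(const int *a, int n) {
  int store_legal = 0; /* the extra boolean phi around the loop */
  int local_count = g_count;
  for (int i = 0; i < n; i++)
    if (a[i] == 0) {
      local_count++;
      store_legal = 1;
    }
  if (store_legal)
    g_count = local_count;
}
```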
                              <p>There are two obvious concerns here:</p>
                              <ol>
                                <li>The predicated store might be
                                  expensive in practice - true for most
                                  current architectures.<br>
                                </li>
                                 <li>We're introducing an extra boolean
                                  phi cycle around the loop.  <br>
                                </li>
                              </ol>
                              <p>Talking to a couple of folks offline at
                                the socials over the last few months,
                                the former seems to be the main
                                objection.  I think we can control this
                                by restricting this transform to when
                                the loop is believed to have a high trip
                                count and the conditional path is taken
                                with some frequency.   Getting feedback
                                on this particular point is one of my
                                main reasons for writing this mail.  <br>
                              </p>
                              <p>The second objection can frequently be
                                resolved by finding other expressions
                                which evaluate to the same boolean.  (In
                                this case, if LocalCount !=
                                LocalCountOrig assuming i doesn't
                                overflow.)  We already have a framework
                                with SCEV to do these replacements. 
                                Though, from some quick testing, it
                                would definitely need strengthening. 
                                However, SCEV can't remove the extra phi
                                in all cases, so we have to be okay with
                                the extra phi cycle in the general
                                case.  This seems worthwhile to me, but
                                thoughts?<br>
                              </p>
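                              <p>Concretely, under the no-overflow assumption the flag can be folded away entirely (sketch only; names invented):</p>

```c
static int g_count;

/* Same as the predicated form, but the StoreLegal flag is replaced by
 * the equivalent predicate LocalCount != LocalCountOrig, removing the
 * boolean phi.  Only valid while the counter cannot wrap back to its
 * starting value, i.e. while i doesn't overflow. */
static void count_no_flag(const int *a, int n) {
  int local_count_orig = g_count;
  int local_count = local_count_orig;
  for (int i = 0; i < n; i++)
    if (a[i] == 0)
      local_count++;
  if (local_count != local_count_orig)
    g_count = local_count;
}
```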
                              <p><br>
                              </p>
                              <p>Proposal B - Separate load and store
                                handling into distinct transforms</p>
                              <p>(For folks I've discussed this with
                                before, this part is all new.)</p>
                              <p>Thinking about the problem, it finally
                                occurred to me that we can decompose the
                                original example into two steps: getting
                                the loads out of the loop, and sinking
                                the stores out of the loop.  If we can
                                accomplish the former, but not the
                                later, we've still made meaningful
                                progress.  <br>
                              </p>
                              <p>So, what we'd essentially have is a
                                load only transformation which produces
                                this:<br>
                                int LocalCount = g_count;<br>
                                 for (int i = 0; i < N; i++)<br>
                                  if (a[i] == 0) {<br>
                                    LocalCount++;<br>
                                    g_count = LocalCount;<br>
                                  }</p>
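                              <p>The key legality property of this form is easy to state as a stand-alone check (toy driver, invented names): the store stays conditional, so a run in which no iteration takes the branch performs no store at all, while matching runs still produce the right final value.</p>

```c
static int g_count;

/* Proposal B, load half only: a single read of g_count before the
 * loop; the store remains inside the conditional, so no path gains a
 * store it didn't originally have. */
static void count_load_promoted(const int *a, int n) {
  int local_count = g_count;
  for (int i = 0; i < n; i++)
    if (a[i] == 0) {
      local_count++;
      g_count = local_count; /* still conditional, as in the source */
    }
}
```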
                              <p>At this point, we've reduced memory
                                traffic by half, and opened up the
                                possibility that other parts of the
                                optimizer can exploit the simplified
                                 form.  The last point is particularly
                                 interesting since we generally try to
                                canonicalize loads out of loops, and
                                much of the optimizer is tuned for a
                                form with as much as possible being loop
                                invariant.  As one example, completely
                                by accident, there's some work going on
                                in the LoopVectorizer right now to
                                handle such stores to loop invariant
                                addresses during vectorization.  Putting
                                the two pieces together would let us
                                fully vectorize this loop without
                                needing to sink the stores at all.   </p>
                              <p>In practice, the existing
                                implementation in LICM would cleanly
                                split along these lines with little
                                problem.  <br>
                              </p>
                              <p>One way of looking at this load
                                specific transform is as an extreme form
                                of PRE (partial redundancy
                                elimination).  Our current PRE
                                implementation doesn't handle cases this
                                complicated. <br>
                              </p>
                            </div>
                          </blockquote>
                        </div>
                      </blockquote>
                       It occurred to me later that simply framing the
                       new transform as a separate pass (LoopPRE) and
                       using the same AST + SSA construction approach
                       would be straightforward.  So, if folks think
                      that having an aggressive form of load PRE in LICM
                      is going a bit too far, it'd be easy to represent
                      as an optional separate pass.  I'd still prefer
                      having LICM contain the logic though.  <br>
                      <blockquote type="cite">
                        <div class="gmail_quote">
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex">
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p> </p>
                              <p>Thoughts?</p>
                            </div>
                            <div text="#000000" bgcolor="#FFFFFF">
                              <p>Philip<br>
                              </p>
                            </div>
                          </blockquote>
                        </div>
                      </blockquote>
                      <br>
                    </div>
                  </blockquote>
                </div>
              </blockquote>
              <br>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>