<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>This is going OT from the original thread, but, what the heck...</p>

    <p>Alina, can you explain the challenge with implementing promotion

      over MemorySSA?  On the surface, it seems like it should be fairly

      straight forward to provide an alias set like abstraction over

      MemorySSA.  What am I missing?</p>

    <p>Here's a sketch of how I'd think to approach this:</p>

    <p>Visit the MemoryPhi nodes in a loop.  Every interesting promotion

      case (mod in loop) must be involved in at least one MemoryPhi

      cycle.  <br>

    </p>

    <p>For each MemoryPhi in the header, create a set of all MemoryDef

      instructions involved.  (I'm saying this in terms of instructions,

      but sets of MemoryLocations should be equivalent.)<br>

    </p>

    <p>For each instruction in loop, identify it's (optimized) memory

      def.  If that def is outside loop, and we're looking at a

      load/store, then we can trivially hoist/sink (aliasing wise

      only).  If the def is in the loop, it must be involved in one of

      our cycles, add it to related alias set.</p>

    <p>When I last looked, we were debating whether MemoryPhis were

      optimized by default.  Did we end up deciding they weren't?  If

      so, I can definitely see the problem there.  Even then,

      constructing an AST for only the instructions involved in a loop

      MemoryPhi cycle should be pretty cheap. <br>

    </p>

    <p><br>

    </p>

    <p>Separately, can you summarize what the overall status of

      MemorySSA is?  I've lost track.  It looks like it's enabled by

      default for EarlyCSE.  So, does that mean we believe the bugs have

      been worked out and we "just" need to plumb it through the

      pipeline?  <br>

    </p>

    <p>Philip<br>

    </p>

    <br>

    <div class="moz-cite-prefix">On 09/13/2018 02:00 PM, Alina Sbirlea

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CALV8tXPGY5_3w7ceazST71nqxnkLXZVbd1wWZDBEpwSdgR9vPg@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=utf-8">

      <div dir="ltr">For context, I've been looking into replacing the

        use of AST (AliasSetTracker) with MemorySSA in LICM. This works

        great for sinking/hoisting but does not apply well for

        promotion, and one of the solutions I considered is precisely

        ripping out the promotion part and replacing its functionality

        with a separate PRE pass + possibly store sinking. FWIW I think

        that's the right way to go.

        <div>I did not get deep enough into working on the solution but

          I would gladly have a detailed discussion to move this

          forward.</div>

        <div><br>

        </div>

        <div>Reading into detail now.</div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div>Alina</div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr">On Thu, Sep 13, 2018 at 1:43 PM Philip Reames

          <<a href="mailto:listmail@philipreames.com"

            moz-do-not-send="true">listmail@philipreames.com</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <div text="#000000" bgcolor="#FFFFFF"> (minor inline

            additions)<br>

            <br>

            On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>

            <blockquote type="cite">

              <div dir="ltr">Haven't had time to dig into this, but

                wanted to add <a class="m_-2310710337215009520GWVZpf

                  m_-2310710337215009520gW"

                  id="m_-2310710337215009520IloFPc-1"

                  href="mailto:asbirlea@google.com" target="_blank"

                  moz-do-not-send="true">+Alina Sbirlea</a> to the

                thread as she has been working on promotion and other

                aspects of LICM for a long time here.</div>

            </blockquote>

            Thanks!<br>

            <blockquote type="cite">

              <div class="gmail_quote">

                <div dir="ltr">On Wed, Sep 12, 2018 at 11:41 PM Philip

                  Reames <<a href="mailto:listmail@philipreames.com"

                    target="_blank" moz-do-not-send="true">listmail@philipreames.com</a>>

                  wrote:<br>

                </div>

                <blockquote class="gmail_quote" style="margin:0 0 0

                  .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  <div text="#000000" bgcolor="#FFFFFF">

                    <p>I'm thinking about making some semi radical

                      changes to load store promotion works in LICM, and

                      figured it would be a good idea to get buy in

                      before I actually started writing code.  :)</p>

                    <p>TLDR: legality of sinking stores to exits is

                      hard, can we separate load handling into a super

                      aggressive form of PRE, and use predicated stores

                      to avoid solving legality question?<br>

                    </p>

                    <p><br>

                    </p>

                    <p>Background<br>

                    </p>

                    <p>We've been seeing an interesting class of

                      problems recently that looks roughly like this:</p>

                    <p>for (int = 0; i < N; i++)<br>

                        if (a[i] == 0) // some data dependent check<br>

                          g_count++; // conditional load and store to

                      shared location<br>

                    </p>

                    <p>The critical detail in this example is that

                      g_count is a global location which may be accessed

                      concurrently* by multiple threads.  The challenge

                      here is that we don't know whether the store ever

                      executes, and we're not allowed to insert a store

                      along any path that didn't originally contain

                      them.  Said differently, if all elements in "a"

                      are non-zero, we're not allowed to store to

                      g_count.  We do know that the g_count location is

                      dereferenceable though.  <br>

                    </p>

                    <p>(*Please, let's avoid the memory model derailment

                      here.  I'm simplifying and C++ language rules

                      aren't real useful for my Java language frontend

                      anyways.  In practice, all the access are atomic,

                      but unordered, but we'll leave that out of

                      discussion otherwise.)</p>

                    <p>I have two approaches I'm considering here. 

                      These are orthogonal, but I suspect we'll want to

                      implement both.</p>

                    <p><br>

                    </p>

                    <p>Proposal A - Use predicated stores in loop exits</p>

                    <p>The basic idea is that we don't worry about

                      solving the legality question above, and just

                      insert a store which is predicated on a condition

                      which is true exactly when the original store

                      ran.  In pseudo code, this looks something like:</p>

                    <p>bool StoreLegal = false;<br>

                      int LocalCount = g_count;<br>

                      for (int = 0; i < N; i++)<br>

                        if (a[i] == 0) {<br>

                          LocalCount++;<br>

                          StoreLegal = true;<br>

                        }<br>

                      if (StoreLegal) g_count = LocalCount; <br>

                    </p>

                    <p>There are two obvious concerns here:</p>

                    <ol>

                      <li>The predicated store might be expensive in

                        practice - true for most current architectures.<br>

                      </li>

                      <li>We''re introducing an extra boolean phi cycle

                        around the loop.  <br>

                      </li>

                    </ol>

                    <p>Talking to a couple of folks offline at the

                      socials over the last few months, the former seems

                      to be the main objection.  I think we can control

                      this by restricting this transform to when the

                      loop is believed to have a high trip count and the

                      conditional path is taken with some frequency.  

                      Getting feedback on this particular point is one

                      of my main reasons for writing this mail.  <br>

                    </p>

                    <p>The second objection can frequently be resolved

                      by finding other expressions which evaluate to the

                      same boolean.  (In this case, if LocalCount !=

                      LocalCountOrig assuming i doesn't overflow.)  We

                      already have a framework with SCEV to do these

                      replacements.  Though, from some quick testing, it

                      would definitely need strengthening.  However,

                      SCEV can't remove the extra phi in all cases, so

                      we have to be okay with the extra phi cycle in the

                      general case.  This seems worthwhile to me, but

                      thoughts?<br>

                    </p>

                    <p><br>

                    </p>

                    <p>Proposal B - Separate load and store handling

                      into distinct transforms</p>

                    <p>(For folks I've discussed this with before, this

                      part is all new.)</p>

                    <p>Thinking about the problem, it finally occurred

                      to me that we can decompose the original example

                      into two steps: getting the loads out of the loop,

                      and sinking the stores out of the loop.  If we can

                      accomplish the former, but not the later, we've

                      still made meaningful progress.  <br>

                    </p>

                    <p>So, what'd we'd essentially have is a load only

                      transformation which produces this:<br>

                      int LocalCount = g_count;<br>

                      for (int = 0; i < N; i++)<br>

                        if (a[i] == 0) {<br>

                          LocalCount++;<br>

                          g_count = LocalCount;<br>

                        }</p>

                    <p>At this point, we've reduced memory traffic by

                      half, and opened up the possibility that other

                      parts of the optimizer can exploit the simplified

                      form.  The last point is particular interesting

                      since we generally try to canonicalize loads out

                      of loops, and much of the optimizer is tuned for a

                      form with as much as possible being loop

                      invariant.  As one example, completely by

                      accident, there's some work going on in the

                      LoopVectorizer right now to handle such stores to

                      loop invariant addresses during vectorization. 

                      Putting the two pieces together would let us fully

                      vectorize this loop without needing to sink the

                      stores at all.   </p>

                    <p>In practice, the existing implementation in LICM

                      would cleanly split along these lines with little

                      problem.  <br>

                    </p>

                    <p>One way of looking at this load specific

                      transform is as an extreme form of PRE (partial

                      redundancy elimination).  Our current PRE

                      implementation doesn't handle cases this

                      complicated. <br>

                    </p>

                  </div>

                </blockquote>

              </div>

            </blockquote>

            It occurred to my later that simply framing the new

            transform as a separate pass (LoopPRE) and using the same

            AST + SSA construction approach would be straight forward. 

            So, if folks think that having an aggressive form of load

            PRE in LICM is going a bit too far, it'd be easy to

            represent as an optional separate pass.  I'd still prefer

            having LICM contain the logic though.  <br>

            <blockquote type="cite">

              <div class="gmail_quote">

                <blockquote class="gmail_quote" style="margin:0 0 0

                  .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  <div text="#000000" bgcolor="#FFFFFF">

                    <p> </p>

                    <p>Thoughts?</p>

                  </div>

                  <div text="#000000" bgcolor="#FFFFFF">

                    <p>Philip<br>

                    </p>

                  </div>

                </blockquote>

              </div>

            </blockquote>

            <br>

          </div>

        </blockquote>

      </div>

    </blockquote>

    <br>

  </body>

</html>