<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>I'm thinking about making some semi radical changes to load store

      promotion works in LICM, and figured it would be a good idea to

      get buy in before I actually started writing code.  :)</p>

    <p>TLDR: legality of sinking stores to exits is hard, can we

      separate load handling into a super aggressive form of PRE, and

      use predicated stores to avoid solving legality question?<br>

    </p>

    <p><br>

    </p>

    <p>Background<br>

    </p>

    <p>We've been seeing an interesting class of problems recently that

      looks roughly like this:</p>

    <p>for (int = 0; i < N; i++)<br>

        if (a[i] == 0) // some data dependent check<br>

          g_count++; // conditional load and store to shared location<br>

    </p>

    <p>The critical detail in this example is that g_count is a global

      location which may be accessed concurrently* by multiple threads. 

      The challenge here is that we don't know whether the store ever

      executes, and we're not allowed to insert a store along any path

      that didn't originally contain them.  Said differently, if all

      elements in "a" are non-zero, we're not allowed to store to

      g_count.  We do know that the g_count location is dereferenceable

      though.  <br>

    </p>

    <p>(*Please, let's avoid the memory model derailment here.  I'm

      simplifying and C++ language rules aren't real useful for my Java

      language frontend anyways.  In practice, all the access are

      atomic, but unordered, but we'll leave that out of discussion

      otherwise.)</p>

    <p>I have two approaches I'm considering here.  These are

      orthogonal, but I suspect we'll want to implement both.</p>

    <p><br>

    </p>

    <p>Proposal A - Use predicated stores in loop exits</p>

    <p>The basic idea is that we don't worry about solving the legality

      question above, and just insert a store which is predicated on a

      condition which is true exactly when the original store ran.  In

      pseudo code, this looks something like:</p>

    <p>bool StoreLegal = false;<br>

      int LocalCount = g_count;<br>

      for (int = 0; i < N; i++)<br>

        if (a[i] == 0) {<br>

          LocalCount++;<br>

          StoreLegal = true;<br>

        }<br>

      if (StoreLegal) g_count = LocalCount; <br>

    </p>

    <p>There are two obvious concerns here:</p>

    <ol>

      <li>The predicated store might be expensive in practice - true for

        most current architectures.<br>

      </li>

      <li>We''re introducing an extra boolean phi cycle around the

        loop.  <br>

      </li>

    </ol>

    <p>Talking to a couple of folks offline at the socials over the last

      few months, the former seems to be the main objection.  I think we

      can control this by restricting this transform to when the loop is

      believed to have a high trip count and the conditional path is

      taken with some frequency.   Getting feedback on this particular

      point is one of my main reasons for writing this mail.  <br>

    </p>

    <p>The second objection can frequently be resolved by finding other

      expressions which evaluate to the same boolean.  (In this case, if

      LocalCount != LocalCountOrig assuming i doesn't overflow.)  We

      already have a framework with SCEV to do these replacements. 

      Though, from some quick testing, it would definitely need

      strengthening.  However, SCEV can't remove the extra phi in all

      cases, so we have to be okay with the extra phi cycle in the

      general case.  This seems worthwhile to me, but thoughts?<br>

    </p>

    <p><br>

    </p>

    <p>Proposal B - Separate load and store handling into distinct

      transforms</p>

    <p>(For folks I've discussed this with before, this part is all

      new.)</p>

    <p>Thinking about the problem, it finally occurred to me that we can

      decompose the original example into two steps: getting the loads

      out of the loop, and sinking the stores out of the loop.  If we

      can accomplish the former, but not the later, we've still made

      meaningful progress.  <br>

    </p>

    <p>So, what'd we'd essentially have is a load only transformation

      which produces this:<br>

      int LocalCount = g_count;<br>

      for (int = 0; i < N; i++)<br>

        if (a[i] == 0) {<br>

          LocalCount++;<br>

          g_count = LocalCount;<br>

        }</p>

    <p>At this point, we've reduced memory traffic by half, and opened

      up the possibility that other parts of the optimizer can exploit

      the simplified form.  The last point is particular interesting

      since we generally try to canonicalize loads out of loops, and

      much of the optimizer is tuned for a form with as much as possible

      being loop invariant.  As one example, completely by accident,

      there's some work going on in the LoopVectorizer right now to

      handle such stores to loop invariant addresses during

      vectorization.  Putting the two pieces together would let us fully

      vectorize this loop without needing to sink the stores at all.   </p>

    <p>In practice, the existing implementation in LICM would cleanly

      split along these lines with little problem.  <br>

    </p>

    <p>One way of looking at this load specific transform is as an

      extreme form of PRE (partial redundancy elimination).  Our current

      PRE implementation doesn't handle cases this complicated. <br>

    </p>

    <p>Thoughts?</p>

    <p>Philip<br>

    </p>

  </body>

</html>