<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I'm thinking about making some semi radical changes to load store
promotion works in LICM, and figured it would be a good idea to
get buy in before I actually started writing code. :)</p>
<p>TLDR: legality of sinking stores to exits is hard, can we
separate load handling into a super aggressive form of PRE, and
use predicated stores to avoid solving legality question?<br>
</p>
<p><br>
</p>
<p>Background<br>
</p>
<p>We've been seeing an interesting class of problems recently that
looks roughly like this:</p>
<p>for (int = 0; i < N; i++)<br>
if (a[i] == 0) // some data dependent check<br>
g_count++; // conditional load and store to shared location<br>
</p>
<p>The critical detail in this example is that g_count is a global
location which may be accessed concurrently* by multiple threads.
The challenge here is that we don't know whether the store ever
executes, and we're not allowed to insert a store along any path
that didn't originally contain them. Said differently, if all
elements in "a" are non-zero, we're not allowed to store to
g_count. We do know that the g_count location is dereferenceable
though. <br>
</p>
<p>(*Please, let's avoid the memory model derailment here. I'm
simplifying and C++ language rules aren't real useful for my Java
language frontend anyways. In practice, all the access are
atomic, but unordered, but we'll leave that out of discussion
otherwise.)</p>
<p>I have two approaches I'm considering here. These are
orthogonal, but I suspect we'll want to implement both.</p>
<p><br>
</p>
<p>Proposal A - Use predicated stores in loop exits</p>
<p>The basic idea is that we don't worry about solving the legality
question above, and just insert a store which is predicated on a
condition which is true exactly when the original store ran. In
pseudo code, this looks something like:</p>
<p>bool StoreLegal = false;<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
StoreLegal = true;<br>
}<br>
if (StoreLegal) g_count = LocalCount; <br>
</p>
<p>There are two obvious concerns here:</p>
<ol>
<li>The predicated store might be expensive in practice - true for
most current architectures.<br>
</li>
<li>We''re introducing an extra boolean phi cycle around the
loop. <br>
</li>
</ol>
<p>Talking to a couple of folks offline at the socials over the last
few months, the former seems to be the main objection. I think we
can control this by restricting this transform to when the loop is
believed to have a high trip count and the conditional path is
taken with some frequency. Getting feedback on this particular
point is one of my main reasons for writing this mail. <br>
</p>
<p>The second objection can frequently be resolved by finding other
expressions which evaluate to the same boolean. (In this case, if
LocalCount != LocalCountOrig assuming i doesn't overflow.) We
already have a framework with SCEV to do these replacements.
Though, from some quick testing, it would definitely need
strengthening. However, SCEV can't remove the extra phi in all
cases, so we have to be okay with the extra phi cycle in the
general case. This seems worthwhile to me, but thoughts?<br>
</p>
<p><br>
</p>
<p>Proposal B - Separate load and store handling into distinct
transforms</p>
<p>(For folks I've discussed this with before, this part is all
new.)</p>
<p>Thinking about the problem, it finally occurred to me that we can
decompose the original example into two steps: getting the loads
out of the loop, and sinking the stores out of the loop. If we
can accomplish the former, but not the later, we've still made
meaningful progress. <br>
</p>
<p>So, what'd we'd essentially have is a load only transformation
which produces this:<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
g_count = LocalCount;<br>
}</p>
<p>At this point, we've reduced memory traffic by half, and opened
up the possibility that other parts of the optimizer can exploit
the simplified form. The last point is particular interesting
since we generally try to canonicalize loads out of loops, and
much of the optimizer is tuned for a form with as much as possible
being loop invariant. As one example, completely by accident,
there's some work going on in the LoopVectorizer right now to
handle such stores to loop invariant addresses during
vectorization. Putting the two pieces together would let us fully
vectorize this loop without needing to sink the stores at all. </p>
<p>In practice, the existing implementation in LICM would cleanly
split along these lines with little problem. <br>
</p>
<p>One way of looking at this load specific transform is as an
extreme form of PRE (partial redundancy elimination). Our current
PRE implementation doesn't handle cases this complicated. <br>
</p>
<p>Thoughts?</p>
<p>Philip<br>
</p>
</body>
</html>