<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>This is going OT from the original thread, but, what the heck...</p>
<p>Alina, can you explain the challenge with implementing promotion
over MemorySSA? On the surface, it seems like it should be fairly
straight forward to provide an alias set like abstraction over
MemorySSA. What am I missing?</p>
<p>Here's a sketch of how I'd think to approach this:</p>
<p>Visit the MemoryPhi nodes in a loop. Every interesting promotion
case (mod in loop) must be involved in at least one MemoryPhi
cycle. <br>
</p>
<p>For each MemoryPhi in the header, create a set of all MemoryDef
instructions involved. (I'm saying this in terms of instructions,
but sets of MemoryLocations should be equivalent.)<br>
</p>
<p>For each instruction in loop, identify it's (optimized) memory
def. If that def is outside loop, and we're looking at a
load/store, then we can trivially hoist/sink (aliasing wise
only). If the def is in the loop, it must be involved in one of
our cycles, add it to related alias set.</p>
<p>When I last looked, we were debating whether MemoryPhis were
optimized by default. Did we end up deciding they weren't? If
so, I can definitely see the problem there. Even then,
constructing an AST for only the instructions involved in a loop
MemoryPhi cycle should be pretty cheap. <br>
</p>
<p><br>
</p>
<p>Separately, can you summarize what the overall status of
MemorySSA is? I've lost track. It looks like it's enabled by
default for EarlyCSE. So, does that mean we believe the bugs have
been worked out and we "just" need to plumb it through the
pipeline? <br>
</p>
<p>Philip<br>
</p>
<br>
<div class="moz-cite-prefix">On 09/13/2018 02:00 PM, Alina Sbirlea
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CALV8tXPGY5_3w7ceazST71nqxnkLXZVbd1wWZDBEpwSdgR9vPg@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<div dir="ltr">For context, I've been looking into replacing the
use of AST (AliasSetTracker) with MemorySSA in LICM. This works
great for sinking/hoisting but does not apply well for
promotion, and one of the solutions I considered is precisely
ripping out the promotion part and replacing its functionality
with a separate PRE pass + possibly store sinking. FWIW I think
that's the right way to go.
<div>I did not get deep enough into working on the solution but
I would gladly have a detailed discussion to move this
forward.</div>
<div><br>
</div>
<div>Reading into detail now.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Alina</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Thu, Sep 13, 2018 at 1:43 PM Philip Reames
<<a href="mailto:listmail@philipreames.com"
moz-do-not-send="true">listmail@philipreames.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> (minor inline
additions)<br>
<br>
On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>
<blockquote type="cite">
<div dir="ltr">Haven't had time to dig into this, but
wanted to add <a class="m_-2310710337215009520GWVZpf
m_-2310710337215009520gW"
id="m_-2310710337215009520IloFPc-1"
href="mailto:asbirlea@google.com" target="_blank"
moz-do-not-send="true">+Alina Sbirlea</a> to the
thread as she has been working on promotion and other
aspects of LICM for a long time here.</div>
</blockquote>
Thanks!<br>
<blockquote type="cite">
<div class="gmail_quote">
<div dir="ltr">On Wed, Sep 12, 2018 at 11:41 PM Philip
Reames <<a href="mailto:listmail@philipreames.com"
target="_blank" moz-do-not-send="true">listmail@philipreames.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>I'm thinking about making some semi radical
changes to load store promotion works in LICM, and
figured it would be a good idea to get buy in
before I actually started writing code. :)</p>
<p>TLDR: legality of sinking stores to exits is
hard, can we separate load handling into a super
aggressive form of PRE, and use predicated stores
to avoid solving legality question?<br>
</p>
<p><br>
</p>
<p>Background<br>
</p>
<p>We've been seeing an interesting class of
problems recently that looks roughly like this:</p>
<p>for (int = 0; i < N; i++)<br>
if (a[i] == 0) // some data dependent check<br>
g_count++; // conditional load and store to
shared location<br>
</p>
<p>The critical detail in this example is that
g_count is a global location which may be accessed
concurrently* by multiple threads. The challenge
here is that we don't know whether the store ever
executes, and we're not allowed to insert a store
along any path that didn't originally contain
them. Said differently, if all elements in "a"
are non-zero, we're not allowed to store to
g_count. We do know that the g_count location is
dereferenceable though. <br>
</p>
<p>(*Please, let's avoid the memory model derailment
here. I'm simplifying and C++ language rules
aren't real useful for my Java language frontend
anyways. In practice, all the access are atomic,
but unordered, but we'll leave that out of
discussion otherwise.)</p>
<p>I have two approaches I'm considering here.
These are orthogonal, but I suspect we'll want to
implement both.</p>
<p><br>
</p>
<p>Proposal A - Use predicated stores in loop exits</p>
<p>The basic idea is that we don't worry about
solving the legality question above, and just
insert a store which is predicated on a condition
which is true exactly when the original store
ran. In pseudo code, this looks something like:</p>
<p>bool StoreLegal = false;<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
StoreLegal = true;<br>
}<br>
if (StoreLegal) g_count = LocalCount; <br>
</p>
<p>There are two obvious concerns here:</p>
<ol>
<li>The predicated store might be expensive in
practice - true for most current architectures.<br>
</li>
<li>We''re introducing an extra boolean phi cycle
around the loop. <br>
</li>
</ol>
<p>Talking to a couple of folks offline at the
socials over the last few months, the former seems
to be the main objection. I think we can control
this by restricting this transform to when the
loop is believed to have a high trip count and the
conditional path is taken with some frequency.
Getting feedback on this particular point is one
of my main reasons for writing this mail. <br>
</p>
<p>The second objection can frequently be resolved
by finding other expressions which evaluate to the
same boolean. (In this case, if LocalCount !=
LocalCountOrig assuming i doesn't overflow.) We
already have a framework with SCEV to do these
replacements. Though, from some quick testing, it
would definitely need strengthening. However,
SCEV can't remove the extra phi in all cases, so
we have to be okay with the extra phi cycle in the
general case. This seems worthwhile to me, but
thoughts?<br>
</p>
<p><br>
</p>
<p>Proposal B - Separate load and store handling
into distinct transforms</p>
<p>(For folks I've discussed this with before, this
part is all new.)</p>
<p>Thinking about the problem, it finally occurred
to me that we can decompose the original example
into two steps: getting the loads out of the loop,
and sinking the stores out of the loop. If we can
accomplish the former, but not the later, we've
still made meaningful progress. <br>
</p>
<p>So, what'd we'd essentially have is a load only
transformation which produces this:<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
g_count = LocalCount;<br>
}</p>
<p>At this point, we've reduced memory traffic by
half, and opened up the possibility that other
parts of the optimizer can exploit the simplified
form. The last point is particular interesting
since we generally try to canonicalize loads out
of loops, and much of the optimizer is tuned for a
form with as much as possible being loop
invariant. As one example, completely by
accident, there's some work going on in the
LoopVectorizer right now to handle such stores to
loop invariant addresses during vectorization.
Putting the two pieces together would let us fully
vectorize this loop without needing to sink the
stores at all. </p>
<p>In practice, the existing implementation in LICM
would cleanly split along these lines with little
problem. <br>
</p>
<p>One way of looking at this load specific
transform is as an extreme form of PRE (partial
redundancy elimination). Our current PRE
implementation doesn't handle cases this
complicated. <br>
</p>
</div>
</blockquote>
</div>
</blockquote>
It occurred to my later that simply framing the new
transform as a separate pass (LoopPRE) and using the same
AST + SSA construction approach would be straight forward.
So, if folks think that having an aggressive form of load
PRE in LICM is going a bit too far, it'd be easy to
represent as an optional separate pass. I'd still prefer
having LICM contain the logic though. <br>
<blockquote type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p> </p>
<p>Thoughts?</p>
</div>
<div text="#000000" bgcolor="#FFFFFF">
<p>Philip<br>
</p>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>