<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Sep 19, 2018 at 10:06 AM Philip Reames via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p><br>
</p>
<br>
<div class="m_4704851736071475128moz-cite-prefix">On 09/18/2018 02:50 PM, Alina Sbirlea
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_quote">
<div dir="ltr">On Fri, Sep 14, 2018 at 4:25 PM Philip Reames
<<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>This is going OT from the original thread, but, what
the heck...</p>
</div>
</blockquote>
<div>Sorry, not my intention, I was just giving another reason
why getting promotion done in LICM differently would be
helpful. </div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>Alina, can you explain the challenge with implementing
promotion over MemorySSA? On the surface, it seems like
it should be fairly straight forward to provide an alias
set like abstraction over MemorySSA. What am I missing?</p>
</div>
</blockquote>
<div>Yes, it is straight forward to have alias sets on top of
MemorySSA, but it will likely lose most of MemorySSA's
benefits. <br>
</div>
</div>
</div>
</blockquote>
Ah, understood. I think this was the core source of our confusion.
I was thinking "how to we migrate to using MemorySSA". You were
thinking "how do we exploit memory ssa". :)<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>Here's a sketch of how I'd think to approach this:</p>
<p>Visit the MemoryPhi nodes in a loop. Every interesting
promotion case (mod in loop) must be involved in at
least one MemoryPhi cycle. <br>
</p>
<p>For each MemoryPhi in the header, create a set of all
MemoryDef instructions involved. (I'm saying this in
terms of instructions, but sets of MemoryLocations
should be equivalent.)<br>
</p>
<p>For each instruction in loop, identify it's (optimized)
memory def. If that def is outside loop, and we're
looking at a load/store, then we can trivially
hoist/sink (aliasing wise only). If the def is in the
loop, it must be involved in one of our cycles, add it
to related alias set.</p>
<p>When I last looked, we were debating whether MemoryPhis
were optimized by default. Did we end up deciding they
weren't? If so, I can definitely see the problem
there. Even then, constructing an AST for only the
instructions involved in a loop MemoryPhi cycle should
be pretty cheap. <br>
</p>
</div>
</blockquote>
<div>AFAIK there is no notion of MemoryPhis being optimized.
Each block has a *single* MemoryPhi, which has the last Def
from each incoming block. MemoryDefs and MemoryUses can be
optimized through Phis. MemoryDefs are also not optimized
from the start, only MemoryUses are.</div>
<div><br>
</div>
<div>All MemoryDefs build a single def chain, regardless of
whether they alias or not. Doing alias sets means checking
alias() at least for all pairs of Defs, since alias relation
is not transitive. That is, even if we optimize all
MemoryDefs, we may still need additional alias() calls to
build those sets. But the cheaper way here is not to
optimize all Defs, but instead do alias/modref calls only
for the Def we're processing currently to determine if
*that* can be moved (+ cache those results in some new form
of alias sets if a "leave this in the loop" answer is
received).</div>
<div><br>
</div>
<div>We may also need additional alias() checks for Uses,
since, again, due to non-transitivity, a Use's clobbering
access may be a Def with which it MayAlias (let's call it
D1), but that D1 may be NoAlias with some D2, giving no info
on the (Use, D2) alias relation. If NoAlias, D2 can be
promoted, if MustAlias, the set (Use, D2) can be promoted,
if MayAlias, we cannot promote anything.</div>
</div>
</div>
</blockquote>
Putting the above in my own words to confirm understanding:<br>
<br>
We can form alias sets by starting with all memory defs in the loop
(which must be part of a single MemoryDef/MemoryPhi cycle).
Applying existing AS::add to these instructions gives us the set of
possibly modifying alias sets (but without ref info yet). We can
then visit each MemoryUse and add it to the alias set of it's
(optimized) MemoryDef if that def is in the loop, or introduce a new
ref only AS if not. <br></div></blockquote><div><br></div><div>AFAICT, that would be a correct implementation.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>The AliasSetTracker has a cap after which it gives up and
merges all sets, and so will an implementation of sets with
MemorySSA, since the cost is the same: lots of alias/modref
calls. Granted, MemorySSA has fewer, especially for Uses,
but we will still need a cap for pathological cases, while
getting benefits for small loops.<br>
</div>
</div>
</div>
</blockquote>
Agreed.<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<div>Trying to get back to the original topic. The reason
promotion is so difficult in the current form, is that it
promotes a whole AliasSet. And that AliasSet is built based
on conservative alias decisions. If we were to split the
promotion problem into a PRE for loads, then sink for
stores, the problem becomes more manageable as far as the
number of alias calls; we'd process one memory-accessing
instruction at a time, and, with the right processing order,
MemorySSA should be a better fit.</div>
</div>
</div>
</blockquote>
Er, huh? We've already hoisted/sunk all of the loads which would
form AliasSets w/o Mod set. All that's left by the point we're
doing promotion are alias sets with Mods and (optionally) Refs.
Aren't we pretty much stuck with doing the precise aliasing to
ensure MustAlias at that point? Or is your point that we can delay
the heavy weight work until after hoist/sink has been done? If so,
then yes, I agree. <br></div></blockquote><div><br></div><div>Yes, we would ideally delay this until after hoist/sink is done. It's also why in the draft promotion I had, I was attempting to build alias sets just before promotion, not when LICM starts.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
<br>
p.s. I added store hoisting to LICM recently which does require
mustalias mod information during hoisting. This could be separate
into a later transform easily though.<br></div></blockquote><div> </div><div>Yes, I noticed. It's the one thing MemorySSA cannot match right now for sinking due to needing alias sets :) (before this change, it was just promotion).</div><div>But it is a well justified change for the AST; if it spent all that time computing the sets already, might as well use that.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p> </p>
<p><br>
</p>
<p>Separately, can you summarize what the overall status
of MemorySSA is? I've lost track. It looks like it's
enabled by default for EarlyCSE. So, does that mean we
believe the bugs have been worked out and we "just" need
to plumb it through the pipeline? <br>
</p>
</div>
</blockquote>
<div>Sure!</div>
<div><br>
</div>
<div>MemorySSA is fairly stable right now (i.e. no known
bugs). I've been working on extending the APIs for updating
it (e.g. <span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45299" style="text-decoration-line:none" target="_blank">D45299</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48396" style="text-decoration-line:none" target="_blank">D48396</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D48897" style="text-decoration-line:none" target="_blank">D48897</a></span>),
and added the plumbing to update it in the Utils (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D48790" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D48790</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D45300" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D45300</a></span>)
and through the loop pass pipeline (<span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50906" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D50906</a>, </span><span style="font-family:Arial;font-size:11pt;white-space:pre-wrap;text-decoration-line:underline;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline"><a href="https://reviews.llvm.org/D50911" style="font-family:Arial;font-size:14.6667px;white-space:pre-wrap;text-decoration-line:none" target="_blank">D50911</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D45301" style="text-decoration-line:none" target="_blank">D45301</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D47022" style="text-decoration-line:none" target="_blank">D47022</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51718" style="text-decoration-line:none" target="_blank">D51718</a></span>). </div>
<div>LICM is the loop pass I've picked back up now, and the
first that also uses MemorySSA vs just updating it. Patches
in Phabricator are fairly old (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D40375" style="text-decoration-line:none" target="_blank">D40375</a>, </span><span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D35741" style="text-decoration-line:none" target="_blank">D35741</a></span>).
I'm going to upload the rebased and updated versions once
they're ready for review.</div>
<div>The actual use/update is currently disabled by default
(flag to enable: EnableMSSALoopDependency).</div>
<div><br>
</div>
<div>I also recently became aware of EarlyCSE using MemorySSA.
There are no correctness issues with that but there may be
missed optimizations in subsequent passes using MemorySSA (<span style="text-decoration-line:underline;font-size:11pt;font-family:Arial;background-color:transparent;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><a href="https://reviews.llvm.org/D51960" style="text-decoration-line:none" target="_blank">D51960</a></span> for
details and to avoid going even more off-topic) .</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Alina</div>
<div><br>
><br>
> Philip</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> <br>
<div class="m_4704851736071475128m_4777266497825329610moz-cite-prefix">On
09/13/2018 02:00 PM, Alina Sbirlea wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">For context, I've been looking into
replacing the use of AST (AliasSetTracker) with
MemorySSA in LICM. This works great for
sinking/hoisting but does not apply well for
promotion, and one of the solutions I considered is
precisely ripping out the promotion part and replacing
its functionality with a separate PRE pass + possibly
store sinking. FWIW I think that's the right way to
go.
<div>I did not get deep enough into working on the
solution but I would gladly have a detailed
discussion to move this forward.</div>
<div><br>
</div>
<div>Reading into detail now.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Alina</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Thu, Sep 13, 2018 at 1:43 PM Philip
Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> (minor inline
additions)<br>
<br>
On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>
<blockquote type="cite">
<div dir="ltr">Haven't had time to dig into
this, but wanted to add <a class="m_4704851736071475128m_4777266497825329610m_-2310710337215009520GWVZpf
m_4777266497825329610m_-2310710337215009520gW" id="m_4704851736071475128m_4777266497825329610m_-2310710337215009520IloFPc-1" href="mailto:asbirlea@google.com" target="_blank">+Alina
Sbirlea</a> to the thread as she has been
working on promotion and other aspects of LICM
for a long time here.</div>
</blockquote>
Thanks!<br>
<blockquote type="cite">
<div class="gmail_quote">
<div dir="ltr">On Wed, Sep 12, 2018 at 11:41
PM Philip Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p>I'm thinking about making some semi
radical changes to load store promotion
works in LICM, and figured it would be a
good idea to get buy in before I
actually started writing code. :)</p>
<p>TLDR: legality of sinking stores to
exits is hard, can we separate load
handling into a super aggressive form of
PRE, and use predicated stores to avoid
solving legality question?<br>
</p>
<p><br>
</p>
<p>Background<br>
</p>
<p>We've been seeing an interesting class
of problems recently that looks roughly
like this:</p>
<p>for (int = 0; i < N; i++)<br>
if (a[i] == 0) // some data dependent
check<br>
g_count++; // conditional load and
store to shared location<br>
</p>
<p>The critical detail in this example is
that g_count is a global location which
may be accessed concurrently* by
multiple threads. The challenge here is
that we don't know whether the store
ever executes, and we're not allowed to
insert a store along any path that
didn't originally contain them. Said
differently, if all elements in "a" are
non-zero, we're not allowed to store to
g_count. We do know that the g_count
location is dereferenceable though. <br>
</p>
<p>(*Please, let's avoid the memory model
derailment here. I'm simplifying and
C++ language rules aren't real useful
for my Java language frontend anyways.
In practice, all the access are atomic,
but unordered, but we'll leave that out
of discussion otherwise.)</p>
<p>I have two approaches I'm considering
here. These are orthogonal, but I
suspect we'll want to implement both.</p>
<p><br>
</p>
<p>Proposal A - Use predicated stores in
loop exits</p>
<p>The basic idea is that we don't worry
about solving the legality question
above, and just insert a store which is
predicated on a condition which is true
exactly when the original store ran. In
pseudo code, this looks something like:</p>
<p>bool StoreLegal = false;<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
StoreLegal = true;<br>
}<br>
if (StoreLegal) g_count = LocalCount; <br>
</p>
<p>There are two obvious concerns here:</p>
<ol>
<li>The predicated store might be
expensive in practice - true for most
current architectures.<br>
</li>
<li>We''re introducing an extra boolean
phi cycle around the loop. <br>
</li>
</ol>
<p>Talking to a couple of folks offline at
the socials over the last few months,
the former seems to be the main
objection. I think we can control this
by restricting this transform to when
the loop is believed to have a high trip
count and the conditional path is taken
with some frequency. Getting feedback
on this particular point is one of my
main reasons for writing this mail. <br>
</p>
<p>The second objection can frequently be
resolved by finding other expressions
which evaluate to the same boolean. (In
this case, if LocalCount !=
LocalCountOrig assuming i doesn't
overflow.) We already have a framework
with SCEV to do these replacements.
Though, from some quick testing, it
would definitely need strengthening.
However, SCEV can't remove the extra phi
in all cases, so we have to be okay with
the extra phi cycle in the general
case. This seems worthwhile to me, but
thoughts?<br>
</p>
<p><br>
</p>
<p>Proposal B - Separate load and store
handling into distinct transforms</p>
<p>(For folks I've discussed this with
before, this part is all new.)</p>
<p>Thinking about the problem, it finally
occurred to me that we can decompose the
original example into two steps: getting
the loads out of the loop, and sinking
the stores out of the loop. If we can
accomplish the former, but not the
later, we've still made meaningful
progress. <br>
</p>
<p>So, what'd we'd essentially have is a
load only transformation which produces
this:<br>
int LocalCount = g_count;<br>
for (int = 0; i < N; i++)<br>
if (a[i] == 0) {<br>
LocalCount++;<br>
g_count = LocalCount;<br>
}</p>
<p>At this point, we've reduced memory
traffic by half, and opened up the
possibility that other parts of the
optimizer can exploit the simplified
form. The last point is particular
interesting since we generally try to
canonicalize loads out of loops, and
much of the optimizer is tuned for a
form with as much as possible being loop
invariant. As one example, completely
by accident, there's some work going on
in the LoopVectorizer right now to
handle such stores to loop invariant
addresses during vectorization. Putting
the two pieces together would let us
fully vectorize this loop without
needing to sink the stores at all. </p>
<p>In practice, the existing
implementation in LICM would cleanly
split along these lines with little
problem. <br>
</p>
<p>One way of looking at this load
specific transform is as an extreme form
of PRE (partial redundancy
elimination). Our current PRE
implementation doesn't handle cases this
complicated. <br>
</p>
</div>
</blockquote>
</div>
</blockquote>
It occurred to my later that simply framing the
new transform as a separate pass (LoopPRE) and
using the same AST + SSA construction approach
would be straight forward. So, if folks think
that having an aggressive form of load PRE in LICM
is going a bit too far, it'd be easy to represent
as an optional separate pass. I'd still prefer
having LICM contain the logic though. <br>
<blockquote type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<p> </p>
<p>Thoughts?</p>
</div>
<div text="#000000" bgcolor="#FFFFFF">
<p>Philip<br>
</p>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
</div>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div></div>