[llvm-dev] Generalizing load/store promotion in LICM

Alina Sbirlea via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 24 14:58:01 PDT 2018


On Wed, Sep 19, 2018 at 10:06 AM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On 09/18/2018 02:50 PM, Alina Sbirlea wrote:
>
>
> On Fri, Sep 14, 2018 at 4:25 PM Philip Reames <listmail at philipreames.com>
> wrote:
>
>> This is going OT from the original thread, but, what the heck...
>>
> Sorry, not my intention, I was just giving another reason why getting
> promotion done in LICM differently would be helpful.
>
>> Alina, can you explain the challenge with implementing promotion over
>> MemorySSA?  On the surface, it seems like it should be fairly
>> straightforward to provide an alias-set-like abstraction over MemorySSA.
>> What am I missing?
>>
> Yes, it is straightforward to have alias sets on top of MemorySSA, but it
> will likely lose most of MemorySSA's benefits.
>
> Ah, understood.  I think this was the core source of our confusion.  I was
> thinking "how do we migrate to using MemorySSA".  You were thinking "how do
> we exploit MemorySSA".  :)
>
> Here's a sketch of how I'd think to approach this:
>>
>> Visit the MemoryPhi nodes in a loop.  Every interesting promotion case
>> (mod in loop) must be involved in at least one MemoryPhi cycle.
>>
>> For each MemoryPhi in the header, create a set of all MemoryDef
>> instructions involved.  (I'm saying this in terms of instructions, but sets
>> of MemoryLocations should be equivalent.)
>>
>> For each instruction in the loop, identify its (optimized) memory def.  If
>> that def is outside the loop, and we're looking at a load/store, then we
>> can trivially hoist/sink (aliasing-wise only).  If the def is in the loop,
>> it must be involved in one of our cycles; add it to the related alias set.
>>
>> When I last looked, we were debating whether MemoryPhis were optimized by
>> default.  Did we end up deciding they weren't?  If so, I can definitely see
>> the problem there.  Even then, constructing an AST for only the
>> instructions involved in a loop MemoryPhi cycle should be pretty cheap.
>>
> AFAIK there is no notion of MemoryPhis being optimized. Each block has a
> *single* MemoryPhi, which has the last Def from each incoming block.
> MemoryDefs and MemoryUses can be optimized through Phis. MemoryDefs are
> also not optimized from the start, only MemoryUses are.
>
> All MemoryDefs build a single def chain, regardless of whether they alias
> or not. Doing alias sets means checking alias() at least for all pairs of
> Defs, since the alias relation is not transitive. That is, even if we optimize
> all MemoryDefs, we may still need additional alias() calls to build those
> sets. But the cheaper way here is not to optimize all Defs, but instead do
> alias/modref calls only for the Def we're processing currently to determine
> if *that* can be moved (+ cache those results in some new form of alias
> sets if a "leave this in the loop" answer is received).
>
> We may also need additional alias() checks for Uses, since, again, due to
> non-transitivity, a Use's clobbering access may be a Def with which it
> MayAlias (let's call it D1), but that D1 may be NoAlias with some D2,
> giving no info on the (Use, D2) alias relation. If NoAlias, D2 can be
> promoted, if MustAlias, the set (Use, D2) can be promoted, if MayAlias, we
> cannot promote anything.
>
> Putting the above in my own words to confirm understanding:
>
> We can form alias sets by starting with all memory defs in the loop (which
> must be part of a single MemoryDef/MemoryPhi cycle).  Applying existing
> AS::add to these instructions gives us the set of possibly modifying alias
> sets (but without ref info yet).  We can then visit each MemoryUse and add
> it to the alias set of its (optimized) MemoryDef if that def is in the
> loop, or introduce a new ref-only AS if not.
>

AFAICT, that would be a correct implementation.
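For illustration, here is one possible toy sketch of that scheme. This uses invented types (`Access`, a stubbed alias oracle, a small union-find) rather than the real MemorySSA or AliasSetTracker APIs: pairwise alias() checks merge the loop's defs (the relation is not transitive, so every pair must be checked), then each MemoryUse joins its clobbering def's set if that def is in the loop, or gets a new ref-only set otherwise.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <vector>

// Toy model only: these are invented structures, not MemorySSA's real types.
enum class Alias { No, May, Must };

struct Access {
  int id;
  bool isDef;    // MemoryDef (writes) vs. MemoryUse (reads)
  int clobberId; // id of the optimized clobbering access; -1 if outside loop
};

// Union-find used to merge aliasing defs into a single set.
struct UnionFind {
  std::map<int, int> parent;
  int find(int x) {
    if (!parent.count(x))
      parent[x] = x;
    return parent[x] == x ? x : parent[x] = find(parent[x]);
  }
  void merge(int a, int b) { parent[find(a)] = find(b); }
};

// Pairwise alias() over the loop's defs, then attach each use to its
// clobbering def's set (if in the loop) or to a new ref-only singleton.
std::map<int, std::set<int>>
buildSets(const std::vector<Access> &accesses,
          const std::function<Alias(int, int)> &aliasFn) {
  UnionFind uf;
  std::vector<int> defs;
  for (const Access &a : accesses)
    if (a.isDef)
      defs.push_back(a.id);
  for (std::size_t i = 0; i < defs.size(); i++)
    for (std::size_t j = i + 1; j < defs.size(); j++)
      if (aliasFn(defs[i], defs[j]) != Alias::No)
        uf.merge(defs[i], defs[j]);
  std::map<int, std::set<int>> sets;
  for (const Access &a : accesses) {
    if (a.isDef)
      sets[uf.find(a.id)].insert(a.id);
    else if (a.clobberId >= 0)
      sets[uf.find(a.clobberId)].insert(a.id); // clobber is in the loop
    else
      sets[uf.find(a.id)].insert(a.id); // clobber outside: ref-only set
  }
  return sets;
}
```

Note the O(defs^2) inner loop: that is exactly the cost the cap discussed below is meant to bound.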

>
>
> The AliasSetTracker has a cap after which it gives up and merges all sets,
> as will an implementation of sets with MemorySSA, since the cost is the
> same: lots of alias/modref calls. Granted, MemorySSA has fewer, especially
> for Uses, but we will still need a cap for pathological cases, while
> getting benefits for small loops.
>
> Agreed.
>
>
> Trying to get back to the original topic. The reason promotion is so
> difficult in the current form, is that it promotes a whole AliasSet. And
> that AliasSet is built based on conservative alias decisions. If we were to
> split the promotion problem into PRE for loads, then sinking for stores, the
> problem becomes more manageable in terms of the number of alias calls; we'd
> process one memory-accessing instruction at a time, and, with the right
> processing order, MemorySSA should be a better fit.
>
> Er, huh?  We've already hoisted/sunk all of the loads which would form
> AliasSets w/o Mod set.  All that's left by the point we're doing promotion
> are alias sets with Mods and (optionally) Refs.  Aren't we pretty much
> stuck with doing the precise aliasing to ensure MustAlias at that point?
> Or is your point that we can delay the heavy weight work until after
> hoist/sink has been done?  If so, then yes, I agree.
>

Yes, we would ideally delay this until after hoist/sink is done. That's also
why, in the draft promotion I had, I was attempting to build alias sets just
before promotion, not when LICM starts.


> p.s. I added store hoisting to LICM recently, which does require must-alias
> mod information during hoisting.  This could easily be separated into a
> later transform though.
>

Yes, I noticed. It's the one thing MemorySSA cannot match right now for
sinking due to needing alias sets :) (before this change, it was just
promotion).
But it is a well-justified change for the AST; since it already spent all that
time computing the sets, it might as well use them.

>
>
>> Separately, can you summarize what the overall status of MemorySSA is?
>> I've lost track.  It looks like it's enabled by default for EarlyCSE.  So,
>> does that mean we believe the bugs have been worked out and we "just" need
>> to plumb it through the pipeline?
>>
> Sure!
>
> MemorySSA is fairly stable right now (i.e. no known bugs). I've been
> working on extending the APIs for updating it (e.g. D45299
> <https://reviews.llvm.org/D45299>, D48396
> <https://reviews.llvm.org/D48396>, D48897
> <https://reviews.llvm.org/D48897>), and added the plumbing to update it
> in the Utils (D48790 <https://reviews.llvm.org/D48790>, D45300
> <https://reviews.llvm.org/D45300>) and through the loop pass pipeline (
> D50906 <https://reviews.llvm.org/D50906>, D50911
> <https://reviews.llvm.org/D50911>, D45301
> <https://reviews.llvm.org/D45301>, D47022
> <https://reviews.llvm.org/D47022>, D51718
> <https://reviews.llvm.org/D51718>).
> LICM is the loop pass I've picked back up now, and the first that also
> uses MemorySSA rather than just updating it. Patches in Phabricator are fairly old (
> D40375 <https://reviews.llvm.org/D40375>, D35741
> <https://reviews.llvm.org/D35741>). I'm going to upload the rebased and
> updated versions once they're ready for review.
> The actual use/update is currently disabled by default (flag to enable:
> EnableMSSALoopDependency).
>
> I also recently became aware of EarlyCSE using MemorySSA. There are no
> correctness issues with that but there may be missed optimizations in
> subsequent passes using MemorySSA (D51960
> <https://reviews.llvm.org/D51960> for details and to avoid going even
> more off-topic).
>
> Thanks,
> Alina
>
> >
> > Philip
>
>>
>> On 09/13/2018 02:00 PM, Alina Sbirlea wrote:
>>
>> For context, I've been looking into replacing the use of AST
>> (AliasSetTracker) with MemorySSA in LICM. This works great for
>> sinking/hoisting but does not apply well for promotion, and one of the
>> solutions I considered is precisely ripping out the promotion part and
>> replacing its functionality with a separate PRE pass + possibly store
>> sinking. FWIW I think that's the right way to go.
>> I did not get deep enough into working on the solution but I would gladly
>> have a detailed discussion to move this forward.
>>
>> Reading into detail now.
>>
>> Thanks,
>> Alina
>>
>> On Thu, Sep 13, 2018 at 1:43 PM Philip Reames <listmail at philipreames.com>
>> wrote:
>>
>>> (minor inline additions)
>>>
>>> On 09/13/2018 01:51 AM, Chandler Carruth wrote:
>>>
>>> Haven't had time to dig into this, but wanted to add +Alina Sbirlea
>>> <asbirlea at google.com> to the thread as she has been working on
>>> promotion and other aspects of LICM for a long time here.
>>>
>>> Thanks!
>>>
>>> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <
>>> listmail at philipreames.com> wrote:
>>>
>>>> I'm thinking about making some semi-radical changes to how load/store
>>>> promotion works in LICM, and figured it would be a good idea to get buy-in
>>>> before I actually started writing code.  :)
>>>>
>>>> TL;DR: legality of sinking stores to exits is hard; can we separate load
>>>> handling into a super-aggressive form of PRE, and use predicated stores
>>>> to avoid solving the legality question?
>>>>
>>>>
>>>> Background
>>>>
>>>> We've been seeing an interesting class of problems recently that looks
>>>> roughly like this:
>>>>
>>>> for (int i = 0; i < N; i++)
>>>>   if (a[i] == 0) // some data-dependent check
>>>>     g_count++; // conditional load and store to a shared location
>>>>
>>>> The critical detail in this example is that g_count is a global
>>>> location which may be accessed concurrently* by multiple threads.  The
>>>> challenge here is that we don't know whether the store ever executes, and
>>>> we're not allowed to insert a store along any path that didn't originally
>>>> contain one.  Said differently, if all elements in "a" are non-zero, we're
>>>> not allowed to store to g_count.  We do know that the g_count location is
>>>> dereferenceable though.
>>>>
>>>> (*Please, let's avoid the memory model derailment here.  I'm
>>>> simplifying, and C++ language rules aren't really useful for my Java
>>>> language frontend anyway.  In practice, all the accesses are atomic but
>>>> unordered, but we'll leave that out of the discussion otherwise.)
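To make the constraint concrete, here is a hedged, single-threaded sketch (all names illustrative). The store counter is pure instrumentation, standing in for what another thread could observe: naive promotion introduces a store on the all-nonzero path, which is exactly what is forbidden for a concurrently accessed location.

```cpp
#include <cassert>
#include <vector>

// Single-threaded model of the legality constraint. g_store_count counts
// stores to g_count; under concurrency, an introduced store could clobber
// another thread's update.
static int g_count = 0;
static int g_store_count = 0;

// The original loop: stores to g_count only when some a[i] == 0.
void original(const std::vector<int> &a) {
  for (int v : a)
    if (v == 0) {
      g_count = g_count + 1;
      ++g_store_count;
    }
}

// Naive promotion: hoist the load, unconditionally sink the store.
// The store now executes even when the original loop never stored.
void naive_promoted(const std::vector<int> &a) {
  int LocalCount = g_count;
  for (int v : a)
    if (v == 0)
      LocalCount++;
  g_count = LocalCount; // store introduced on the all-nonzero path
  ++g_store_count;
}
```

Running both on an all-nonzero array shows the difference: the original performs zero stores, the naive version performs one.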
>>>>
>>>> I have two approaches I'm considering here.  These are orthogonal, but
>>>> I suspect we'll want to implement both.
>>>>
>>>>
>>>> Proposal A - Use predicated stores in loop exits
>>>>
>>>> The basic idea is that we don't worry about solving the legality
>>>> question above, and just insert a store which is predicated on a condition
>>>> which is true exactly when the original store ran.  In pseudo code, this
>>>> looks something like:
>>>>
>>>> bool StoreLegal = false;
>>>> int LocalCount = g_count;
>>>> for (int i = 0; i < N; i++)
>>>>   if (a[i] == 0) {
>>>>     LocalCount++;
>>>>     StoreLegal = true;
>>>>   }
>>>> if (StoreLegal) g_count = LocalCount;
>>>>
>>>> There are two obvious concerns here:
>>>>
>>>>    1. The predicated store might be expensive in practice - true for
>>>>    most current architectures.
>>>>    2. We're introducing an extra boolean phi cycle around the loop.
>>>>
>>>> Talking to a couple of folks offline at the socials over the last few
>>>> months, the former seems to be the main objection.  I think we can control
>>>> this by restricting this transform to when the loop is believed to have a
>>>> high trip count and the conditional path is taken with some frequency.
>>>> Getting feedback on this particular point is one of my main reasons for
>>>> writing this mail.
>>>>
>>>> The second objection can frequently be resolved by finding other
>>>> expressions which evaluate to the same boolean.  (In this case, if
>>>> LocalCount != LocalCountOrig assuming i doesn't overflow.)  We already have
>>>> a framework with SCEV to do these replacements.  Though, from some quick
>>>> testing, it would definitely need strengthening.  However, SCEV can't
>>>> remove the extra phi in all cases, so we have to be okay with the extra phi
>>>> cycle in the general case.  This seems worthwhile to me, but thoughts?
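To sketch both variants side by side (names mirror the pseudocode above and are otherwise illustrative; the SCEV-guard form assumes LocalCount cannot wrap back to its original value):

```cpp
#include <cassert>
#include <vector>

// Proposal A: store predicated on a flag that is true exactly when the
// original store ran.
int run_with_flag(int g_count, const std::vector<int> &a, bool &stored) {
  bool StoreLegal = false;
  int LocalCount = g_count;
  for (int v : a)
    if (v == 0) {
      LocalCount++;
      StoreLegal = true;
    }
  if (StoreLegal) {
    g_count = LocalCount; // the only store, guarded by the flag
    stored = true;
  }
  return g_count;
}

// Variant: the boolean phi is replaced by the SCEV-derived condition
// LocalCount != LocalCountOrig (valid while LocalCount cannot wrap).
int run_with_scev_guard(int g_count, const std::vector<int> &a, bool &stored) {
  int LocalCountOrig = g_count; // snapshot replaces the boolean phi
  int LocalCount = g_count;
  for (int v : a)
    if (v == 0)
      LocalCount++;
  if (LocalCount != LocalCountOrig) {
    g_count = LocalCount;
    stored = true;
  }
  return g_count;
}
```

Both forms store exactly when the original conditional store would have executed, so neither introduces a write on the all-nonzero path.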
>>>>
>>>>
>>>> Proposal B - Separate load and store handling into distinct transforms
>>>>
>>>> (For folks I've discussed this with before, this part is all new.)
>>>>
>>>> Thinking about the problem, it finally occurred to me that we can
>>>> decompose the original example into two steps: getting the loads out of the
>>>> loop, and sinking the stores out of the loop.  If we can accomplish the
>>>> former, but not the later, we've still made meaningful progress.
>>>>
>>>> So, what we'd essentially have is a load-only transformation which
>>>> produces this:
>>>> int LocalCount = g_count;
>>>> for (int i = 0; i < N; i++)
>>>>   if (a[i] == 0) {
>>>>     LocalCount++;
>>>>     g_count = LocalCount;
>>>>   }
>>>>
>>>> At this point, we've reduced memory traffic by half, and opened up the
>>>> possibility that other parts of the optimizer can exploit the simplified
>>>> form.  The last point is particularly interesting since we generally try to
>>>> canonicalize loads out of loops, and much of the optimizer is tuned for a
>>>> form with as much as possible being loop invariant.  As one example,
>>>> completely by accident, there's some work going on in the LoopVectorizer
>>>> right now to handle such stores to loop invariant addresses during
>>>> vectorization.  Putting the two pieces together would let us fully
>>>> vectorize this loop without needing to sink the stores at all.
>>>>
>>>> In practice, the existing implementation in LICM would cleanly split
>>>> along these lines with little problem.
>>>>
>>>> One way of looking at this load specific transform is as an extreme
>>>> form of PRE (partial redundancy elimination).  Our current PRE
>>>> implementation doesn't handle cases this complicated.
>>>>
>>> It occurred to me later that simply framing the new transform as a
>>> separate pass (LoopPRE) and using the same AST + SSA construction approach
>>> would be straightforward.  So, if folks think that having an aggressive
>>> form of load PRE in LICM is going a bit too far, it'd be easy to represent
>>> as an optional separate pass.  I'd still prefer having LICM contain the
>>> logic though.
>>>
>>> Thoughts?
>>>>
>>>> Philip
>>>>
>>>
>>>
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>