[PATCH] D8688: Update MergedLoadStoreMotion to use MemorySSA

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 24 23:48:24 PDT 2016


Also, just to give you some idea of the general update complexity: if you
don't bound the problem to some common cases, it's simply not possible to
make the update faster than from-scratch construction (i.e., it would be
simpler/easier to not preserve MemorySSA in those cases).



For the common cases, assuming you tell us which variables were removed
and which variables were added in which blocks:
1. If you only removed and replaced variables (i.e., literally no code
motion or insertion), it requires only the renaming pass. It could be
further optimized to not redo use-chain optimization for MemoryUses if you
can guarantee you only touched MemoryUses. For MemoryDefs, if you could
guarantee (and you probably can't) that none of the AA relationships have
changed, you would not have to redo use-chain optimization there either.
You can't guarantee this because BasicAA et al. have walkers with
hard-coded limits in them, so you may get better results simply by
shortening the chain of things they had to walk, even if there was, in
fact, no aliasing either way.
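
For example, a minimal sketch of case 1 (my own illustration, not taken
from the patch): replacing one load with another at the same point removes
a MemoryUse and creates a new one, and renaming alone re-establishes the
defining accesses:

  Before:
    1 = MemoryDef(0)
    store a
    MemoryUse(1)
    load %p

  After removing the old load, inserting load %q in its place, and
  rerunning renaming:
    1 = MemoryDef(0)
    store a
    MemoryUse(1)
    load %q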

2. If the CFG has not changed and you have added/removed variables: if
you added MemoryUses, and we stay with pruned SSA, we have to recompute
the IDF. We could minimize the size of what we calculate by tracking some
things related to where the last uses in the program are relative to the
iterated dominance frontier; it's complicated, however.
After recomputing the IDF, we also have to redo renaming. For inserted
MemoryUses, if we tracked the first/last version of each block (and
updated it through the creation/removal APIs), we would only have to
rename the parts of the dominator tree where the first/last version
changes.

(I.e., if we changed the first version in a block due to phi node
insertion, but the last version is the same, we do not have to rename the
children of that block; if we changed the last version but not the first
version, we only have to rename the children but not the current block.
A sketch of the first situation is below.)
For inserting stores, it's something close to this but a bit more
complicated.
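
To illustrate that first situation (my own sketch): suppose a new
MemoryPhi is inserted at the top of a block, changing the first version in
the block but not the last version that leaves it:

  BB:
    3 = MemoryPhi(1, 2)   ; newly inserted
    4 = MemoryDef(3)      ; was 4 = MemoryDef(1): first version changed
    store a
    ; the last version leaving BB is still 4, so the dominator-tree
    ; children of BB need no renaming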

3. If the CFG has changed, then without tracking a lot more under the
covers related to merge/join sets, it's not possible to make the update
faster than doing it from scratch.

The TL;DR of all this is that a general update mechanism is very hard to
make faster than from-scratch construction without giving it all kinds of
specific info, and in my experience figuring out that info is usually as
hard as figuring out the update algorithms.


On Fri, Jun 24, 2016 at 10:41 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

>
>
> On Fri, Jun 24, 2016, 5:16 PM Gerolf Hoflehner <ghoflehner at apple.com>
> wrote:
>
>> Gerolf added a comment.
>>
>> I'll take a more in-depth look into the load and store merge routines
>> also. At a first glance it seems one could just add a few MSSA hooks
>> rather than copy-paste-modify the code base.
>>
> It is possible, but the other version is O(n^3) and the MemorySSA
> version is O(n), so this is deliberate. It is not possible to make the
> other version better than O(n^2) without memory versions.
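>
> To make the linearity concrete, here is a rough illustration (my own
> sketch, not necessarily the exact check the patch performs): for a store
> in one arm of the diamond, you can decide whether anything else depends
> on it by looking at the use list of its MemoryDef, instead of querying
> AA against every other memory instruction:
>
>   Then:
>     1 = MemoryDef(0)
>     store %a           ; candidate: 1's only use is the MemoryPhi below
>   Else:
>     2 = MemoryDef(0)
>     store %a'
>   Merge:
>     3 = MemoryPhi(1, 2)
>
> Each store is examined by walking a use list rather than by pairwise AA
> queries, which is where the difference in complexity comes from.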
>
>
>
>> I also took a (late :-() look at the core MSSA design. Could you
>> outline the high-level MSSA design in more detail than
>> http://reviews.llvm.org/D7864?
>
>
> There are also a bunch of mailing list threads and the comment in
> memoryssa.h, but happy to expand on whatever for you :)
>
>
>> E.g., how does it compare/relate to
>> http://www.airs.com/dnovillo/Papers/mem-ssa.pdf? Thanks!
>
> Sure.
>
>
> This paper covers what was the first version of MemorySSA for GCC.
> Over time, we significantly simplified the design after getting years of
> experience with clients. GCC's current design is now identical to the
> LLVM one.
>
> Compared to the paper, in both implementations there are no kills
> anymore, and no partitioning. All loads and stores use and define a
> single MemorySSA variable instead of multiple ones representing heap
> partitions. There is a long discussion of why in one of the threads, but
> it boils down to: it's not worth making it more precise, for the cost you
> pay in time, memory, reasoning, and code complexity, given what clients
> really ended up wanting out of it.
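>
> To make that concrete (my own illustration, not from the paper or the
> patch): even two stores that provably do not alias still define the one
> variable, and a client that cares can ask the walker to look through the
> chain on demand:
>
>   1 = MemoryDef(0)
>   store i32 0, i32* %p
>   2 = MemoryDef(1)       ; chained on 1 even if %p and %q never alias;
>   store i32 0, i32* %q   ; the walker can refine the def-def/def-use
>                          ; relationship when a pass asks for it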
>
> This pass is a good example. Having multiple variables would not make
> it faster or more precise for load or store merging, and may in fact
> make it much slower. For passes that care, the walker interface provides
> a way to further disambiguate def-def and def-use chains.
>
>
>> -Gerolf
>>
>>
>> ================
>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:346
>> @@ -271,1 +345,3 @@
>>    HoistedInst->insertBefore(HoistPt);
>> +  if (MSSA) {
>> +    // We also hoist operands of loads using this function, so check to see if
>> ----------------
>> Ok, it would probably not be possible to foresee all use cases. I
>> certainly agree with the iterative approach you propose. I'm surprised,
>> though, that a simple replacement of a load has not come up elsewhere.
>>
>
> This is the first major pass conversion outside of EarlyCSE. I am
> working pass by pass to preserve it for GVN.
>
> Note that this is also not simple load replacement; it's merging
> multiple loads into a single one by inserting a new load and removing
> the old ones. However, the operations performed are exactly what happens
> to the non-MemorySSA IR: you insert a new load and remove the two old
> ones.
>
>>
>> ================
>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:1032
>> @@ +1031,3 @@
>> +
>> +void MergedLoadStoreMotion::updateMemorySSAForSunkStores(
>> +    BasicBlock *SinkLocation, StoreInst *S0, StoreInst *S1, StoreInst
>> *SNew) {
>> ----------------
>>
>> What got me here is that I expected the update to be simple.
>
>
> It would be if the stores were not may defs. Because they are, there may
> be more of them below in the same block that you are not touching.
>
> I.e.:
>
>   1 = MemoryDef(0)
>   store a
>   2 = MemoryDef(1)
>   store b
>
> When you move a, you must update b's defining access to 0.
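>
> After sinking store a out of the block, store b's MemoryDef is rewired
> to the version that was live above the removed def:
>
>   2 = MemoryDef(0)
>   store b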
>
> It is not possible to avoid this in all cases. Even if you had multiple
> MemorySSA variables, there is no perfect partitioning of the heap that
> guarantees you won't have such a use.
>
> If you removed the defining access of may-defs, you would not easily
> know how far down a given MemoryDef version is valid, or be able to tell
> which may-defs might kill your store. In the above design, finding the
> lowest sink points and uses for a given store is easy, because you can
> walk the use chains. This makes a number of otherwise quadratic
> optimization algorithms linear.
>
>
>
>> Why is this more involved than replacing two MDs with one, possibly
>> removing a MP?
>
>
> In the simple case, where there are no stores below you, it is, in
> fact, that :)
>
> The other complication is that a MemoryPhi may already exist in the
> block below you (imagine our store a / store b case on both sides of the
> diamond; a phi will exist in the merge block, and your insertion point
> is after the phi).
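>
> Roughly (my own sketch of that situation):
>
>   Then:
>     1 = MemoryDef(0)
>     store a
>     2 = MemoryDef(1)
>     store b
>   Else:
>     3 = MemoryDef(0)
>     store a
>     4 = MemoryDef(3)
>     store b
>   Merge:
>     5 = MemoryPhi(2, 4)   ; already exists; the sunk store's new
>                           ; MemoryDef has to go after this phi, and the
>                           ; phi's incoming accesses have to be updated
>                           ; once the old defs are removed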
>
>> From the client's perspective, updates like this look too complicated
>> to get right.
>
> FWIW: I disagree, but I admit I've written the MemorySSA update
> routines for a ton of both GCC and LLVM passes. This is not materially
> different from the IR update you have to perform if you sink stores and
> their operands with PRE.
> This is pretty much as complicated as it gets, as well, because it's
> code motion, not just replacement with existing values.
>
>> Maybe there is a verifier in place, but in the code I don't even see a
>> single assertion that could help debugging issues should something go
>> wrong.
>
> There is a verifier that is run in both the updating unit tests and the
> test cases for this pass, which verify that the updating is correct.
> Past that, the removal and insertion routines will assert if you do
> something that is obviously wrong, much more than the IR does. The
> verifier in both cases is used to catch screwups that cost too much to
> verify at all times. The verifier can be called at any time, and I'm
> happy to add a call to it here under XDEBUG if you like.
>
>> Perhaps we can come up with some utility routines that make it simpler
>> to grasp: when two or more memory operations get moved, this is the
>> blueprint for updating MSSA and guaranteeing it stays consistent.
>>
>
> I think we could in general, but probably not for this kind of specialized
> code motion.
>
> The simple CSE case is already pretty simple (for loads, you do nothing
> but remove the dead load; for stores, you replaceAllUsesWith your
> defining access, then remove the dead store).
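>
> For example, a minimal sketch of the store case above:
>
>   1 = MemoryDef(0)
>   store a
>   2 = MemoryDef(1)
>   store a           ; redundant: RAUW 2 -> 1, then remove 2 and the store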
>
> Note that only stores are ever complex to reason about when updating.
> Because loads are just uses, they never require much, because doing
> things to them can't change the reaching versions or require new phi
> nodes (unless you sink loads instead of hoisting them, because both
> MemorySSA and the normal IR are pruned SSA, so they eliminate phi nodes
> that have no uses. We could just turn that off if we wanted to guarantee
> no load sinking would ever require a new phi node, but I have not hit a
> pass that wants to sink loads).
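>
> A sketch of that load-sinking corner case (my own hypothetical example):
> because the form is pruned, a merge block with no memory accesses has no
> MemoryPhi, so sinking a load into it forces one to be created:
>
>   Then:
>     1 = MemoryDef(0)
>     store a
>     MemoryUse(1)
>     load %p            ; to be sunk into Merge
>   Else:
>     ; no memory accesses
>   Merge:
>     ; no MemoryPhi yet; once the load is sunk here, it needs
>     ; 2 = MemoryPhi(1, 0) as its defining access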
>
> For stores, general downward code motion can get complex enough that it
> may be simpler and easier to not try to preserve the analysis. This is
> true of normal SSA as well. This pass is pretty much the limit of what
> I'd try to preserve rather than have it just recalculate MemorySSA. The
> set of update cases here is fixed and formally provable.
>
> But thoughts definitely welcome.
>
>>
>> ================
>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:1099
>> @@ +1098,3 @@
>> +
>> +  BasicBlock *MD0Block = MD0->getBlock();
>> +  for (auto UI = MD0->use_begin(), UE = MD0->use_end(); UI != UE;) {
>> ----------------
>> There could be a lambda or function that takes care of the two loops.
>>
>
>
>
>> http://reviews.llvm.org/D8688
>>
>>
>>
>>

