[PATCH] D8688: Update MergedLoadStoreMotion to use MemorySSA

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Thu Jun 30 11:35:16 PDT 2016


Ping!

On Fri, Jun 24, 2016 at 11:48 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> Also, just to give you some idea of general update complexity: if you
> don't bound the problem to some common cases, it's simply not possible to
> make the update faster than creating MemorySSA from scratch (i.e., it
> would be simpler/easier to not preserve MemorySSA in those cases).
>
>
>
> For the common cases, assuming you tell us which variables were removed
> and which variables were added in which blocks:
>
> 1. If you only removed and replaced variables (i.e., literally no code
> motion or insertion), it requires the renaming pass only.  It could be
> further optimized to not redo use-chain optimization for MemoryUses if you
> can guarantee you only touched MemoryUses. For MemoryDefs, if you can
> guarantee (and you probably can't) that none of the AA relationships have
> changed, you would not have to redo use-chain optimization there either.
> You can't guarantee this because BasicAA et al. have walkers with arbitrary
> hard limits in them, so you may get better results simply by reducing the
> length of the chain of things they had to walk, even if there was, in fact,
> no aliasing either way.
>
> 2. If the CFG has not changed and you have added/removed variables: if you
> added MemoryUses and we stay with pruned SSA, we have to recompute the
> IDF. We could minimize the size of that computation by tracking some
> things related to where the last uses in the program are relative to the
> iterated dominance frontier, but it's complicated.
> After recomputing the IDF, we also have to redo renaming. For inserted
> MemoryUses, if we tracked the first/last version of each block (and
> updated it through the creation/removal APIs), we only have to rename the
> parts of the dom tree where the first/last version changes (there is a
> small sketch of this below, after point 3).
>
> (I.e., if we changed the first version in a block due to phi node
> insertion, but the last version is the same, we do not have to rename the
> children of that block.  If we changed the last version but not the first
> version, we only have to rename the children but not the current block.)
> For inserting stores, it's something close to this but a bit more
> complicated.
>
> 3. If the CFG has changed, then without tracking a lot more under the
> covers related to merge/join sets, it's not possible to make the update
> faster than doing it from scratch.
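>
> To make the first/last-version idea from point 2 concrete, here is a
> hand-written MemorySSA-style sketch (not compiler output):
>
>   bb:
>     3 = MemoryPhi(...)    <- first version in bb
>     MemoryUse(3)          ; a load
>     4 = MemoryDef(3)      ; a store
>     5 = MemoryDef(4)      ; a store, and the last version in bb
>
> If an update changes which access is first in bb but 5 is still the last
> version, everything dominated by bb still sees 5 as its incoming version,
> so only bb itself needs renaming, not its children.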
>
> The TL;DR of all this is that a general update mechanism is very hard to
> make faster than recomputation from scratch without giving it all kinds of
> specific info, and usually figuring out that info is as hard as figuring
> out the update algorithm, in my experience.
>
>
> On Fri, Jun 24, 2016 at 10:41 PM, Daniel Berlin <dberlin at dberlin.org>
> wrote:
>
>>
>>
>> On Fri, Jun 24, 2016, 5:16 PM Gerolf Hoflehner <ghoflehner at apple.com>
>> wrote:
>>
>>> Gerolf added a comment.
>>>
>>> I'll take a more in-depth look into the load and store merge routines
>>> also. At first glance it seems one could just add a few MSSA hooks
>>> rather than copy-paste-modify the code base.
>>>
>> It is possible, but the other version is O(n^3) and the MemorySSA version
>> is O(n), so this is deliberate. It is not possible to make the other
>> version better than O(n^2) without memory versions.
>>
>>
>>
>>> I also took a (late :-() look at the core MSSA design. Could you outline
>>> the high-level MSSA design in more detail than
>>> http://reviews.llvm.org/D7864? E.g.
>>
>>
>> There are also a bunch of mailing list threads and the comment in
>> memoryssa.h, but happy to expand on whatever for you :)
>>
>>
>> > how does it compare/relate to
>> > http://www.airs.com/dnovillo/Papers/mem-ssa.pdf?  Thanks!
>>
>> Sure.
>>
>>
>> This paper covers what was the first version of memoryssa for GCC.  Over
>> time, we significantly simplified the design after getting years of
>> experience with clients.   GCC's current design is now identical to the
>> llvm one.
>>
>> Compared to the paper, in both implementations, there are no kills
>> anymore, and no partitioning. All loads and stores use and define a single
>> memoryssa variable instead of multiple ones representing heap partitions.
>> There is a long discussion of why in one of the threads, but it boils down
>> to: it's not worth making it more precise for the cost you pay in time or
>> memory or reasoning or complexity of code, for what clients really ended up
>> wanting out of it.
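>>
>> A tiny hand-written example of what that single-variable form looks like
>> (made-up IR, not compiler output):
>>
>>   store i32 1, i32* %p     ; 1 = MemoryDef(liveOnEntry)
>>   store i32 2, i32* %q     ; 2 = MemoryDef(1)
>>   load i32, i32* %p        ; MemoryUse(2)
>>
>> Every access is versioned against the one memory variable; the load is
>> conservatively chained to the nearest def above it, and figuring out that
>> its real clobber is 1 (assuming %p and %q don't alias) is left to the
>> use-chain optimization and the walker rather than being encoded as
>> separate heap partitions.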
>>
>> This pass is a good example. Having multiple variables would not make it
>> faster or more precise for load or store merging, and may in fact make it
>> much slower.  For passes that care, the walker interface provides a way to
>> further disambiguate def-def and def-use chains.
>>
>>
>>
>>
>>
>>
>>> -Gerolf
>>>
>>>
>>> ================
>>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:346
>>> @@ -271,1 +345,3 @@
>>>    HoistedInst->insertBefore(HoistPt);
>>> +  if (MSSA) {
>>> +    // We also hoist operands of loads using this function, so check to see if
>>> ----------------
>>> Ok, it would probably not be possible to foresee all use cases. I
>>> certainly agree with the iterative approach you propose. I'm surprised,
>>> though, that a simple replacement of a load has not come up elsewhere.
>>>
>>
>> This is the first major pass conversion outside of EarlyCSE. I am working
>> pass by pass towards preserving it for GVN.
>>
>> Note that this is also not simple load replacement; it's merging multiple
>> loads into a single one by inserting a new load and removing the old ones.
>> However, the operations performed are exactly what happens to the
>> non-MemorySSA IR, which is that you are inserting a new load and removing
>> the two old ones.
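>>
>> A hand-written sketch of the shape of that update (made-up names, not
>> output from the pass):
>>
>>   ; before: the same load in both arms of the diamond
>>   bb.then:   %a = load %p    ; MemoryUse(1)
>>   bb.else:   %b = load %p    ; MemoryUse(1)
>>
>>   ; after hoisting into the common predecessor
>>   bb.entry:  %h = load %p    ; MemoryUse(1)
>>   ; the two old loads and their MemoryUses are erased
>>
>> In this simple shape only MemoryUses are created and destroyed, so no
>> versions or phis need to change.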
>>
>>>
>>> ================
>>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:1032
>>> @@ +1031,3 @@
>>> +
>>> +void MergedLoadStoreMotion::updateMemorySSAForSunkStores(
>>> +    BasicBlock *SinkLocation, StoreInst *S0, StoreInst *S1, StoreInst *SNew) {
>>> ----------------
>>>
>>> What got me here is that I expected the update to be simple.
>>
>>
>> It would be if the stores were not may-defs. Because they are, there may
>> be more of them below in the same block that you are not touching.
>>
>> E.g.:
>>
>>   1 = MemoryDef(0)   store a
>>   2 = MemoryDef(1)   store b
>>
>> When you move "store a", you must update b's defining access to 0.
>>
>> It is not possible to avoid this in all cases.   Even if you had multiple
>> memoryssa variables, there is no perfect partitioning of the heap that
>> guarantees you won't have a use.
>>
>> If you removed the defining access of may-defs, you would not easily know
>> how far a given MemoryDef's version remains valid, or be able to tell
>> which may-defs might kill your store.  In the above design, finding the
>> lowest sink points and uses for a given store is easy, because you can
>> walk the use chains.  This makes a number of otherwise-quadratic
>> optimization algorithms linear.
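>>
>> As a rough sketch of why that walk is cheap (illustrative code only, not
>> part of the patch; it assumes the header location and helper names of the
>> time, and a MemoryDef* for the store in question):
>>
>>   #include "llvm/ADT/SmallVector.h"
>>   #include "llvm/Transforms/Utils/MemorySSA.h"
>>   using namespace llvm;
>>
>>   // Everything that may depend on this store hangs off the use list of
>>   // its MemoryDef, so collecting the accesses that could block sinking
>>   // it is a single linear walk instead of pairwise alias queries.
>>   static void collectDependents(MemoryDef *StoreDef,
>>                                 SmallVectorImpl<MemoryAccess *> &Out) {
>>     for (User *U : StoreDef->users())
>>       Out.push_back(cast<MemoryAccess>(U)); // uses, defs, and phis below
>>   }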
>>
>>
>>
>>> Why is this more involved than replacing two MD's with one, possibly
>>> removing a MP?
>>
>>
>> In the simple case, where there are no stores below you, it is, in fact,
>> just that :)
>>
>> The other complication is that a MemoryPhi may already exist in the block
>> below you (imagine our store a / store b case on both sides of the
>> diamond: a phi will exist in the merge block, and your insertion point is
>> after the phi).
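>>
>> Roughly, for the store a / store b diamond (hand-written, for illustration
>> only):
>>
>>   bb.then:   1 = MemoryDef(0)  store a    3 = MemoryDef(1)  store b
>>   bb.else:   2 = MemoryDef(0)  store a    4 = MemoryDef(2)  store b
>>   bb.merge:  5 = MemoryPhi({bb.then, 3}, {bb.else, 4})
>>
>> After sinking and merging the two "store b"s, the phi merges 1 and 2
>> instead, and the new store's MemoryDef has to be inserted below the phi,
>> with the phi as its defining access.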
>>
>>> From the client's perspective, updates like this look too complicated to
>>> get right.
>>
>> Fwiw: I disagree, but I admit I've written the MemorySSA update routines
>> for a ton of both GCC and LLVM passes.  This is not materially different
>> from the IR update you have to perform if you sink stores and their
>> operands with PRE.
>> This is pretty much as complicated as it gets as well, because it's code
>> motion, not just replacement with existing values.
>>
>>> Maybe there is a verifier in place, but in the code I don't even see a
>>> single assertion that could help debug issues should something go wrong.
>>
>> There is a verifier that is run in both the updating unit tests and the
>> test cases for this pass, to verify the updating is correct.
>> Past that, the removal and insertion routines will assert if you do
>> something that is obviously wrong, much more so than the IR does. The
>> verifier in both cases is used to catch screwups that would cost too much
>> to check at all times.
>> The verifier can be called at any time, and I'm happy to add a call to it
>> here under XDEBUG if you like.
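>>
>> Concretely, that would be something like the following (a sketch; I'm
>> assuming the verifier entry point is spelled verifyMemorySSA(), so treat
>> it as illustrative rather than the exact change):
>>
>>   #ifdef XDEBUG
>>     MSSA->verifyMemorySSA();
>>   #endif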
>>
>>> Perhaps we can come up with some utility routines that make this simpler
>>> to grasp: when two or more memory operations get moved, this is the
>>> blueprint for updating MSSA and guaranteeing it stays consistent.
>>>
>>
>> I think we could in general, but probably not for this kind of
>> specialized code motion.
>>
>> The simple CSE case is already pretty simple (for loads, you do nothing
>> but remove the dead load; for stores, replaceAllUsesWith your
>> definingAccess, then remove the dead store).
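>>
>> Roughly, that looks like this (a sketch; the helper name is made up and
>> the exact MemorySSA signatures are from memory, not copied from EarlyCSE):
>>
>>   static void removeDeadAccess(MemorySSA *MSSA, Instruction *DeadInst) {
>>     // DeadInst is the load or store being CSE'd away.
>>     MemoryAccess *MA = MSSA->getMemoryAccess(DeadInst);
>>     if (auto *Def = dyn_cast<MemoryDef>(MA))
>>       // Anything versioned against this store now reaches the def above.
>>       Def->replaceAllUsesWith(Def->getDefiningAccess());
>>     MSSA->removeMemoryAccess(MA);
>>   }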
>>
>> Note that only stores are ever complex to reason about when updating.
>> Because loads are just uses, they never require much: doing things to them
>> can't change the reaching versions or require new phi nodes (unless you
>> sink loads instead of hoisting them, because both MemorySSA and the normal
>> IR are pruned SSA, so they eliminate phi nodes that have no uses. We could
>> just turn that pruning back off if we wanted to guarantee no load sinking
>> would ever require a new phi node, but I have not hit a pass that wants to
>> sink loads.)
>>
>> For stores, general downward code motion can get complex enough that it
>> may be simpler and easier not to try to preserve the analysis.  This is
>> true of normal SSA as well.  This pass is pretty much the limit of what
>> I'd try to preserve rather than just having it recalculate MemorySSA. The
>> set of update cases here is fixed and formally provable.
>>
>> But thoughts definitely welcome.
>>
>>>
>>> ================
>>> Comment at: lib/Transforms/Scalar/MergedLoadStoreMotion.cpp:1099
>>> @@ +1098,3 @@
>>> +
>>> +  BasicBlock *MD0Block = MD0->getBlock();
>>> +  for (auto UI = MD0->use_begin(), UE = MD0->use_end(); UI != UE;) {
>>> ----------------
>>> There could be a lambda or function that takes care of the two loops.
>>>
>>
>>
>>
>>> http://reviews.llvm.org/D8688
>>>
>>>
>>>
>>>
>