[PATCH] D19821: [EarlyCSE] Optionally use MemorySSA. NFC.

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 23 10:58:33 PDT 2016


Also, just to note one more tradeoff:

The batch use optimizer that MemorySSA uses to optimize uses during
building is significantly (i.e., at least 2x, sometimes 10x) faster than
calling getClobberingMemoryAccess on every load.  It is also an order of
magnitude more memory efficient: the batch use optimizer is able to share
the storage necessary to optimize all loads in the program, leading to at
most O(N) space usage.

This is not possible to fix (one is a top-down algorithm, the other is
bottom-up) except by introducing a batch interface for
getClobberingMemoryAccess that works top-down, and using it.

Given that every client so far wants to know about pretty much every load,
I don't think it would make sense to turn off use optimization during
building.
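
To make the cost concrete, the per-load query pattern the batch optimizer
avoids looks roughly like this (untested sketch, not from the patch; assume
MSSA and F are the MemorySSA analysis and the Function at hand):

  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *LI = dyn_cast<LoadInst>(&I)) {
        // Bottom-up walk per load: the 2x-10x slower path.
        MemoryAccess *Clobber =
            MSSA->getWalker()->getClobberingMemoryAccess(LI);
        // After batch use optimization during building, the defining
        // access of the load's MemoryUse is already the same answer.
        MemoryAccess *Def =
            cast<MemoryUse>(MSSA->getMemoryAccess(LI))->getDefiningAccess();
        (void)Clobber; (void)Def;
      }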




On Tue, Aug 23, 2016 at 10:47 AM, Daniel Berlin <dberlin at dberlin.org> wrote:

>
>
> On Tue, Aug 23, 2016 at 10:17 AM, Geoff Berry <gberry at codeaurora.org>
> wrote:
>
>> gberry added inline comments.
>>
>> ================
>> Comment at: lib/Transforms/Scalar/EarlyCSE.cpp:537
>> @@ +536,3 @@
>> +  MemoryAccess *LaterDef =
>> +      MSSA->getWalker()->getClobberingMemoryAccess(LaterInst);
>> +  return MSSA->dominates(LaterDef, MSSA->getMemoryAccess(EarlierInst));
>> ----------------
>> dberlin wrote:
>> > gberry wrote:
>> > > dberlin wrote:
>> > > > 1. For loads, you don't have to ask for the clobbering access. It's
>> already optimized such that getDefiningAccess == the clobbering access
>> > > >
>> > > > 2. For stores, not sure if you realize this, but
>> > > >
>> > > > given
>> > > > store q (let's call this a)
>> > > > x = load p
>> > > > store q (let's call this b)
>> > > >
>> > > >
>> > > > if you call getClobberingMemoryAccess on b, it will return a.
>> > > >
>> > > >
>> > > For 1., I was not clear on whether this holds true after store
>> removal.
>> > >
>> > > For 2., yeah, I get this; I'm not sure what you're getting at, though.
>> The removal of this second store by EarlyCSE doesn't use MemorySSA to check
>> for intervening loads in this change.  It uses the 'LastStore' tracking to
>> know when a store made redundant by a second store can be removed.
>> > 1. Updates have to make it hold after store removal :)
>> >
>> > The problem is that if we don't keep this invariant up to date, it
>> means everyone uses getClobberingAccess, which does a bunch of work to
>> discover the load already points to the same thing.
>> >
>> > The cost of everyone doing that is much higher than the cost of one
>> person updating the dominating def.
>> >
>> > (There is one case where getClobberingAccess will give you a better
>> answer, and that is in cases where we gave up during use optimization. I
>> only have one testcase where this occurs.  We only give up on optimizing a
>> load if it's going to be super expensive, and you probably do *not* want to
>> try to get better answers in that case.)
>> >
>> > As for updating when you remove stores, you should simply be able to
>> replace the store, in any loads that use it, with getClobberingAccess(store) using RAUW.
>> >
>> > Under the covers, removeMemoryAccess calls RAUW with the DefiningAccess.
>> > We could change it to use getClobberingMemoryAccess for loads, and
>> DefiningAccess for stores.
>> >
>> > 2. ah, okay.
>> >
>> >
>> Okay, I get why just checking the defining access for loads is better (we
>> get to skip the AA check).
>>
>
>
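> For the check in the patch, the load case can then be as simple as
> something like this (rough, untested sketch, using the same
> MSSA/LaterInst/EarlierInst as the patch):
>
>   MemoryAccess *LaterMA = MSSA->getMemoryAccess(LaterInst);
>   MemoryAccess *LaterDef;
>   if (auto *MU = dyn_cast<MemoryUse>(LaterMA))
>     // Loads are already use-optimized: defining access == clobber.
>     LaterDef = MU->getDefiningAccess();
>   else
>     LaterDef = MSSA->getWalker()->getClobberingMemoryAccess(LaterInst);
>   return MSSA->dominates(LaterDef, MSSA->getMemoryAccess(EarlierInst));
>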
>> For stores, we may be able to do something faster than calling
>> getClobberingAccess(store).  We could instead walk up the store's defining
>> access chain and stop if we get to a point that dominates the earlier load
>> or a clobbering write.  I'll have to time this to see if it makes a
>> difference.  I guess it will depend on what percentage of the time the
>> clobber cache has been thrown away.
>>
>
> Yes.  You are basically designing a custom walker here, and that's great.
> If it matters, I would just make it a walker class and
>
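> The walk you describe could look roughly like this (hypothetical,
> untested sketch; the helper name and exact API spelling are mine):
>
>   // Walk up from the later store's MemoryDef.  Succeed if we reach an
>   // access that dominates the earlier load; fail if we first hit a def
>   // that may write the earlier load's location, or a MemoryPhi.
>   static bool laterStoreDefDominatesEarlierLoad(MemorySSA *MSSA,
>                                                 AliasAnalysis *AA,
>                                                 StoreInst *LaterStore,
>                                                 LoadInst *EarlierLoad) {
>     MemoryAccess *EarlierMA = MSSA->getMemoryAccess(EarlierLoad);
>     MemoryLocation EarlierLoc = MemoryLocation::get(EarlierLoad);
>     MemoryAccess *Cur = MSSA->getMemoryAccess(LaterStore);
>     while (auto *UOD = dyn_cast<MemoryUseOrDef>(Cur)) {
>       Cur = UOD->getDefiningAccess();
>       if (MSSA->dominates(Cur, EarlierMA))
>         return true;                      // reached a point above the load
>       if (auto *D = dyn_cast<MemoryDef>(Cur))
>         if (D->getMemoryInst() &&
>             isModSet(AA->getModRefInfo(D->getMemoryInst(), EarlierLoc)))
>           return false;                   // real clobber in between
>     }
>     return false;                         // hit a MemoryPhi; be conservative
>   }
>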
>>
>> As for updating when removing stores: it seems like doing RAUW with
>> getClobberingAccess(store) is not optimal in some cases.
>
>
> Just out of curiosity, do you have a real example of the below?  We work
> really hard to avoid generating that.
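>
> For concreteness, the update being discussed is roughly (untested sketch;
> DeadStore standing in for whatever store is being erased):
>
>   if (MemoryAccess *MA = MSSA->getMemoryAccess(DeadStore)) {
>     // Re-point everything hanging off the store's MemoryDef at the
>     // def's own clobber, then drop the access itself.
>     MemoryAccess *NewDef = MSSA->getWalker()->getClobberingMemoryAccess(MA);
>     MA->replaceAllUsesWith(NewDef);
>     MSSA->removeMemoryAccess(MA);  // nothing left for its internal RAUW
>   }
>   DeadStore->eraseFromParent();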
>
>
>
>> For example:
>>
>>   store @G1 ; 1 = MD(entry)
>>   store @G2 ; 2 = MD(1)
>>   store %p ; 3 = MD(2)
>>   load @G1 ; MU(3)
>>   load @G2  ; MU(3)
>>
>
> These should not be MU(3), they should be MU(1) and MU(2), unless the
> store is *actually* a clobber for them. If the store is actually a clobber,
> I don't see how you could legally remove or move the store (since it
> conflicts with G1/G2) :)
>
>
>
>>
>> If we remove 3 and RAUW getClobberingAccess(3) (=2) we get:
>>
>>   store @G1 ; 1 = MD(entry)
>>   store @G2 ; 2 = MD(1)
>>   load @G1 ; MU(2)
>>   load @G2  ; MU(2)
>
>
> I agree that if you can get MemorySSA to generate the first, you will get
> the second and it will be non-optimal.  If you can, and it's legal to move
> or remove the store, that seems like a super-rare case unless I'm missing
> something?
>
> If you have a testcase with a real example, that would be
> super-interesting to me, because if we can't get it to generate precise
> info in most if not all cases, it's possibly not worth doing the work we
> are doing to optimize uses.
>
> I don't claim we make the perfect set of tradeoffs right now. I know that
> for the use cases I cared about, it makes what is IMHO a good set of choices.
>
>
>
>>
>> but the load @G1 would be more precise if it was MU(1) (and the invariant
>> that defining access == clobbering access would be broken).  Is this just a
>> compile-time/precision trade-off?
>
>
> Kinda. The main problem is it affects everyone downstream *right now*
> (again, we could change this at the cost of having the walker mutate loads).
>
> Imagine something upstream of you generates imprecise info for loads.
> You use getClobberingAccess to work around it, because you want the best
> answer.
>
> Assume everything preserves MemorySSA.
> We still can't really cache this sanely with a side-cache:  either it gets
> blown away, or takes up too much space.
> Caching the result of every load without rewriting is another O(N) space
> factor (we did it before, and George had to undo it :P).
>
> So now you've wasted a bunch of time looking up the real clobbering result
> for the load.
> So will the next pass, and the pass after that, and ...
>
> Stores get pushed *upwards* a lot less than loads, so calling
> getClobberingMemoryAccess on them is rarer, caching them is saner, and
> "misses" are lower cost.
>
> For loads, we can work around needing a separate cache by rewriting the
> defining access in the main walker, using it as the cache (and marking each
> load as optimized with a single bit or something, so we know not to waste
> time doing it again until the defining access gets reset), but then the
> walker is mutating the MemorySSA, which is "not ideal" if we can avoid it.
>
> If we can't, such is life; I'm basically trying to kick this can as far
> down the road as I can before we use that solution :)
>
> But y'all are the ones doing a lot of the work at this point, so let me
> know if you think we should stop trying to maintain these invariants.
>
>
>
>> Maybe for that reason it makes more sense to let the client decide
>> whether they want to do the simple RAUW with getClobberingAccess(Store) or
>> optimize each use separately?
>>
>>
> If we can make it come out optimal for everyone, we should do that.
> Otherwise, one client screws up another :)
>
> If we can't, yeah, sure, then we start making tradeoffs.
>
>
>