[PATCH] D19821: [EarlyCSE] Optionally use MemorySSA. NFC.

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 24 16:34:26 PDT 2016


(Modulo this, btw, the code LGTM.  I definitely don't think we should wait
to resolve this before you commit it.)


On Tue, Aug 23, 2016 at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> Ugh.
> Okay, so.
> A couple thoughts:
>
> 1. We could maintain it by doing batch use optimization at the end of each
> pass or something, and then using the faster use optimizer to reoptimize the
> affected loads (it's easy to tell which loads are possibly affected).
>
> 2. We could drop it.  We would still build an initially optimized form,
> but whatever happens past that, happens.  We would make
> getClobberingMemoryAccess in the default walker modify loads to store the
> optimized result on them, and add a bool to MemoryUse to track which loads
> may need to be reoptimized (that gets reset in setDefiningAccess unless you
> tell it not to).
>
> This will let you call getClobberingMemoryAccess on everything without it
> being really expensive for loads (right now it does a bunch of search/etc.
> setup, and we need a way to avoid this).
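>
> Rough sketch of the shape I have in mind (isOptimized/setOptimized on
> MemoryUse and walkToClobber are hypothetical names, not existing API):
>
>   MemoryAccess *Walker::getClobberingMemoryAccess(MemoryAccess *MA) {
>     if (auto *MU = dyn_cast<MemoryUse>(MA)) {
>       // Already optimized and not reset since: the defining access
>       // *is* the clobbering access, so skip the search entirely.
>       if (MU->isOptimized())
>         return MU->getDefiningAccess();
>       MemoryAccess *Clobber = walkToClobber(MU);  // the expensive part
>       MU->setDefiningAccess(Clobber);  // the walker mutates the load
>       MU->setOptimized(true);          // cleared by setDefiningAccess
>       return Clobber;
>     }
>     return walkToClobber(MA);  // stores are handled as before
>   }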
>
> The downsides to all of this are:
>
> 1. The walker now modifies loads.  Even if you are storing the defining
> accesses somewhere and we change them (not sure why this would happen, but
> ...), it should be okay; it would just mean you have a valid but
> conservative answer.
>
> 2.  It means a store's uses will no longer include all possible loads that
> die there. It will only include some, unless you run around and call
> getClobberingMemoryAccess on every load.
>
> I will have to do that in NewGVN to make load/store forwarding work
> sanely, but I'll deal with that.
>
>
> + a few others for thoughts.
>
>
> On Tue, Aug 23, 2016 at 1:04 PM, Geoff Berry <gberry at codeaurora.org>
> wrote:
>
>>
>>
>> On 8/23/2016 2:17 PM, Geoff Berry via llvm-commits wrote:
>>
>>
>>
>> On 8/23/2016 1:47 PM, Daniel Berlin wrote:
>>
>>
>>
>> On Tue, Aug 23, 2016 at 10:17 AM, Geoff Berry <gberry at codeaurora.org>
>> wrote:
>>
>>> gberry added inline comments.
>>>
>>> ================
>>> Comment at: lib/Transforms/Scalar/EarlyCSE.cpp:537
>>> @@ +536,3 @@
>>> +  MemoryAccess *LaterDef =
>>> +      MSSA->getWalker()->getClobberingMemoryAccess(LaterInst);
>>> +  return MSSA->dominates(LaterDef, MSSA->getMemoryAccess(EarlierInst));
>>> ----------------
>>> dberlin wrote:
>>> > gberry wrote:
>>> > > dberlin wrote:
>>> > > > 1. For loads, you don't have to ask for the clobbering access.
>>> It's already optimized such that getDefiningAccess == the clobbering access
>>> > > >
>>> > > > 2. For stores, not sure if you realize this, but
>>> > > >
>>> > > > given
>>> > > > store q (let's call this a)
>>> > > > x = load p
>>> > > > store q (let's call this b)
>>> > > >
>>> > > >
>>> > > > if you call getClobberingMemoryAccess on b, it will return a.
>>> > > >
>>> > > >
>>> > > For 1., I was not clear on whether this holds true after store
>>> removal.
>>> > >
>>> > > For 2., yeah, I get this; I'm not sure what you're getting at,
>>> though.  The removal of this second store by EarlyCSE doesn't use MemorySSA
>>> to check for intervening loads in this change.  It uses the 'LastStore'
>>> tracking to know when a store made redundant by a second store can be
>>> removed.
>>> > 1. Updates have to make it hold after store removal :)
>>> >
>>> > The problem is that if we don't keep this invariant up to date, it
>>> means everyone uses getClobberingAccess, which does a bunch of work to
>>> discover the load already points to the same thing.
>>> >
>>> > The cost of everyone doing that is much higher than the cost of one
>>> person updating the dominating def.
>>> >
>>> > (there is one case where getClobberingAccess will give you a better
>>> answer, and that is in cases where we gave up during use optimization.  I
>>> only have one testcase where this occurs.  We only give up on optimizing a
>>> load if it's going to be super expensive, and you probably do *not* want to
>>> try to get better answers in that case).
>>> >
>>> > As for updating when you remove stores, you should simply be able to
>>> replace the store in any loads that use it with getClobberingAccess(store) via RAUW.
>>> >
>>> > Under the covers, removeMemoryAccess calls RAUW with the
>>> DefiningAccess.
>>> > We could change it to use getClobberingMemoryAccess for loads, and
>>> DefiningAccess for stores.
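>>> >
>>> > i.e. something like this sketch (MD is the def being removed; today
>>> > the code just does MD->replaceAllUsesWith(MD->getDefiningAccess())):
>>> >
>>> >   while (!MD->use_empty()) {
>>> >     Use &U = *MD->use_begin();
>>> >     if (isa<MemoryUse>(U.getUser()))
>>> >       // Loads get the precise clobber of the dying store.
>>> >       U.set(Walker->getClobberingMemoryAccess(MD));
>>> >     else
>>> >       // Stores and phis just get the defining access.
>>> >       U.set(MD->getDefiningAccess());
>>> >   }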
>>> >
>>> > 2. ah, okay.
>>> >
>>> >
>>> Okay, I get why just checking the defining access for loads is better
>>> (we get to skip the AA check).
>>>
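>>> i.e. the check in the patch could presumably become:
>>>
>>>   MemoryAccess *LaterMA = MSSA->getMemoryAccess(LaterInst);
>>>   MemoryAccess *LaterDef = isa<MemoryUse>(LaterMA)
>>>       ? cast<MemoryUse>(LaterMA)->getDefiningAccess()  // loads: no AA query
>>>       : MSSA->getWalker()->getClobberingMemoryAccess(LaterInst);
>>>   return MSSA->dominates(LaterDef, MSSA->getMemoryAccess(EarlierInst));
>>>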
>>
>>
>>> For stores, we may be able to do something faster than calling
>>> getClobberingAccess(store).  We could instead walk up the store defining
>>> access chain and stop if we get to a point that dominates the earlier load
>>> or a clobbering write.  I'll have to time this to see if it makes a
>>> difference.  I guess it will depend on what percentage of the time the
>>> clobber cache has been thrown away.
>>>
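>>> Roughly (sketch only; clobbers() and EarlierLoc stand in for whatever AA
>>> query and MemoryLocation are appropriate for the earlier access):
>>>
>>>   // Walk up the later store's def chain; succeed if we reach an
>>>   // access that dominates the earlier one before hitting a clobber.
>>>   MemoryAccess *Earlier = MSSA->getMemoryAccess(EarlierInst);
>>>   MemoryAccess *A =
>>>       cast<MemoryDef>(MSSA->getMemoryAccess(LaterInst))->getDefiningAccess();
>>>   while (!MSSA->isLiveOnEntryDef(A)) {
>>>     if (MSSA->dominates(A, Earlier))
>>>       return true;                  // nothing can intervene
>>>     auto *D = dyn_cast<MemoryDef>(A);
>>>     if (!D || clobbers(D, EarlierLoc))
>>>       return false;                 // phi or real clobber: give up
>>>     A = D->getDefiningAccess();
>>>   }
>>>   return true;                      // walked all the way to entry
>>>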
>>
>> Yes.  You are basically designing a custom walker here, and that's great.
>> If it matters, I would just make it a walker class.
>>
>>>
>>> As for updating when removing stores: it seems like doing RAUW with
>>> getClobberingAccess(store) is not optimal in some cases.
>>
>>
>> Just out of curiosity, do you have a real example of the below?  We work
>> really hard to avoid generating that.
>>
>>
>>
>>> For example:
>>>
>>>   store @G1 ; 1 = MD(entry)
>>>   store @G2 ; 2 = MD(1)
>>>   store %p ; 3 = MD(2)
>>>   load @G1 ; MU(3)
>>>   load @G2  ; MU(3)
>>>
>>
>> These should not be MU(3), they should be MU(1) and MU(2), unless the
>> store is *actually* a clobber for them. If the store is actually a clobber,
>> I don't see how you could legally remove or move the store (since it
>> conflicts with G1/G2) :)
>>
>>
>>
>>>
>>> If we remove 3 and RAUW getClobberingAccess(3) (=2) we get:
>>>
>>>   store @G1 ; 1 = MD(entry)
>>>   store @G2 ; 2 = MD(1)
>>>   load @G1 ; MU(2)
>>>   load @G2  ; MU(2)
>>
>>
>> I agree that if you can get MemorySSA to generate the first, you will get
>> the second and it will be non-optimal.  If you can, and it's legal to move
>> or remove the store, that seems like a super-rare case unless I'm missing
>> something?
>>
>>
>> Just to reply to this one particular point for now, here's a case where
>> this can happen because of EarlyCSE removing a store that is simply storing
>> back the value just loaded from the same memory location:
>>
>>
>> define void @test3(i32* %p) {
>> entry:
>> ; 1 = MemoryDef(liveOnEntry)
>>   store i32 1, i32* @G1
>> ; 2 = MemoryDef(1)
>>   store i32 2, i32* @G2
>> ; MemoryUse(2)
>>   %xp = load i32, i32* %p
>> ; 3 = MemoryDef(2)
>>   store i32 %xp, i32* %p
>> ; MemoryUse(3)
>>   %x1 = load i32, i32* @G1
>> ; MemoryUse(3)
>>   %x2 = load i32, i32* @G2
>> ; 4 = MemoryDef(3)
>>   call void @foo(i32 %x1, i32 %x2, i32 %xp)
>>   ret void
>> }
>>
>> Removing:   store i32 %xp, i32* %p
>>
>> define void @test3(i32* %p) {
>> entry:
>> ; 1 = MemoryDef(liveOnEntry)
>>   store i32 1, i32* @G1
>> ; 2 = MemoryDef(1)
>>   store i32 2, i32* @G2
>> ; MemoryUse(2)
>>   %xp = load i32, i32* %p
>> ; MemoryUse(2)
>>   %x1 = load i32, i32* @G1
>> ; MemoryUse(2)
>>   %x2 = load i32, i32* @G2
>> ; 4 = MemoryDef(2)
>>   call void @foo(i32 %x1, i32 %x2, i32 %xp)
>>   ret void
>> }
>>
>>
>> ... and to push this case a bit further, it seems like it wouldn't be too
>> hard to construct a case where removing this store would necessitate
>> looking down through MemoryPhis in order to maintain the optimized-use
>> invariant, possibly even leading to more MemoryPhi optimization?
>>
>>
>> If you have a testcase with a real example, that would be
>> super-interesting to me.
>> Because if we can't get it to generate precise info in most if not all
>> cases, it's possibly not worth doing the work we are doing to optimize
>> uses.
>>
>> I don't claim we make the perfect set of tradeoffs right now. I know for
>> the use cases I cared about, it makes an IMHO good set of choices.
>>
>>
>>
>>>
>>> But the load @G1 would be more precise if it were MU(1) (and the invariant
>>> that defining access == clobbering access would be broken).  Is this just a
>>> compile-time/precision trade-off?
>>
>>
>> Kinda. The main problem is it affects everyone downstream *right now*
>> (again, we could change this at the cost of having the walker mutate loads).
>>
>> Imagine something upstream of you generates imprecise info for loads.
>> You use getClobberingAccess to work around it, because you want the best
>> answer.
>>
>> Assume everything preserves MemorySSA.
>> We still can't really cache this sanely with a side-cache:  either it
>> gets blown away, or takes up too much space.
>> Caching the result of every load without rewriting is another O(N) space
>> factor (we did it before, and George had to undo it :P).
>>
>> So now you wasted a bunch of time looking up the real clobbering result
>> for the load.
>> So will the next pass, and the pass after that, and ...
>>
>> Stores get pushed *upwards* a lot less than loads, so calling
>> getClobberingMemoryAccess on them is more rare, and caching them is more
>> sane, and "misses" are lower cost.
>>
>> For loads, we can work around having to have a separate cache by
>> rewriting the defining access in the main walker as a cache (and marking
>> them as optimized with a single bit or something, so we know not to waste
>> time doing it again until the defining access gets reset again), but then
>> the walker is mutating the MemorySSA, which is "not ideal" if we can avoid
>> it.
>>
>> If we can't, such is life; I'm basically trying to kick this can as far
>> down the road as I can before we use that solution :)
>>
>> But y'all are the ones doing a lot of the work at this point, so let me
>> know if you think we should stop trying to maintain these invariants.
>>
>>
>>
>>> Maybe for that reason it makes more sense to let the client decide if
>>> they want to do the simple RAUW with getClobberingAccess(Store) or optimize each
>>> use separately?
>>>
>>>
>> If we can make it come out optimal for everyone, we should do that.
>> Otherwise, one client screws up another :)
>>
>> If we can't, yeah, sure, then we start making tradeoffs.
>>
>>
>>
>> --
>> Geoff Berry
>> Employee of Qualcomm Datacenter Technologies, Inc.
>>  Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.