[PATCH] D19821: [EarlyCSE] Optionally use MemorySSA. NFC.

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 23 13:32:07 PDT 2016


Ugh.
Okay, so.
A couple thoughts:

1. We could maintain it by doing batch use opt at the end of each pass or
something, and then using the faster use optimizer to reoptimize affected
loads (it's easy to tell the set of possibly affected loads).

2. We could drop it.  We would still build an initially optimized form, but
whatever happens past that, happens.  We would make
getClobberingMemoryAccess in the default walker modify loads to store the
optimized result on them, and add a bool to MemoryUse to track which loads
may need to be reoptimized (that gets reset in setDefiningAccess unless you
tell it not to).

This will let you call getClobberingMemoryAccess on everything without it
being really expensive for loads (right now it does a bunch of search/etc.
setup, and we need a way to avoid this).
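Option 2's mechanism can be sketched concretely. Below is a toy Python model
(purely illustrative; the class and method names mimic, but are not, the
actual MemorySSA C++ API): the walker caches a clobber query by rewriting the
use's defining access and setting a bit, and setDefiningAccess clears that
bit unless told not to.

```python
# Toy model of option 2 -- illustrative only, not the LLVM MemorySSA API.
# The walker caches getClobberingMemoryAccess results by rewriting the
# load's defining access; a per-use bit records whether that cached
# answer is still valid.

class MemoryAccess:
    def __init__(self, name, defining=None):
        self.name = name
        self.defining = defining

class MemoryUse(MemoryAccess):
    def __init__(self, name, defining):
        super().__init__(name, defining)
        self.optimized = False  # False => may need (re)optimization

    def set_defining_access(self, new_def, keep_optimized=False):
        self.defining = new_def
        # By default, changing the defining access invalidates the
        # cached clobber answer.
        if not keep_optimized:
            self.optimized = False

class Walker:
    def __init__(self):
        self.expensive_queries = 0

    def real_clobber_query(self, use):
        # Stand-in for the expensive upward walk; here it just returns
        # the current defining access.
        self.expensive_queries += 1
        return use.defining

    def get_clobbering_memory_access(self, use):
        if use.optimized:
            return use.defining  # cheap cache hit
        clobber = self.real_clobber_query(use)
        # Cache by rewriting the defining access and marking it valid.
        use.set_defining_access(clobber, keep_optimized=True)
        use.optimized = True
        return clobber
```

With this shape, repeated clobber queries on the same load cost one pointer
read after the first, and any updater that resets a defining access
automatically marks the load as needing reoptimization.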

The downsides to all of this are:

1. The walker now modifies loads.  Even if you are storing the defining
accesses somewhere and we change one (not sure why this would happen, but
...), it should be okay; it would just mean you have a valid but
conservative answer.

2.  It means a store's uses will no longer include all possible loads that
die there. It will only include some, unless you run around and call
getClobberingMemoryAccess on every load.

I will have to do that in NewGVN to make load/store forwarding work sanely,
but I'll deal with that.


+ a few others for thoughts.


On Tue, Aug 23, 2016 at 1:04 PM, Geoff Berry <gberry at codeaurora.org> wrote:

>
>
> On 8/23/2016 2:17 PM, Geoff Berry via llvm-commits wrote:
>
>
>
> On 8/23/2016 1:47 PM, Daniel Berlin wrote:
>
>
>
> On Tue, Aug 23, 2016 at 10:17 AM, Geoff Berry <gberry at codeaurora.org>
> wrote:
>
>> gberry added inline comments.
>>
>> ================
>> Comment at: lib/Transforms/Scalar/EarlyCSE.cpp:537
>> @@ +536,3 @@
>> +  MemoryAccess *LaterDef =
>> +      MSSA->getWalker()->getClobberingMemoryAccess(LaterInst);
>> +  return MSSA->dominates(LaterDef, MSSA->getMemoryAccess(EarlierInst));
>> ----------------
>> dberlin wrote:
>> > gberry wrote:
>> > > dberlin wrote:
>> > > > 1. For loads, you don't have to ask for the clobbering access. It's
>> already optimized such that getDefiningAccess == the clobbering access
>> > > >
>> > > > 2. For stores, not sure if you realize this, but
>> > > >
>> > > > given
>> > > > store q  (let's call this a)
>> > > > x = load p
>> > > > store q (let's call this b)
>> > > >
>> > > >
>> > > > if you call getClobberingMemoryAccess on b, it will return a.
>> > > >
>> > > >
>> > > For 1., I was not clear on whether this holds true after store
>> removal.
>> > >
>> > > For 2., yeah I get this, I'm not sure what you're getting at though.
>> The removal of this second store by EarlyCSE doesn't use MemorySSA to check
>> for intervening loads in this change.  It uses the 'LastStore' tracking to
>> know when a store made redundant by a second store can be removed.
>> > 1. Updates have to make it hold after store removal :)
>> >
>> > The problem is that if we don't keep this invariant up to date, it
>> means everyone uses getClobberingAccess, which does a bunch of work to
>> discover the load already points to the same thing.
>> >
>> > The cost of everyone doing that is much higher than the cost of one
>> person updating the dominating def.
>> >
>> > (There is one case where getClobberingAccess will give you a better
>> answer, and that is on cases where we gave up during use optimization. I
>> only have one testcase this occurs on.  We only give up on optimizing a
>> load if it's going to be super expensive, and you probably do *not* want to
>> try to get better answers in that case).
>> >
>> > As for updating when you remove stores, you should simply be able to
>> replace any loads the store uses with getClobberingAccess(store) using RAUW.
>> >
>> > Under the covers, removeMemoryAccess calls RAUW with the DefiningAccess.
>> > We could change it to use getClobberingMemoryAccess for loads, and
>> DefiningAccess for stores.
>> >
>> > 2. ah, okay.
>> >
>> >
>> Okay, I get why just checking the defining access for loads is better (we
>> get to skip the AA check).
>>
>
>
>> For stores, we may be able to do something faster than calling
>> getClobberingAccess(store).  We could instead walk up the store defining
>> access chain and stop if we get to a point that dominates the earlier load
>> or a clobbering write.  I'll have to time this to see if it makes a
>> difference.  I guess it will depend on what percentage of the time the
>> clobber cache has been thrown away.
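The custom walk described just above could look roughly like this (an
illustrative Python sketch, not the LLVM API; `dominates` and `clobbers` are
assumed callbacks standing in for DominatorTree and AliasAnalysis queries):

```python
# Illustrative sketch (not LLVM API) of the proposed walk: starting from
# the later store's defining access, walk up the defining-access chain.
# If we reach an access that dominates the earlier load before hitting a
# write that clobbers it, there is no intervening write.
def earlier_load_unclobbered(later_store, earlier_load, dominates, clobbers):
    a = later_store["defining"]
    while a is not None:
        if dominates(a, earlier_load):
            # Anything dominating the earlier load executes before it,
            # so it cannot be an intervening write.
            return True
        if clobbers(a, earlier_load):
            return False
        a = a["defining"]
    return True  # walked all the way past liveOnEntry
```

Note the ordering: the dominance check comes first, because a clobbering
write that dominates the earlier load is *before* it and therefore not an
intervening write.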
>>
>
> Yes.  You are basically designing a custom walker here, and that's great.
> If it matters, I would just make it a walker class and
>
>>
>> As for updating when removing stores: it seems like doing RAUW with
>> getClobberingAccess(store) is not optimal in some cases.
>
>
> Just out of curiosity do you have a real example of the below?  We work
> really hard to avoid generating that.
>
>
>
>> For example:
>>
>>   store @G1 ; 1 = MD(entry)
>>   store @G2 ; 2 = MD(1)
>>   store %p ; 3 = MD(2)
>>   load @G1 ; MU(3)
>>   load @G2  ; MU(3)
>>
>
> These should not be MU(3), they should be MU(1) and MU(2), unless the
> store is *actually* a clobber for them. If the store is actually a clobber,
> I don't see how you could legally remove or move the store (since it
> conflicts with G1/G2) :)
>
>
>
>>
>> If we remove 3 and RAUW getClobberingAccess(3) (=2) we get:
>>
>>   store @G1 ; 1 = MD(entry)
>>   store @G2 ; 2 = MD(1)
>>   load @G1 ; MU(2)
>>   load @G2  ; MU(2)
>
>
> I agree that if you can get MemorySSA to generate the first, you will get
> the second and it will be non-optimal.  If you can, and it's legal to move
> or remove the store, that seems like a super-rare case unless I'm missing
> something?
>
>
> Just to reply to this one particular point for now, here's a case where
> this can happen because of EarlyCSE removing a store that is simply storing
> back the value just loaded from the same memory location:
>
>
> define void @test3(i32* %p) {
> entry:
> ; 1 = MemoryDef(liveOnEntry)
>   store i32 1, i32* @G1
> ; 2 = MemoryDef(1)
>   store i32 2, i32* @G2
> ; MemoryUse(2)
>   %xp = load i32, i32* %p
> ; 3 = MemoryDef(2)
>   store i32 %xp, i32* %p
> ; MemoryUse(3)
>   %x1 = load i32, i32* @G1
> ; MemoryUse(3)
>   %x2 = load i32, i32* @G2
> ; 4 = MemoryDef(3)
>   call void @foo(i32 %x1, i32 %x2, i32 %xp)
>   ret void
> }
>
> Removing:   store i32 %xp, i32* %p
>
> define void @test3(i32* %p) {
> entry:
> ; 1 = MemoryDef(liveOnEntry)
>   store i32 1, i32* @G1
> ; 2 = MemoryDef(1)
>   store i32 2, i32* @G2
> ; MemoryUse(2)
>   %xp = load i32, i32* %p
> ; MemoryUse(2)
>   %x1 = load i32, i32* @G1
> ; MemoryUse(2)
>   %x2 = load i32, i32* @G2
> ; 4 = MemoryDef(2)
>   call void @foo(i32 %x1, i32 %x2, i32 %xp)
>   ret void
> }
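A toy simulation of the removal above (illustrative Python, not the LLVM
API; alias relationships are reduced to exact location matches, whereas real
MemorySSA consults alias analysis, which is why the loads could be MU(3) to
begin with) shows the resulting imprecision: after RAUWing the users of 3
with its defining access 2, load @G1 is valid but points past its precise
clobber, which is 1.

```python
# Toy model of the example: each access records its defining access and
# the location it writes; removing a MemoryDef RAUWs its users with its
# own defining access (the simple removeMemoryAccess behavior).
# Names are illustrative, not actual LLVM API.

def precise_clobber(access, loc):
    """Walk defining accesses to the nearest def writing `loc` exactly.

    A simplification of what getClobberingMemoryAccess computes with
    real alias analysis."""
    a = access
    while a is not None:
        if a.get("writes") == loc:
            return a["name"]
        a = a.get("defining")
    return "liveOnEntry"

entry = {"name": "liveOnEntry", "defining": None}
d1 = {"name": "1", "writes": "@G1", "defining": entry}   # store @G1
d2 = {"name": "2", "writes": "@G2", "defining": d1}      # store @G2
d3 = {"name": "3", "writes": "%p", "defining": d2}       # store %p (removed)
load_g1 = {"name": "load @G1", "defining": d3}           # was MU(3)
load_g2 = {"name": "load @G2", "defining": d3}           # was MU(3)

# Remove 3: RAUW its users with its defining access, so both loads
# now point at 2.
for u in (load_g1, load_g2):
    u["defining"] = d3["defining"]

# load @G2 is now optimally placed, but load @G1 is valid yet imprecise:
# its defining access is 2 while its precise clobber is 1.
assert load_g1["defining"]["name"] == "2"
assert precise_clobber(load_g1["defining"], "@G1") == "1"
```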
>
>
> ... and to push this case a bit further, it seems like it wouldn't be too
> hard to construct a case where removing this store would necessitate
> looking down through MemoryPhis in order to maintain the optimized-use
> invariant, possibly even leading to more MemoryPhi optimization?
>
>
> If you have a testcase with a real example, that would be
> super-interesting to me.
> Because if we can't get it to generate precise info in most if not all
> cases, it's possibly not worth doing the work we are doing to optimize
> uses.
>
> I don't claim we make the perfect set of tradeoffs right now. I know for
> the use cases I cared about, it makes an IMHO good set of choices.
>
>
>
>>
>> but the load @G1 would be more precise if it was MU(1) (and the invariant
>> that defining access == clobbering access would be broken).  Is this just a
>> compile-time/precision trade-off?
>
>
> Kinda. The main problem is it affects everyone downstream *right now*
> (again, we could change this at the cost of having the walker mutate loads).
>
> Imagine something upstream of you generates imprecise info for loads.
> You use getClobberingAccess to work around it, because you want the best
> answer.
>
> Assume everything preserves MemorySSA.
> We still can't really cache this sanely with a side-cache:  either it gets
> blown away, or takes up too much space.
> Caching the result of every load without rewriting is another O(N) space
> factor (we did it before, and George had to undo it :P).
>
> So now you wasted a bunch of time looking up the real clobbering result
> for the load.
> So will the next pass, and the pass after that, and ...
>
> Stores get pushed *upwards* a lot less than loads, so calling
> getClobberingMemoryAccess on them is more rare, and caching them is more
> sane, and "misses" are lower cost.
>
> For loads, we can work around having to have a separate cache by rewriting
> the defining access in the main walker as a cache (and marking them as
> optimized with a single bit or something, so we know not to waste time
> doing it again until the defining access gets reset again), but then the
> walker is mutating the MemorySSA, which is "not ideal" if we can avoid it.
>
> If we can't, such is life; I'm basically trying to kick this can as far
> down the road as I can before we use that solution :)
>
> But y'all are the ones doing a lot of the work at this point, so let me
> know if you think we should stop trying to maintain these invariants.
>
>
>
>> Maybe for that reason it makes more sense to let the client decide if
>> they want to do the simple RAUW getClobberingAccess(Store) or optimize each
>> use separately?
>>
>>
> If we can make it come out optimal for everyone, we should do that.
> Otherwise, one client screws up another :)
>
> If we can't, yeah, sure, then we start making tradeoffs.
>
>
>
> --
> Geoff Berry
> Employee of Qualcomm Datacenter Technologies, Inc.
>  Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
>
>

