<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Nov 1, 2016 at 9:41 AM, Bryant Wong <span dir="ltr"><<a href="mailto:3.14472+reviews.llvm.org@gmail.com" target="_blank">3.14472+reviews.llvm.org@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">For some reason, your most recent replies failed to arrive to my inbox, and none<br>

of them are visible on Phab. The below was what I could dig up from the<br>

llvm-commits website. Please let me know if anything's missing.<br>

<span class=""><br>

>> To give this a bit of context, this patch stems from issues that I've<br>

>> encountered while porting MemCpyOpt to MSSA.<br>

>><br>

><br>

</span><span class="">> Okay. I'm not sure i would try to port instead of just rewrite.  The whole<br>

> goal of MemorySSA is to enable us to write memory optimizations in non-N^2<br>

> ways.<br>

<br>

</span>By "rewrite," are you suggesting that I abandon incrementality altogether and<br>

create a new but analogous pass wholesale?</blockquote><div><br></div><div>I'm actually pretty positive you can reuse most of it, but reorder how it works quite a bit.</div><div>I have the beginnings of one :)<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> This would surely be liberating, but<br>

it seems that incremental patches are easier to review and likelier to be<br>

accepted.<br></blockquote><div><br></div><div>By who? :)</div><div>The goal of MemorySSA is in fact, to liberate us to rewrite some of the N^2 passes into O(N) passes.</div><div>Some of that may be fairly simple ports, some of it may require seriously rethinking passes.</div><div>If we can do it incrementally, great. If we can't, we shouldn't avoid it.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The reason for asking is that there are areas of MCO that rely on MemDep in ways<br>

that aren't easily translated to MSSA. </blockquote><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">For instance, processStore requires<br>

moving possibly memory-related instructions upwards before transforming a<br>

load-store pair into a memcpy. This is problematic because MSSA currently<br>

doesn't support staying synchronized to instruction motion (as far as I've<br>

read).</blockquote><div>It does, for *moving* instructions, and *replacing them*, but not random insertion.</div><div>What you have just stated is perfectly supported.<br></div><div><br></div><div>You also have to differentiate between loads and stores.</div><div><br>Load movement, anywhere, and load insertion, is 100% trivial.</div><div>It's *store* insertion, and store movement, that is harder.</div><div>Store movement and replacement is doable (GVNHoist already does it).</div><div>But as i said, random store insertion is expensive, and in this case, not actually necessary.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Here's an example where upwards motion is necessary for the transformation to<br>

occur:<br>

<br>

  %S = type { i8*, i8, i32 }<br>

<br>

  define void @noaliasaddrproducer(%S* %src, %S* noalias %dst,<br>

                                    %S* noalias %dstidptr) {<br>

  ; MemoryUse(liveOnEntry)<br>

    %1 = load %S, %S* %src              ; combine this into a memcpy...<br>

  ; 1 = MemoryDef(liveOnEntry)<br>

    store %S zeroinitializer, %S* %src<br>

    %y = bitcast %S* %dstidptr to i8*<br>

  ; 2 = MemoryDef(1)<br>

    call void @llvm.memset.p0i8.i64(i8* %y, i8 0, i64 54, i32 0, i1 false)<br>

  ; 3 = MemoryDef(2)<br>

    store %S %1, %S* %dstidptr          ; ...with this.<br>

    ret void<br>

  }<br>

<br>

The store to %src prevents %1 from being moved down to its corresponding store, moving moving the bitcast and, in particular, the memset.</blockquote><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Currently, I handle this with a `spliceMemoryAccess" entry point that splices<br>

the AccessList around; RAU of 2 W 1; sets 2's def-access to liveOnEntry; and<br>

resetOptimized whichever MemoryUses depended on 2. Would such an API call be<br>

acceptable to add to MSSA?<br></blockquote><div><br></div><div> <br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

><br>

> If you just replace one querying with the other querying, this is not<br>

> likely to give you this result.<br>

> It may be faster (or slower).<br>

<br>

</span>My hunch is that even a tiny bit slower unoptimized usage of MSSA would still be<br>

faster than the current situation with MemDep. And a deal more versatile as<br>

well. For instance, nonlocal patterns that aren't touched now would be trivial<br>

to match with MSSA.<br>

<span class=""><br></span></blockquote><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

><br>

> MemCpyOpt currently iterates in no particular order.<br>

> It should be possible to do it all in a single iteration, either bottom up<br>

> or top down.<br>

<br>

</span>I don't have an example right this moment, but don't Inst- and DAGCombines and a<br>

bunch of other tree reduction-esque passes that operate on post-isel, pre-alloc<br>

MachineInstr iterate until no changes are made?<br></blockquote><div><br></div><div>Yes, but i don't think anyone would claim this is great.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

><br>

><br>

>> Many of the transforms in this pass<br>

>> create entirely new memsets, memcpys, and memmoves (among others). After<br>

>> each<br>

>> new insertion, I expect to also have to update MSSA since 1)<br>

>> MemorySSAWrapperPass is delcared preserved to the pass manager, and 2)<br>

>> subsequent MCO iterations which depend on MSSA need to be able to see,<br>

>> analyze,<br>

>> and possibly further transform these newly added meminsts.<br>

>><br>

><br>

</span><span class="">> Can you give an example of a case where, processing in top-down order,<br>

> going from defs to uses, it's not possible to do this in a single iteration?<br>

><br>

> I understand why, giving a random ordering it does it in now, and no<br>

> use-def/def-use chains, it cannot do so.<br>

> I cannot, off the top of my head, think of a case it would not be possible<br>

> to do in a single iteration.<br>

><br>

><br>

</span><div><div class="h5">>> Here's an example: MemCpyOptPass::<wbr>tryMergingIntoMemset will convert the<br>

>> sequences of stores to %P into a memset:<br>

>><br>

>>   define void @foo(i64* nocapture %P, i64* %Q) {<br>

>>   entry:<br>

>>     %0 = bitcast i64* %P to i16*<br>

>>     %arrayidx = getelementptr inbounds i16, i16* %0, i64 1<br>

>>     %1 = bitcast i16* %arrayidx to i32*<br>

>>     %arrayidx1 = getelementptr inbounds i16, i16* %0, i64 3<br>

>>   ; 1 = MemoryDef(liveOnEntry)<br>

>>     store i16 0, i16* %0, align 2<br>

>>   ; 2 = MemoryDef(1)<br>

>>     store i32 0, i32* %1, align 4<br>

>>   ; 3 = MemoryDef(2)<br>

>>     store i16 0, i16* %arrayidx1, align 2<br>

>>   ; 4 = MemoryDef(3)<br>

>>     store i64 0, i64* %Q<br>

>>     ret void<br>

>>   }<br>

>><br>

>> Specifically, a memset is first inserted right before the store to %Q:<br>

>><br>

>>   define void @foo(i64* nocapture %P, i64* %Q) {<br>

>>   entry:<br>

>>     %0 = bitcast i64* %P to i16*<br>

>>     %arrayidx = getelementptr inbounds i16, i16* %0, i64 1<br>

>>     %1 = bitcast i16* %arrayidx to i32*<br>

>>     %arrayidx1 = getelementptr inbounds i16, i16* %0, i64 3<br>

>>   ; 1 = MemoryDef(liveOnEntry)<br>

>>     store i16 0, i16* %0, align 2<br>

>>   ; 2 = MemoryDef(1)<br>

>>     store i32 0, i32* %1, align 4<br>

>>   ; 3 = MemoryDef(2)<br>

>>     store i16 0, i16* %arrayidx1, align 2<br>

>><br>

>>     call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 8, i32 2, i1 false)<br>

>> ; new<br>

>><br>

>>   ; 4 = MemoryDef(3)<br>

>>     store i64 0, i64* %Q<br>

>>     ret void<br>

>>   }<br>

>><br>

>> Next, a new MemoryAccess is created by passing to createMemoryAccessBefore<br>

>> the<br>

>> memset instruction, `3 = MemoryDef(2)` (the new access's defining access),<br>

>><br>

> and<br>

>> `4 = MemoryDef(3)` (the new access's insertion point).<br>

>><br>

><br>

</div></div><span class="">> This won't work, of course.<br>

><br>

> However, this memset is being created to eliminate stores, no?<br>

> What stores are you eliminating?<br>

><br>

> If it it's the stores at 1, 2, and 3, then for sure, the correct defining<br>

> access is *not* "3 = MemoryDef".<br>

><br>

> It's going to be "liveOnEntry" after you kill the stores.<br>

><br>

> Then, RAUW on each of the MemoryDef's works fine, and you removeAccess them.<br>

><br>

><br>

><br>

><br>

</span><span class="">>>   define void @foo(i64* nocapture %P, i64* %Q) {<br>

>>   entry:<br>

>>     %0 = bitcast i64* %P to i16*<br>

>>     %arrayidx = getelementptr inbounds i16, i16* %0, i64 1<br>

>>     %1 = bitcast i16* %arrayidx to i32*<br>

>>     %arrayidx1 = getelementptr inbounds i16, i16* %0, i64 3<br>

>>   ; 1 = MemoryDef(liveOnEntry)<br>

>>     store i16 0, i16* %0, align 2<br>

>>   ; 2 = MemoryDef(1)<br>

>>     store i32 0, i32* %1, align 4<br>

>>   ; 3 = MemoryDef(2)<br>

>>     store i16 0, i16* %arrayidx1, align 2<br>

>><br>

>>   ; 5 = MemoryDef(3)<br>

>>     call void @llvm.memset.p0i8.i64(i8* %2, i8 0, i64 8, i32 2, i1 false)<br>

>> ; new<br>

>><br>

>>   ; 4 = MemoryDef(3)<br>

>>     store i64 0, i64* %Q<br>

>>     ret void<br>

>>   }<br>

>><br>

>> This is already problematic because there's no way to selectively update<br>

>> the<br>

>> post-5 MemoryDefs that point to 5 instead of 3. Calling RAUW --<br>

>> specifically,<br>

>> replacing all uses of 3 with 5 -- would result in this:<br>

>><br>

><br>

</span><span class="">> I'm aware the existing APIs will not do what you want, because, as i said,<br>

> they are specifically and documented to only  be meant for replacement.<br>

><br>

><br>

> When i originally had API's to allow for more direct manipulation, folks<br>

> instead said "please build utilities to make common transforms work".<br>

><br>

> Thus, the existing API's make *replacement* work.<br>

<br>

</span>Why can't there be support for both replacement and insertion?</blockquote><div><br></div><div>Because random insertion can cause completely far flung effects that need updating arbitrarily far downstream.</div><div>But most legal movement and replacement cannot.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> How does<br>

supporting replacement preclude supporting insertion? Why can't there be entry<br>

points for both, with the ones for the latter labeled "Danger: Really Expensive.<br>

Will muck up the optimized MemoryUses."?<br></blockquote><div><br></div><div>I am, of course, willing to do that, i'm just trying to understand *is it actually necessary*.</div><div>So far, even for memcpyopt, i haven't found it to be necessary :)</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

><br>

> No amount of futzing is going to make them work well for insertion.<br>

><br>

</span><span class="">> If you want on the fly updating for stores, where you aren't doing<br>

> replacement, that's quite expensive.<br>

<br>

</span>What impact would insertions have? For a MemoryUse: Insert and optimize<br>

by searching upwards. </blockquote><div><br></div><div>Random MemoryUse insertion is now always easy, of course, because they do not create defs.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">For a MemoryDef: Insert, RAUW, and possibly regenerate the<br>

loads that pointed to the MemoryDefs that sit above us.</blockquote><div><br></div><div>This will not work</div><div>store bar</div><div>if (foo)<br>{<br>}</div><div>else</div><div>{</div><div>}</div><div><br></div><div>use bar.</div><div>If you insert a new store in either part of this branch, you need to create a phi node, and merge it, etc.</div><div><br></div><div>This is simple example, it gets arbitrarily hard.</div><div><br></div><div>Now, if you *replace* the store to bar with, say, a larger store to bar, that's easy.</div><div><br></div><div>But if you insert a new store, and then 50 years later, replace something else, that's going to be expensive to keep up to date.</div><div><br></div><div><br></div></div></div></div>