As with my responses to Duncan, I'm replying to issues of interest here, and will post a new patch after incorporating your feedback.<div><br></div><div>Thanks again for taking the time to look at this.<br><br><div class="gmail_quote">

On Wed, Aug 22, 2012 at 8:35 AM, Chris Lattner <span dir="ltr"><<a href="mailto:clattner@apple.com" target="_blank">clattner@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Aug 21, 2012, at 10:46 PM, Chandler Carruth <<a href="mailto:chandlerc@gmail.com">chandlerc@gmail.com</a>> wrote:<br>

> Ok, I think this patch is pretty close to being ready to go into the tree, modulo any detailed code review folks are willing to do. Just send me comments if you have them.<br>

<br>

</div>Very nice, I look forward to never seeing an i1024 from SRoA again :)<br>

<br>

Here are some initial thoughts, I haven't done a detailed algorithmic analysis of this yet:<br>

<div class="im"><br>

> I've run it through the nightly test suite now as well as a bootstrap, and everything looks better than clear, it looks great. ;]<br>

><br>

> I'm seeing some pretty significant execution performance improvements. One test is 27% faster, several range from 2% to 6% faster. Only a couple really slow down. I haven't dug in to the performance too much as currently we're still running the old SROA in the CGSCC passes.<br>


><br>

> I've still got some code cleanup, minor refactorings I'd like to do, and the SSAUpdater work, but I'm trying to not grow the patch on people.<br>

<br>

</div>+/// which are merely transfered in and out of external memory remain unchanged<br>

<br>

typo: transferred.<br>

<br>

+/// Because this also performs alloca promotion, it can be thought of as a also<br>

<br>

"as a also"<br>

<br>

Several #includes can be pruned out, GlobalVariable and Function.h at least.<br></blockquote><div><br></div><div>Yes, I've not pruned the includes from the old pass. Sorry about any confusion there, I just didn't want to mess w/ compilation errors.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

+  struct Partition {<br>

+  struct PartitionUse {<br>

<br>

Doxygen comments please.  These look similar, the comment should explain what they're for and why they're different.<br></blockquote><div><br></div><div>Done, and one better. They now share a common base to factor away the commonality.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

+  /// \brief MemTransferInst auxilliary data.<br>

<br>

typo "auxilliary" in a few places.<br>

<br>

<br>

<br>

I'm not crazy about template instantiation of your entire "BuilderBase" class, how much code size does this cost, to what benefit?<br></blockquote><div><br></div><div>I don't know, and frankly I didn't even measure it. It's four relatively small functions, and some instance variables. I mostly factored it out because I was tired of fixing bugs in two places, and sharing the instance variables was a real pain.</div>

<div><br></div><div>I can't imagine this is a big deal, at two copies of these functions. If it is, I would actually blame the compiler. ;] At a fundamental level I don't see a lot of easy ways to make the base class not be templated. I tried a few times and it always ended up really uglying the code up....</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

+static Value *getNaturalGEPWithType(IRBuilder<> IRB, const TargetData &TD,<br>

+                                    Value *BasePtr, Type *Ty, Type *TargetTy,<br>

+                                    SmallVectorImpl<Value *> &Indices,<br>

+                                    const Twine &Prefix) {<br>

...<br>

+    if (SequentialType *SeqTy = dyn_cast<SequentialType>(ElementTy)) {<br>

<br>

Do we really want to form GEP's into vectors?<br></blockquote><div><br></div><div>Hmm, I suppose I would be OK having the rewriter canonicalize *all* vector accesses as whole vector accesses with insert element and extract element. However, in the case that this is buried within a much larger unpromoteable structure, and we may never access the vector as a whole, I worry a bit about actually making the code bigger as a consequence.</div>

<div><br></div><div>It seems more important to work hard at doing this when there is a chance of promotion.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

<br>

+class SROA : public FunctionPass {<br>

+bool SROA::rewriteAllocaPartition(AllocaInst &AI,<br>

...<br>

<br>

Doxygen comments please.<br></blockquote><div><br></div><div>Yes, many many doxygen comments (and non-doxygen comments) have been added.</div><div><br></div><div>Is my mixture of doxygen / non-doxygen comments problematic? I tend to use doxygen comments whenever the comment is really documenting an interface, and not when the comment only makes since as an aid to the reader of the code itself.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

getNaturalGEPWithType and others can be simplified through the use of IRB.getInt32()<br></blockquote><div><br></div><div>Wow, yea. I'm sad I kept missing these helpers when reading IRBuilder.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

getNaturalGEPWithOffset, getAdjustedPtr, etc should pass APInts by const reference.<br>

<br>

TargetTy != IRB.getInt8Ty()<br>

-> should be "!TargetTy->isIntegerTy(8)" for efficiency.<br>

<br>

<br>

+    // Find the operand we need to rewrite here.<br>

+    User::op_iterator OI = PN.op_begin(), OE = PN.op_end();<br>

+    for (; OI != OE && *OI != OldPtr; ++OI)<br>

+      ;<br>

+    assert(OI != OE);<br>

<br>

Is it feasible to not lose the Use index?  PHI nodes can have a lot of operands to scan though.<br></blockquote><div><br></div><div>Well, this code is really just broken.</div><div><br></div><div>There isn't a single index -- the PHI might use the same pointer in several different operands. We unique stuff so we only will visit the PHI node once for a particular used value, but we need to replace that value anywhere within the PHI. I've fixed the code to do this. The only other way I can see to write this is to walk the use list of the OldPtr and check each User is the PHI node? Not sure that would be any cheaper.</div>

<div><br>Note that we also do a full linear walk of the operands when doing speculation. =/</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">bool SROA::runOnAlloca(AllocaInst &AI) {<br>


...<br>

<br>

This code handles the case when something is a trivially promotable scalar (an important base case for SRoA) but it doesn't handle the case where we have a trivial aggregate aggregate being accessed in a "type safe" way.  Is this an interesting case to handle specially to improve compile time?<br>

</blockquote><div><br></div><div>Can you clarify what you mean here by 'trivial aggregate aggregate being accessed in a "type safe" way"? I'm not sure which exact case you're referring to.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">+  // No partitions to split. Leave the dead alloca for a later pass to clean up.<br>

+  if (P.begin() == P.end())<br>

<br>

P.empty()?<br></blockquote><div><br></div><div>One more method I have to add to AllocaPartitioning. With only one callsite it hardly seemed worth it.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<br>

+void SROA::deleteDeadInstructions() {<br>

<br>

Is this the same as RecursivelyDeleteTriviallyDeadInstructions?</blockquote><div><br></div><div>Yep, it's just an attempt to decrease the malloc traffic by re-using the nice scratch space we already have. I also cheat and clear some of the state at the same time. It also keeps nice statistics. Not sure how much its worth re-using the recursive function, the bulk of the work is in isInstructionTriviallyDead which it already uses.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">  Perhaps RecursivelyDeleteTriviallyDeadInstructions could be enhanced to take an optional pointer to a vector, where (when present) it records all the instruction*'s it deletes?<br>

</blockquote><div><br></div><div>Yea, I can see a way to factor the existing utility function to be what I want here. I can do that if you want either now or later.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

It would be nice if the "load speculation" stuff was pulled out of AllocaPartitionRewriter as some sort of orthogonal transformation.  Perhaps the initial analysis of an alloca can build a list of "selects and phis that need to be speculated" in order to unblock the transformation, and they could be processed independently of the other heavy lifting going on.<br>

</blockquote><div><br></div><div>This is definitely something I think should happen eventually. There should be a simple function call that does the speculation for a PHI node or a select. That said, I don't really want to do this until the pass stabilizes a bit in the interest of avoiding hoisting and abstracting the wrong logic here.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">It might make sense to split this pass up into two files to separate the partition analysis code from the transformation and utility code.  Just a single header file (in Transforms/Scalar) might be enough, or go with another header + .cpp file in Transforms/Scalar.  What do you think?<br>

</blockquote><div><br></div><div>Meh. I mean, I see how I could split it up like that, definitely, but it doesn't quite seem worth it at this stage. Maybe later as it gets more complicated?</div><div><br></div><div>I would vote doing this later if at all. It'd just get in my way to have it in multiple files right now.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">FYI, on a random note, Dan Gohman is looking enhancing memcpy to get TBAA and "hole" information per PR12260/PR11946, which seems like it should drop in nicely.</blockquote>

<div><br></div><div>Yep. I have been talking with him (and thanks for forwarding the proposal while the mailing lists were down). I actually have a counter proposal for one particular aspect of it I need to write up. ;] Anyways, thanks for the feedback. I'm going to try to squash a few crashers Duncan found and then post an updated patch.</div>

</div></div>