Hello all! I've been busy working on a somewhat larger patch than usual, and Duncan indicated I should get a copy out to the list. This is essentially the re-thinking of my approach after this discussion: <a href="http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120716/146864.html">http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120716/146864.html</a><div>

I want to emphasize this is still a bit rough around the edges. =]</div><div><br></div><div><br></div><div>There are several very serious problems with SROA today:</div><div><br></div><div>- It is subject to the bane of my existence in optimizations: arbitrary thresholds.</div>

<div>- It is overly conservative about what all constructs can be SROA-ed.</div><div>- The vector-value conversion is layered on top of the splitting logic, missing many opportunities.</div><div>- It splits based the underlying type of the alloca, regardless of how the alloca is accessed.</div>

<div>- When the splitting fails (often due to thresholds), the vector-value conversion kicks in for preposterous cases to form "bit vectors" with i1024 and other large types.</div><div><br></div><div>Let's talk about the i1024 stuff first. This is actually what motivated me to start looking at this space. There are a few PRs about how this causes the backend to essentially give up and go home. In many other cases, such as essentially every obvious implementation of sha1 or other cryptographic hashing algorithm, instead the backend makes a heroic effort and recovers from these i1024 values. The cost is in compile time. For a sha1 implementation recently posted as having compile times 2x that of GCC, essentially all of the time is spent computing the demanded bits for the i1024 values flying all over the code. No matter how heroic the backend gets, analyzing i1024 is fundamentally expensive given the structure of APInt and various analysis passes. We also can't just disable this aspect of the SROA pass because *it's the only thing we have* for arrays of more than 8 elements. This is particularly bad whenever the frontend lowers the alloca as an 'i8' array, because that means that at 9 bytes, SROA shuts off except for this conversion pass.</div>

<div><br></div><div>Next let's talk about splitting based on the underlying type. What happens with C++ unions? Bad code. ;] In many cases, one side of the union ends up getting nuked as dead code, and by only attending to the *use* of the memory, we catch this by design. Another interesting aspect of designing the analysis around the *use* of the memory is that it suddenly applies equally well to malloc'ed memory.</div>

<div><br></div><div><br></div><div>Ok, so an overview of the design of the now pass.</div><div><br></div><div>First, credit to Duncan Sands. He and I were chatting on IRC, and I don't remember who mentioned it first, but this design is the result of our discussion.</div>

<div><br></div><div>The idea is to build a partitioning of the allocated memory by walking the uses of that memory and observing the region of memory touched by the use. We can then build mini-use-lists of the operations which actually refer to each partition. From these we can rewrite each partition as its own alloca and its users to refer directly to that alloca. The type of the alloca can be computed by examining the users of the partition, etc. This is complicated because some uses of memory can be *split*, while others cannot. Take the canonical example: memcpy. We can split a memcpy into two or three smaller memcpys at will. However, we cannot (in the general case) so easily split a load or a store. More importantly, we likely would not wish to split a load or a store as the point of SROA is to make SSA-able values. Thus what we really want to do is build a partitioning that attempts to assign a disjoint partition to each collection of overlapping unsplittable uses. The splittable uses will then be split as necessary to satisfy this partitioning.</div>

<div><br></div><div>The result is a pass which can handle arbitrarily large arrays and structures and split single value loads and stores out of these large allocas by splitting up any large memcpy or memsets. The code growth is generally limited by the number of these partitions formed, not by the size of the original buffer. The number of partitions is a function of the number of input instructions. This prevents the scaling problems that initially caused the thresholds in the current pass. However, building these partitions is not easy. The new pass is significantly larger and has some rather complex parts. On the whole, I suspect it is worth it to get the increased power and flexibility.</div>

<div><br></div><div>The other major change is that I have tried to very heavily use InstVisitor to drive the pass rather than writing out large loops with nested dyn-cast chains. While the InstVisitor often ends up being significantly more lines of code, I think the code is much better factored as a consequence.</div>

<div><br></div><div><br></div><div>I've attached a patch file, but really there isn't much of a patch. It's brand new code. Probably the best way to read it is over on github: <a href="https://github.com/chandlerc/llvm-hacking/blob/sroa-rewrite/lib/Transforms/Scalar/SROA.cpp">https://github.com/chandlerc/llvm-hacking/blob/sroa-rewrite/lib/Transforms/Scalar/SROA.cpp</a></div>

<div><br></div><div>There are still two major missing features:</div><div>1) rewriting overlapping accesses which look like element accesses on a vector to actually be element accesses on a vector</div><div>2) adding an SSAUpdater mode for promotion</div>

<div><br></div><div>#1 I'm currently working on.</div><div>#2 I would prefer to do in-tree as I would rather have it be obvious as the code moves from one pass to the other. I don't expect any really significant changes to this logic.</div>

<div><br></div><div>I know that Ben has already read quite a bit of the implementation code, so CC-ing him. Any comments or early review much appreciated.</div><div><br></div><div><br></div>