[llvm-commits] PATCH: A new SROA implementation
chandlerc at gmail.com
Mon Aug 20 05:34:18 PDT 2012
Adding llvm-dev to this thread as this is probably worth wider discussion.
See the original patch email below.
If you're sending just code-related comments, consider dropping llvm-dev.
On Mon, Aug 20, 2012 at 3:37 AM, Chandler Carruth <chandlerc at gmail.com>wrote:
> Hello all! I've been busy working on a somewhat larger patch than usual,
> and Duncan indicated I should get a copy out to the list. This is
> essentially the re-thinking of my approach after this discussion:
> I want to emphasize this is still a bit rough around the edges. =]
> There are several very serious problems with SROA today:
> - It is subject to the bane of my existence in optimizations: arbitrary
> - It is overly conservative about what all constructs can be SROA-ed.
> - The vector-value conversion is layered on top of the splitting logic,
> missing many opportunities.
> - It splits based the underlying type of the alloca, regardless of how the
> alloca is accessed.
> - When the splitting fails (often due to thresholds), the vector-value
> conversion kicks in for preposterous cases to form "bit vectors" with i1024
> and other large types.
> Let's talk about the i1024 stuff first. This is actually what motivated me
> to start looking at this space. There are a few PRs about how this causes
> the backend to essentially give up and go home. In many other cases, such
> as essentially every obvious implementation of sha1 or other cryptographic
> hashing algorithm, instead the backend makes a heroic effort and recovers
> from these i1024 values. The cost is in compile time. For a sha1
> implementation recently posted as having compile times 2x that of GCC,
> essentially all of the time is spent computing the demanded bits for the
> i1024 values flying all over the code. No matter how heroic the backend
> gets, analyzing i1024 is fundamentally expensive given the structure of
> APInt and various analysis passes. We also can't just disable this aspect
> of the SROA pass because *it's the only thing we have* for arrays of more
> than 8 elements. This is particularly bad whenever the frontend lowers the
> alloca as an 'i8' array, because that means that at 9 bytes, SROA shuts off
> except for this conversion pass.
> Next let's talk about splitting based on the underlying type. What happens
> with C++ unions? Bad code. ;] In many cases, one side of the union ends up
> getting nuked as dead code, and by only attending to the *use* of the
> memory, we catch this by design. Another interesting aspect of designing
> the analysis around the *use* of the memory is that it suddenly applies
> equally well to malloc'ed memory.
> Ok, so an overview of the design of the now pass.
> First, credit to Duncan Sands. He and I were chatting on IRC, and I don't
> remember who mentioned it first, but this design is the result of our
> The idea is to build a partitioning of the allocated memory by walking the
> uses of that memory and observing the region of memory touched by the use.
> We can then build mini-use-lists of the operations which actually refer to
> each partition. From these we can rewrite each partition as its own alloca
> and its users to refer directly to that alloca. The type of the alloca can
> be computed by examining the users of the partition, etc. This is
> complicated because some uses of memory can be *split*, while others
> cannot. Take the canonical example: memcpy. We can split a memcpy into two
> or three smaller memcpys at will. However, we cannot (in the general case)
> so easily split a load or a store. More importantly, we likely would not
> wish to split a load or a store as the point of SROA is to make SSA-able
> values. Thus what we really want to do is build a partitioning that
> attempts to assign a disjoint partition to each collection of overlapping
> unsplittable uses. The splittable uses will then be split as necessary to
> satisfy this partitioning.
> The result is a pass which can handle arbitrarily large arrays and
> structures and split single value loads and stores out of these large
> allocas by splitting up any large memcpy or memsets. The code growth is
> generally limited by the number of these partitions formed, not by the size
> of the original buffer. The number of partitions is a function of the
> number of input instructions. This prevents the scaling problems that
> initially caused the thresholds in the current pass. However, building
> these partitions is not easy. The new pass is significantly larger and has
> some rather complex parts. On the whole, I suspect it is worth it to get
> the increased power and flexibility.
> The other major change is that I have tried to very heavily use
> InstVisitor to drive the pass rather than writing out large loops with
> nested dyn-cast chains. While the InstVisitor often ends up being
> significantly more lines of code, I think the code is much better factored
> as a consequence.
> I've attached a patch file, but really there isn't much of a patch. It's
> brand new code. Probably the best way to read it is over on github:
> There are still two major missing features:
> 1) rewriting overlapping accesses which look like element accesses on a
> vector to actually be element accesses on a vector
> 2) adding an SSAUpdater mode for promotion
> #1 I'm currently working on.
> #2 I would prefer to do in-tree as I would rather have it be obvious as
> the code moves from one pass to the other. I don't expect any really
> significant changes to this logic.
> I know that Ben has already read quite a bit of the implementation code,
> so CC-ing him. Any comments or early review much appreciated.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-commits