[llvm-commits] PATCH: A new SROA implementation

Mon Aug 20 09:51:41 PDT 2012

On Mon, Aug 20, 2012 at 3:37 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> Hello all! I've been busy working on a somewhat larger patch than usual, and
> Duncan indicated I should get a copy out to the list. This is essentially
> the re-thinking of my approach after this discussion:
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120716/146864.html
> I want to emphasize this is still a bit rough around the edges. =]
>
>
> There are several very serious problems with SROA today:
>
> - It is subject to the bane of my existence in optimizations: arbitrary
> thresholds.
> - It is overly conservative about what all constructs can be SROA-ed.
> - The vector-value conversion is layered on top of the splitting logic,
> missing many opportunities.
> - It splits based the underlying type of the alloca, regardless of how the
> alloca is accessed.
> - When the splitting fails (often due to thresholds), the vector-value
> conversion kicks in for preposterous cases to form "bit vectors" with i1024
> and other large types.
>
> Let's talk about the i1024 stuff first. This is actually what motivated me
> to start looking at this space. There are a few PRs about how this causes
> the backend to essentially give up and go home. In many other cases, such as
> essentially every obvious implementation of sha1 or other cryptographic
> hashing algorithm, instead the backend makes a heroic effort and recovers
> from these i1024 values. The cost is in compile time. For a sha1
> implementation recently posted as having compile times 2x that of GCC,
> essentially all of the time is spent computing the demanded bits for the
> i1024 values flying all over the code. No matter how heroic the backend
> gets, analyzing i1024 is fundamentally expensive given the structure of
> APInt and various analysis passes. We also can't just disable this aspect of
> the SROA pass because *it's the only thing we have* for arrays of more than
> 8 elements. This is particularly bad whenever the frontend lowers the alloca
> as an 'i8' array, because that means that at 9 bytes, SROA shuts off except
> for this conversion pass.
>
> Next let's talk about splitting based on the underlying type. What happens
> with C++ unions? Bad code. ;] In many cases, one side of the union ends up
> getting nuked as dead code, and by only attending to the *use* of the
> memory, we catch this by design. Another interesting aspect of designing the
> analysis around the *use* of the memory is that it suddenly applies equally
> well to malloc'ed memory.
>
>
> Ok, so an overview of the design of the now pass.
>
> First, credit to Duncan Sands. He and I were chatting on IRC, and I don't
> remember who mentioned it first, but this design is the result of our
> discussion.
>
> The idea is to build a partitioning of the allocated memory by walking the
> uses of that memory and observing the region of memory touched by the use.
> We can then build mini-use-lists of the operations which actually refer to
> each partition. From these we can rewrite each partition as its own alloca
> and its users to refer directly to that alloca. The type of the alloca can
> be computed by examining the users of the partition, etc. This is
> complicated because some uses of memory can be *split*, while others cannot.
> Take the canonical example: memcpy. We can split a memcpy into two or three
> smaller memcpys at will. However, we cannot (in the general case) so easily
> split a load or a store. More importantly, we likely would not wish to split
> a load or a store as the point of SROA is to make SSA-able values. Thus what
> we really want to do is build a partitioning that attempts to assign a
> disjoint partition to each collection of overlapping unsplittable uses. The
> splittable uses will then be split as necessary to satisfy this
> partitioning.
>
> The result is a pass which can handle arbitrarily large arrays and
> structures and split single value loads and stores out of these large
> allocas by splitting up any large memcpy or memsets. The code growth is
> generally limited by the number of these partitions formed, not by the size
> of the original buffer. The number of partitions is a function of the number
> of input instructions. This prevents the scaling problems that initially
> caused the thresholds in the current pass. However, building these
> partitions is not easy. The new pass is significantly larger and has some
> rather complex parts. On the whole, I suspect it is worth it to get the
> increased power and flexibility.
>
> The other major change is that I have tried to very heavily use InstVisitor
> to drive the pass rather than writing out large loops with nested dyn-cast
> chains. While the InstVisitor often ends up being significantly more lines
> of code, I think the code is much better factored as a consequence.
>
>
> I've attached a patch file, but really there isn't much of a patch. It's
> brand new code. Probably the best way to read it is over on github:
> https://github.com/chandlerc/llvm-hacking/blob/sroa-rewrite/lib/Transforms/Scalar/SROA.cpp
>
> There are still two major missing features:
> 1) rewriting overlapping accesses which look like element accesses on a
> vector to actually be element accesses on a vector
> 2) adding an SSAUpdater mode for promotion
>
> #1 I'm currently working on.
> #2 I would prefer to do in-tree as I would rather have it be obvious as the
> code moves from one pass to the other. I don't expect any really significant
> changes to this logic.
>
> I know that Ben has already read quite a bit of the implementation code, so
> CC-ing him. Any comments or early review much appreciated.

It looks like you're missing handling for debug intrinsics.

I don't think your usage of inbounds for computing properties of GEPs is safe.

How does your code handle accesses to following union?

union DoubleBits {
  double x;
  int y[2];
};

I'm sort of worried about allowing SROA so much freedom in terms of
splitting memset/memcpy; your algorithm can turn a single memcpy into
an arbitrary number of memcpys (which is a quadratic explosion in
codesize in the worst case).  Not sure if it's an issue in practice.

Minor nit:

    if (SI.getOperand(0) == U.Ptr)
      return false;

This is overly tricky; SI.getOperand(1) != U.Ptr, please.

-Eli