[llvm-commits] PATCH: A new SROA implementation

Tue Aug 21 00:57:10 PDT 2012

On Mon, Aug 20, 2012 at 9:51 AM, Eli Friedman <eli.friedman at gmail.com>wrote:

> On Mon, Aug 20, 2012 at 3:37 AM, Chandler Carruth <chandlerc at gmail.com>
> wrote:
> > Hello all! I've been busy working on a somewhat larger patch than usual,
> and
> > Duncan indicated I should get a copy out to the list. This is essentially
> > the re-thinking of my approach after this discussion:
> >
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120716/146864.html
> > I want to emphasize this is still a bit rough around the edges. =]
> >
> >
> > There are several very serious problems with SROA today:
> >
> > - It is subject to the bane of my existence in optimizations: arbitrary
> > thresholds.
> > - It is overly conservative about what all constructs can be SROA-ed.
> > - The vector-value conversion is layered on top of the splitting logic,
> > missing many opportunities.
> > - It splits based the underlying type of the alloca, regardless of how
> the
> > alloca is accessed.
> > - When the splitting fails (often due to thresholds), the vector-value
> > conversion kicks in for preposterous cases to form "bit vectors" with
> i1024
> > and other large types.
> >
> > Let's talk about the i1024 stuff first. This is actually what motivated
> me
> > to start looking at this space. There are a few PRs about how this causes
> > the backend to essentially give up and go home. In many other cases,
> such as
> > essentially every obvious implementation of sha1 or other cryptographic
> > hashing algorithm, instead the backend makes a heroic effort and recovers
> > from these i1024 values. The cost is in compile time. For a sha1
> > implementation recently posted as having compile times 2x that of GCC,
> > essentially all of the time is spent computing the demanded bits for the
> > i1024 values flying all over the code. No matter how heroic the backend
> > gets, analyzing i1024 is fundamentally expensive given the structure of
> > APInt and various analysis passes. We also can't just disable this
> aspect of
> > the SROA pass because *it's the only thing we have* for arrays of more
> than
> > 8 elements. This is particularly bad whenever the frontend lowers the
> alloca
> > as an 'i8' array, because that means that at 9 bytes, SROA shuts off
> except
> > for this conversion pass.
> >
> > Next let's talk about splitting based on the underlying type. What
> happens
> > with C++ unions? Bad code. ;] In many cases, one side of the union ends
> up
> > getting nuked as dead code, and by only attending to the *use* of the
> > memory, we catch this by design. Another interesting aspect of designing
> the
> > analysis around the *use* of the memory is that it suddenly applies
> equally
> > well to malloc'ed memory.
> >
> >
> > Ok, so an overview of the design of the now pass.
> >
> > First, credit to Duncan Sands. He and I were chatting on IRC, and I don't
> > remember who mentioned it first, but this design is the result of our
> > discussion.
> >
> > The idea is to build a partitioning of the allocated memory by walking
> the
> > uses of that memory and observing the region of memory touched by the
> use.
> > We can then build mini-use-lists of the operations which actually refer
> to
> > each partition. From these we can rewrite each partition as its own
> alloca
> > and its users to refer directly to that alloca. The type of the alloca
> can
> > be computed by examining the users of the partition, etc. This is
> > complicated because some uses of memory can be *split*, while others
> cannot.
> > Take the canonical example: memcpy. We can split a memcpy into two or
> three
> > smaller memcpys at will. However, we cannot (in the general case) so
> easily
> > split a load or a store. More importantly, we likely would not wish to
> split
> > a load or a store as the point of SROA is to make SSA-able values. Thus
> what
> > we really want to do is build a partitioning that attempts to assign a
> > disjoint partition to each collection of overlapping unsplittable uses.
> The
> > splittable uses will then be split as necessary to satisfy this
> > partitioning.
> >
> > The result is a pass which can handle arbitrarily large arrays and
> > structures and split single value loads and stores out of these large
> > allocas by splitting up any large memcpy or memsets. The code growth is
> > generally limited by the number of these partitions formed, not by the
> size
> > of the original buffer. The number of partitions is a function of the
> number
> > of input instructions. This prevents the scaling problems that initially
> > caused the thresholds in the current pass. However, building these
> > partitions is not easy. The new pass is significantly larger and has some
> > rather complex parts. On the whole, I suspect it is worth it to get the
> > increased power and flexibility.
> >
> > The other major change is that I have tried to very heavily use
> InstVisitor
> > to drive the pass rather than writing out large loops with nested
> dyn-cast
> > chains. While the InstVisitor often ends up being significantly more
> lines
> > of code, I think the code is much better factored as a consequence.
> >
> >
> > I've attached a patch file, but really there isn't much of a patch. It's
> > brand new code. Probably the best way to read it is over on github:
> >
> https://github.com/chandlerc/llvm-hacking/blob/sroa-rewrite/lib/Transforms/Scalar/SROA.cpp
> >
> > There are still two major missing features:
> > 1) rewriting overlapping accesses which look like element accesses on a
> > vector to actually be element accesses on a vector
> > 2) adding an SSAUpdater mode for promotion
> >
> > #1 I'm currently working on.
> > #2 I would prefer to do in-tree as I would rather have it be obvious as
> the
> > code moves from one pass to the other. I don't expect any really
> significant
> > changes to this logic.
> >
> > I know that Ben has already read quite a bit of the implementation code,
> so
> > CC-ing him. Any comments or early review much appreciated.
>
> It looks like you're missing handling for debug intrinsics.
>

And lifetime intrinsics. Like I said, it's not complete. The big question
is when it is complete enough to go into the tree under a flag, vs. on by
default, etc.

I don't think your usage of inbounds for computing properties of GEPs is
> safe.
>

Nope. It's horribly broken. Thanks for pushing me on this issue, I had a
misunderstanding of how GEPs worked, because I assumed rather than reading
the spec carefully.

Funny thing? My assumption was based on the existing SROA pass which also
gets this wrong:

#include <stdio.h>
> struct S {
>   int x;
>   int __attribute__((vector_size(16))) v;
>   int y;
> };
> struct S f(int a, int b, int c, int i) __attribute__((noinline)) {
>   struct S s;
>   volatile int q = c;
>   s.x = a;
>   ((int*)&s.v)[i] = b;
>   s.y = q;
>   return s;
> }
> int main() {
>   struct S s = f(1, -1, 3, -1);
>   printf("%d %d\n", s.x, s.y);
> }

Clang miscompiles this program at -O2 because of SROA I think... At least,
the LLVM IR produced seems valid according to the spec of the IR, and the
transform applied by SROA seems invalid -- the index of -1 doesn't overflow
anything, it just walks past the end of a sequential type, which everything
indicates is allowed.

We happen to be getting away with this because this transform in the
existing SROA is only ever applied to vectors, never to arrays or pointers,
or anything that happens to be lowered as an array or pointer.

I'm inclined to completely remove this transform. It seems fundamentally
unsafe given the current spec of GEPs, and provides fairly limited benefit
if the frontend consistently lowers to insertelement and extractelement
(which it did in most of my experiments). If we want to support this
transform, I think we need to either extend GEP to have an optional
constraint on inbounds indexing within array and/or vector aggregates, or
we need to change the semantics of the existing 'inbounds' keyword to mean
that.

How does your code handle accesses to following union?
>
> union DoubleBits {
>   double x;
>   int y[2];
> };
>

Hmm, interesting test case. It introduces a bitcast that disrupts
everything. Let me look at this and related stuff and try to handle it
reasonably.

>
> I'm sort of worried about allowing SROA so much freedom in terms of
> splitting memset/memcpy; your algorithm can turn a single memcpy into
> an arbitrary number of memcpys (which is a quadratic explosion in
> codesize in the worst case).  Not sure if it's an issue in practice.
>

Yes, this is the primary concern with the new algorithm. That said, the N*M
which explodes requires N overlapping splittable operations which overlap
with M disjoint un-splittable operations. My hope is that this latice
structure is very rare. Even better, whenever we start heavily splitting,
we're getting something out of it -- we're successfully isolating
un-splittable partitions which should form candidates for promotion.

Do you have any ideas for mitigating or bounding the growth that don't
devolve to heuristics?

> Minor nit:
>
>     if (SI.getOperand(0) == U.Ptr)
>       return false;
>
> This is overly tricky; SI.getOperand(1) != U.Ptr, please.
>

Done.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120821/0082a1f1/attachment.html>