[llvm-commits] PATCH: A new SROA implementation
chandlerc at gmail.com
Mon Aug 20 03:37:02 PDT 2012
Hello all! I've been busy working on a somewhat larger patch than usual,
and Duncan indicated I should get a copy out to the list. This is
essentially the re-thinking of my approach after this discussion:
I want to emphasize this is still a bit rough around the edges. =]
There are several very serious problems with SROA today:
- It is subject to the bane of my existence in optimizations: arbitrary
thresholds.
- It is overly conservative about which constructs can be SROA-ed.
- The vector-value conversion is layered on top of the splitting logic,
missing many opportunities.
- It splits based on the underlying type of the alloca, regardless of how
the alloca is actually accessed.
- When the splitting fails (often due to thresholds), the vector-value
conversion kicks in for preposterous cases to form "bit vectors" with i1024
and other large types.
Let's talk about the i1024 stuff first. This is actually what motivated me
to start looking at this space. There are a few PRs about how this causes
the backend to essentially give up and go home. In many other cases, such
as essentially every obvious implementation of sha1 or another cryptographic
hashing algorithm, the backend instead makes a heroic effort and recovers
from these i1024 values. The cost is in compile time. For a sha1
implementation recently posted as having compile times 2x that of GCC,
essentially all of the time is spent computing the demanded bits for the
i1024 values flying all over the code. No matter how heroic the backend
gets, analyzing i1024 is fundamentally expensive given the structure of
APInt and various analysis passes. We also can't just disable this aspect
of the SROA pass because *it's the only thing we have* for arrays of more
than 8 elements. This is particularly bad whenever the frontend lowers the
alloca to an 'i8' array, because then at just 9 bytes the splitting shuts
off and only this conversion pass remains.
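To make the failure mode concrete, here is a hypothetical reduction of the
pattern (my own example, not one of the actual PR test cases); imagine the
loops fully unrolled so every access uses a constant index:

  // A hash-style kernel over a local byte buffer. The frontend lowers
  // 'block' to a 128 x i8 array alloca. With more than 8 elements the
  // splitting logic bows out, and the vector-value conversion is the only
  // transform left -- it can end up rewriting the whole buffer as a single
  // i1024 integer for the backend to chew on.
  unsigned mix(const unsigned char *data) {
    unsigned char block[128];
    for (int i = 0; i < 128; ++i)
      block[i] = data[i] ^ 0x5c;
    unsigned h = 0;
    for (int i = 0; i < 128; ++i)
      h = h * 31 + block[i];
    return h;
  }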
Next let's talk about splitting based on the underlying type. What happens
with C++ unions? Bad code. ;] In many cases, one side of the union ends up
getting nuked as dead code, and by only attending to the *use* of the
memory, we catch this by design. Another interesting aspect of designing
the analysis around the *use* of the memory is that it suddenly applies
equally well to malloc'ed memory.
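Here is a minimal illustration (my own example, not from the patch's tests).
Splitting by the alloca's declared type would have to honor the double
member; walking the uses sees only the i32 side, and the dead double side
never constrains the rewrite:

  union Pun {
    double d;   // never touched below: dead by inspection of the *uses*
    int i[2];
  };

  int low_word(int x) {
    Pun p;
    p.i[0] = x;     // only the int side of the union is ever accessed,
    p.i[1] = 0;     // so a use-based partitioning sees two i32 slots
    return p.i[0];
  }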
Ok, so an overview of the design of the new pass.
First, credit to Duncan Sands. He and I were chatting on IRC, and I don't
remember who mentioned it first, but this design is the result of our
conversation.
The idea is to build a partitioning of the allocated memory by walking the
uses of that memory and observing the region of memory touched by the use.
We can then build mini-use-lists of the operations which actually refer to
each partition. From these we can rewrite each partition as its own alloca
and its users to refer directly to that alloca. The type of the alloca can
be computed by examining the users of the partition, etc. This is
complicated because some uses of memory can be *split*, while others
cannot. Take the canonical example: memcpy. We can split a memcpy into two
or three smaller memcpys at will. However, we cannot (in the general case)
so easily split a load or a store. More importantly, we likely would not
wish to split a load or a store as the point of SROA is to make SSA-able
values. Thus what we really want to do is build a partitioning that
attempts to assign a disjoint partition to each collection of overlapping
unsplittable uses. The splittable uses will then be split as necessary to
satisfy this partitioning.
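To make that concrete, here is a minimal, self-contained sketch of the
partition-building step -- hypothetical names and shapes, not the data
structures in the patch. Each use is reduced to a byte range plus a
splittable flag; overlapping unsplittable ranges merge into one partition,
and splittable uses would later be chopped at the resulting boundaries:

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  struct MemUse {
    unsigned Begin, End; // byte offsets into the alloca: [Begin, End)
    bool Splittable;     // memcpy/memset: true; load/store: false
  };

  struct Partition {
    unsigned Begin, End;
  };

  static bool startsEarlier(const MemUse &A, const MemUse &B) {
    return A.Begin < B.Begin;
  }

  // Merge overlapping *unsplittable* uses into disjoint partitions; the
  // splittable uses never force a partition's shape.
  static std::vector<Partition> buildPartitions(std::vector<MemUse> Uses) {
    std::sort(Uses.begin(), Uses.end(), startsEarlier);
    std::vector<Partition> Parts;
    for (unsigned i = 0, e = Uses.size(); i != e; ++i) {
      if (Uses[i].Splittable)
        continue;
      if (!Parts.empty() && Uses[i].Begin < Parts.back().End) {
        // Overlaps the last partition: grow it to cover this use too.
        Parts.back().End = std::max(Parts.back().End, Uses[i].End);
      } else {
        Partition P = { Uses[i].Begin, Uses[i].End };
        Parts.push_back(P);
      }
    }
    return Parts;
  }

  int main() {
    // A 16-byte alloca with one 16-byte memcpy (splittable) overlapping two
    // 8-byte loads (unsplittable): we get two 8-byte partitions, and the
    // memcpy would be rewritten as two 8-byte copies to match.
    MemUse C = { 0, 16, true }, L0 = { 0, 8, false }, L1 = { 8, 16, false };
    std::vector<MemUse> Uses;
    Uses.push_back(C); Uses.push_back(L0); Uses.push_back(L1);
    std::vector<Partition> Parts = buildPartitions(Uses);
    for (unsigned i = 0; i != Parts.size(); ++i)
      std::printf("partition [%u, %u)\n", Parts[i].Begin, Parts[i].End);
    return 0;
  }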
The result is a pass which can handle arbitrarily large arrays and
structures, splitting single-value loads and stores out of these large
allocas by splitting up any large memcpys or memsets. The code growth is
generally limited by the number of these partitions formed, not by the size
of the original buffer. The number of partitions is a function of the
number of input instructions. This avoids the scaling problems that
originally motivated the thresholds in the current pass. However, building
these partitions is not easy. The new pass is significantly larger and has
some rather complex parts. On the whole, I suspect it is worth it to get
the increased power and flexibility.
The other major change is that I have tried to very heavily use InstVisitor
to drive the pass rather than writing out large loops with nested dyn-cast
chains. While the InstVisitor approach often ends up being significantly
more lines of code, I think the code is much better factored as a
consequence.
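For reference, the shape of the visitor is roughly as follows (hypothetical
class and comments, not the one in the patch). InstVisitor uses CRTP, so
each instruction kind dispatches to a typed visit method rather than a long
if/dyn_cast chain:

  #include "llvm/Support/InstVisitor.h"

  namespace {
  // Walks the uses of an alloca, classifying each one.
  class AllocaUseVisitor : public llvm::InstVisitor<AllocaUseVisitor> {
  public:
    void visitLoadInst(llvm::LoadInst &LI) {
      // record an unsplittable use covering the loaded type's size
    }
    void visitStoreInst(llvm::StoreInst &SI) {
      // record an unsplittable use for the stored value
    }
    void visitMemTransferInst(llvm::MemTransferInst &II) {
      // memcpy/memmove: record a use that can be split at partition edges
    }
    void visitInstruction(llvm::Instruction &I) {
      // conservative fallback: anything unrecognized blocks the rewrite
    }
  };
  } // anonymous namespace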
I've attached a patch file, but really there isn't much of a patch. It's
brand new code. Probably the best way to read it is over on github:
There are still two major missing features:
1) rewriting overlapping accesses which look like element accesses on a
vector so that they actually are element accesses on a vector
2) adding an SSAUpdater mode for promotion
#1 I'm currently working on.
#2 I would prefer to do in-tree as I would rather have it be obvious as the
code moves from one pass to the other. I don't expect any really
significant changes to this logic.
I know that Ben has already read quite a bit of the implementation code, so
I'm CC-ing him. Any comments or early review are much appreciated.