[cfe-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Chandler Carruth chandlerc at google.com
Sat Dec 1 00:51:24 PST 2012


On Fri, Nov 30, 2012 at 4:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> Hi everyone,
>
> Many compilers provide a way, through either pragmas or intrinsics, for
> the user to assert stronger alignment requirements on a pointee than is
> otherwise implied by the pointer's type. gcc now provides an intrinsic for
> this purpose, __builtin_assume_aligned, and the attached patches (one for
> Clang and one for LLVM) implement that intrinsic using a corresponding LLVM
> intrinsic, and provide an infrastructure to take advantage of this new
> information.
>
> ** BEGIN justification -- skip this if you don't care ;) **
>

I'd like to start off by saying, I think this is absolutely an important
use case. =]

<snip the attribute / type discussion ... i agree here>

> With the intrinsic, this is easier:
> void foo(double *a, double *b) {
>   a = __builtin_assume_aligned(a, 16);
>   b = __builtin_assume_aligned(b, 16);
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> this code can be vectorized with aligned loads and stores, and even if it
> is not vectorized, will remain correct.
>
> The second problem with the purely type-based approach is that it
> requires manual loop unrolling and inlining. Because the intrinsics are
> evaluated after inlining (and after loop unrolling), the optimizer can use
> the alignment assumptions specified in the caller when generating code for
> an inlined callee. This is a very important capability.


I think this is going to be a very important point for me to understand
fully. I see the problem with inlining, but not with unrolling. Can you
describe what the issue is there?

<snip>

Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM
> intrinsic is:
> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int
> t2> offset)
> which asserts that the address returned is offset bytes above an address
> with the specified alignment. The attached patch makes some simple changes
> to several analysis passes (like BasicAA and SE) to allow them to 'look
> through' the intrinsic. It also adds a transformation pass that propagates
> the alignment assumptions to loads and stores directly dependent on the
> intrinsic's return value. Once this is done, the intrinsics are removed so
> that they don't interfere with the remaining optimizations.
>
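
Just so I'm sure I'm reading the proposal right: for the loop above, I take
it the IR would look roughly like this (my guess at the overload mangling;
the comments show the alignment your pass would eventually propagate):

%a.aligned = call double* @llvm.assume.aligned.p0f64.i32(double* %a, i32 16, i32 0)
%b.aligned = call double* @llvm.assume.aligned.p0f64.i32(double* %b, i32 16, i32 0)
; ... loop body, one iteration shown ...
%v = load double* %b.aligned           ; promoted to 'align 16' by your pass
%add = fadd double %v, 1.000000e+00
store double %add, double* %a.aligned  ; likewise promoted to 'align 16'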

I think this approach has more problems than you're currently dealing with.
There are lots of passes that will run before loop unrolling and which will
need to look through pointers. I'm extremely uncomfortable adding a *new*
way to propagate a pointer from one SSA value to another; there are soooo
many places in the compiler we would have to fix. Your patch already has a
very large amount of duplicated code scattered across a large collection of
passes. I think this is an indication that this design isn't ideal, so I'd
really like to see if we can come up with a better in-IR representation for
this which will simplify how the rest of the optimizer interacts with it.

I see a few options off the top of my head, but I don't yet understand the
problem domain well enough to really dig into any of them. Specifically, I
understand the transforms you're trying to enable, and the reason they
aren't currently possible, but I don't understand the C/C++ code patterns
you are imagining, and where in those code patterns this invariant will be
introduced. Maybe you can help with a few more examples?

Ok, on with the straw-man proposals:

1) Add support for declaring a pointer parameter's *value* to be aligned:
void foo([[clang::aligned_pointer(16)]] double *a,
         [[clang::aligned_pointer(16)]] double *b) {
  for (int i = 0; i < N; ++i)
    a[i] = b[i] + 1;
}
The idea here is that these alignment constraints would be substantially
different from either the GCC alignment attributes or the C++11 alignment
attributes. Note that these are actually very different, and Clang
currently gets the C++11 one wrong.
GCC alignment attribute: makes the *type* be aligned to the specified
amount.
C++11 alignment attribute: makes the *storage* be aligned to the specified
amount. (I think... I'll check w/ Richard on Monday.)

Interestingly enough, C++11's definition would almost work for you if the
inliner flattens to the point at which storage is declared. My suggestion
is just to be able to also mark a pointer itself at an interface boundary
as carrying this constraint. It wouldn't change the type at all. You could
imagine this working much like GCC's nonnull attribute does.

The way my suggestion might be implemented in LLVM: an attribute or
metadata on the function parameters. Then we teach instcombine and loop
passes to detect when this constraint on alignment can be used to promote
the alignment of loads and stores. The loop passes can detect when
unrolling has allowed the stride to exceed the alignment, and then promote
the load and store alignment, etc.
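
In the IR, I imagine that looking something like this (purely hypothetical
spelling for an alignment parameter attribute, by analogy with how noalias
and nocapture are attached today; loop elided, one iteration shown):

define void @foo(double* align 16 %a, double* align 16 %b) {
entry:
  %v = load double* %b                 ; instcombine promotes to 'align 16'
  %add = fadd double %v, 1.000000e+00
  store double %add, double* %a        ; likewise promoted to 'align 16'
  ret void
}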


2) Use an invariant-based design, much as Alex suggested. The syntax for
this could be exactly the same as the GCC builtin you mention, or the same
as my syntax in #1. This representation is, I think, a strictly more
general LLVM IR model than the others under discussion.

I have two completely off-the-cuff ideas about how to represent this.
Others who've been thinking about it longer may have better ideas:
a) Dead code to produce a CmpInst that the invariant asserts is true.
%ptrtoint = ptrtoint double* %ptr to i64
%lowbits = and i64 %ptrtoint, 15  ; 0xF
%invariant_cmp = icmp eq i64 %lowbits, 0
call void @llvm.invariant(i1 %invariant_cmp)

b) Dead code to produce two values which we assert are equivalent. Then
when trying to optimize one, we can leverage things known about the other
(such as simplify demanded bits, etc).
%ptrtoint = ptrtoint double* %ptr to i64
%maskedint = and i64 %ptrtoint, -16  ; 0xFFFFFFFFFFFFFFF0
%maskedptr = inttoptr i64 %maskedint to double*
call void @llvm.equivalent(double* %ptr, double* %maskedptr)

(I realize (b) here isn't really implementable w/ our current intrinsic
overloading, but you get the idea)

The difference is -- how do optimizations use them? With (b), the
optimization can just query N equivalent values for any property: if it
holds for any of them, it holds for all of them. With (a), we have to
specifically interpret the predicate asserted by the invariant. While (a)
seems like a bit of extra work, I actually prefer it because it seems more
general, easier to understand, and better suited to fairly aggressive
caching of invariants, etc.
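
For example, with (a) an alignment-promotion pass that understands the
and/icmp pattern could do something like this (sketch only, value names
invented):

%v.before = load double* %ptr, align 8   ; what the frontend emitted
; a pass interpreting the icmp above knows the low 4 bits of %ptr are zero,
; i.e. %ptr is 16-byte aligned, so it can legally rewrite the load as:
%v.after = load double* %ptr, align 16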


3) Your proposal, but handled more like __builtin_expect. Have a very early
pass that converts the intrinsic into metadata attached to a particular
pointer SSA value. Then teach passes to preserve the metadata, and have
your pass promote alignment based on it.
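
Concretely, with an invented metadata spelling (and since metadata only
hangs off instructions, the early pass would presumably have to leave some
instruction behind to carry it, e.g. a no-op cast of the pointer):

%a.aligned = bitcast double* %a to double*, !assume.aligned !0

!0 = metadata !{i32 16, i32 0}  ; alignment, offset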


Of these, my preferences in order are #2, #3, and #1. I like #2 because it
would begin building a tremendously powerful and much needed piece of
infrastructure in LLVM, and serve many needs in addition to alignment.

Thoughts? If you're interested in working on #2, I'd love to talk through
the design....