[llvm-commits] [cfe-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Hal Finkel hfinkel at anl.gov
Sat Dec 1 14:05:06 PST 2012


----- Original Message -----
> From: "Chandler Carruth" <chandlerc at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "llvm cfe" <cfe-commits at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>
> Sent: Saturday, December 1, 2012 2:51:24 AM
> Subject: Re: [cfe-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM
> 
> 
> 
> On Fri, Nov 30, 2012 at 4:14 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> 
> Hi everyone,
> 
> Many compilers provide a way, through either pragmas or intrinsics,
> for the user to assert stronger alignment requirements on a pointee
> than is otherwise implied by the pointer's type. gcc now provides an
> intrinsic for this purpose, __builtin_assume_aligned, and the
> attached patches (one for Clang and one for LLVM) implement that
> intrinsic using a corresponding LLVM intrinsic, and provide an
> infrastructure to take advantage of this new information.
> 
> ** BEGIN justification -- skip this if you don't care ;) **
> 
> 
> 
> I'd like to start off by saying, I think this is absolutely an
> important use case. =]
> 
> 
> <snip the attribute / type discussion ... i agree here>
> 
> 
> 
> With the intrinsic, this is easier:
> void foo(double *a, double *b) {
>   a = __builtin_assume_aligned(a, 16);
>   b = __builtin_assume_aligned(b, 16);
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> This code can be vectorized with aligned loads and stores, and even
> if it is not vectorized, will remain correct.
> 
> The second problem with the purely type-based approach is that it
> requires manual loop unrolling and inlining. Because the intrinsics
> are evaluated after inlining (and after loop unrolling), the
> optimizer can use the alignment assumptions specified in the caller
> when generating code for an inlined callee. This is a very important
> capability.
> 
> 
> I think this is going to be a very important point for me to
> understand fully. I see the problem with inlining, but not with
> unrolling. Can you describe what the issue is there?

For example, if I have a 32-byte aligned array of doubles, and I (partially) unroll a loop over the array, then I transform:
for (...; ++i) { load ...(i), align 8; }
into
for (...; i += 4) {
  load ...(i+0), align 8;
  load ...(i+1), align 8;
  load ...(i+2), align 8;
  load ...(i+3), align 8;
}
and then applying alignment assumptions gives:
for (...; i += 4) {
  load ...(i+0), align 32;
  load ...(i+1), align 8;
  load ...(i+2), align 16;
  load ...(i+3), align 8;
}
Once in this form, among other things, BBVectorize and/or the DAGCombiner can form vector loads from the scalar ones.

> 
> 
> <snip>
> 
> 
> 
> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM
> intrinsic is:
> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment,
> <int t2> offset)
> which asserts that the address returned is offset bytes above an
> address with the specified alignment. The attached patch makes some
> simple changes to several analysis passes (like BasicAA and SE) to
> allow them to 'look through' the intrinsic. It also adds a
> transformation pass that propagates the alignment assumptions to
> loads and stores directly dependent on the intrinsic's return
> value. Once this is done, the intrinsics are removed so that they
> don't interfere with the remaining optimizations.
> 
> 
> 
> I think this approach has more problems than you're currently dealing
> with. There are lots of passes that will run before loop unrolling
> and which will need to look through pointers. I'm extremely
> uncomfortable adding a *new* way to propagate a pointer from one SSA
> value to another,

Me too. Another option is just to make the intrinsic a user of the relevant pointer. This might be better.

> there are soooo many places in the compiler we
> will have to fix.

I think that I got all of them ;) 

> Your patch already has a very large amount of
> duplicated code scattered about a large collection of passes.

True. I may actually have done more than necessary; for the simple test cases, I really only needed to update SE and BasicAA. I also threw in some equivalent updates to CaptureTracking, ValueTracking, Loads, etc. because they seemed like logical places as well.

If I make the intrinsic just a no-capture user of the pointer, that may make all of this noise go away.

> I
> think this is an indication that this design isn't ideal, so I'd
> really like to see if we can come up with a better in-IR
> representation for this which will simplify how the rest of the
> optimizer interacts with it.
> 
> 
> I see a few options off the top of my head, but I don't yet
> understand the problem domain well enough to really dig into any of
> them. Specifically, I understand the transforms you're trying to
> enable, and the reason they aren't currently possible, but I don't
> understand the C/C++ code patterns you are imagining, and where in
> those code patterns this invariant will be introduced. If you can
> help with more examples maybe?
> 
> 
> Ok, on with the straw-man proposals:
> 
> 
> 1) Add support for declaring a pointer parameter's *value* to be
> aligned:
> void foo([[clang::aligned_pointer(16)]] double *a,
>          [[clang::aligned_pointer(16)]] double *b) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> 
> the idea here is that these alignment constraints would be
> substantially different from either the GCC alignment attributes or
> the C++11 alignment attributes. Note that these are actually very
> different, and Clang currently gets the C++11 one wrong.
> GCC alignment attribute: makes the *type* be aligned to the specified
> amount.
> C++11 alignment attribute: makes the *storage* be aligned to the
> specified amount. (I think.. I'll check w/ Richard on Monday)
> 
> 
> Interestingly enough, C++11's definition would almost work for you if
> the inliner flattens to the point at which storage is declared. My
> suggestion is just to be able to also mark a pointer itself at an
> interface boundary as carrying this constraint. It wouldn't change
> the type at all. You could imagine this working much like GCC's
> non-null attribute does.
> 
> 
> The way my suggestion might be implemented in LLVM: an attribute or
> metadata on the function parameters. Then we teach instcombine and
> loop passes to detect when this constraint on alignment can be used
> to promote the alignment of loads and stores. The loop passes can
> detect when unrolling has allowed the stride to exceed the
> alignment, and then promote the load and store alignment, etc.

That might work (although it would not provide gcc's ability to also specify the offset to the aligned address). It might also be more complicated to implement than the present approach with no corresponding benefit. One potential benefit on the frontend side could be type safety, but we'd need to work out the semantics.

> 
> 
> 
> 
> 2) Use an invariant based design much as Alex suggested. The syntax
> for this could be exactly the same as the GCC builtin you mention,
> or the same as my syntax in #1. This representation is I think a
> strictly more generalized LLVM IR model from the others under
> discussion.
> 
> 
> I have two completely off-the-cuff ideas about how to represent this.
> Others who've been thinking about it longer may have better ideas:
> a) Dead code to produce a CmpInst that the invariant asserts is true.
> %ptrtoint = ptrtoint double* %ptr to i64
> %lowbits = and i64 %ptrtoint, 0xF
> %invariant_cmp = icmp eq i64 %lowbits, 0
> call void @llvm.invariant(i1 %invariant_cmp)
> 
> 
> b) Dead code to produce two values which we assert are equivalent.
> Then when trying to optimize one, we can leverage things known about
> the other (such as simplify demanded bits, etc).
> %ptrtoint = ptrtoint double* %ptr to i64
> %maskedint = and i64 %ptrtoint, 0xFFFFFFFFFFFFFFF0
> %maskedptr = inttoptr i64 %maskedint to double*
> call void @llvm.equivalent(double* %ptr, double* %maskedptr)
> 
> 
> (I realize (b) here isn't really implementable w/ our current
> intrinsic overloading, but you get the idea)
> 
> 
> The difference is -- how do optimizations use them? With (b), the
> optimization can just query N equivalent values for any property. If
> it holds for any, it holds for all. With (a), we have to
> specifically interpret the predicate which is an invariant. While
> (a) seems like a bit of extra work, I actually prefer it because it
> seems more general, easier to understand, and to support fairly
> aggressive caching of invariants, etc.
> 
> 

This is also worth pursuing. Maybe the best way of doing this is to add some 'dead code' metadata. When the frontend generates code specifically for these invariants, that code is marked with the 'dead code' metadata. Instructions with this metadata can be ignored by the cost models used by the inliner, loop unroller, etc. The normal cleanup utilities that delete dead instructions will add this metadata to instructions used only by other instructions carrying the 'dead code' metadata. Then selection-DAG building will ignore instructions with this metadata (that way the invariants will still be available during codegen; if this is judged unnecessary, we can delete them all beforehand).

I imagine that we could enhance SE to return information on these invariants.

> 
> 
> 3) Your proposal, but handled more like __builtin_expect. Have a very
> early pass that converts the intrinsic into metadata attached to a
> particular pointer SSA value. Then teach passes to preserve the
> metadata, and have your pass promote alignment based on it.
> 

My fear is that touching all of the places that make or manipulate GEPs to preserve this metadata would be a large change. Also, it would only be able to deal with constant offsets. I don't know how big a deal it is to provide non-constant offsets, but it is easy to provide that ability with the intrinsic.

> 
> 
> 
> Of these, my preferences in order are #2, #3, and #1. I like #2
> because it would begin building a tremendously powerful and much
> needed piece of infrastructure in LLVM, and serve many needs in
> addition to alignment.

I think that there are two experiments to try:

1. invariants (perhaps with dead-code metadata as I've suggested)
2. making the currently-proposed intrinsic a non-capture user of the address, and not a pass-through (this would effectively be the case for the invariant approach as well)

> 
> 
> Thoughts? If you're interested in working on #2, I'd love to talk
> through the design....

I agree that, at least long term, this is the approach with the largest potential payoff. Let's talk ;)

Thanks again,
Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


