<div style="font-family: arial, helvetica, sans-serif; font-size: 10pt"><div dir="ltr">On Fri, Nov 30, 2012 at 4:14 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank" class="cremed">hfinkel@anl.gov</a>></span> wrote:<br>
<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Hi everyone,<br>
<br>
Many compilers provide a way, through either pragmas or intrinsics, for the user to assert stronger alignment requirements on a pointee than is otherwise implied by the pointer's type. gcc now provides an intrinsic for this purpose, __builtin_assume_aligned, and the attached patches (one for Clang and one for LLVM) implement that intrinsic using a corresponding LLVM intrinsic, and provide an infrastructure to take advantage of this new information.<br>
<br>
** BEGIN justification -- skip this if you don't care ;) **<br></blockquote><div><br></div><div>I'd like to start off by saying, I think this is absolutely an important use case. =]</div><div><br></div><div style>
<snip the attribute / type discussion ... i agree here></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
With the intrinsic, this is easier:<br>
foo (double *a, double *b) {<br>
a = __builtin_assume_aligned(a, 16);<br>
b = __builtin_assume_aligned(b, 16);<br>
for (int i = 0; i < N; ++i)<br>
a[i] = b[i] + 1;<br>
}<br>
this code can be vectorized with aligned loads and stores, and even if it is not vectorized, will remain correct.<br>
<br>
The second problem with the purely type-based approach, is that it requires manual loop unrolling an inlining. Because the intrinsics are evaluated after inlining (and after loop unrolling), the optimizer can use the alignment assumptions specified in the caller when generating code for an inlined callee. This is a very important capability.</blockquote>
<div><br></div><div style>I think this is going to be a very important point for me to understand fully. I see the problem with inlining, but not with unrolling. Can you describe what the issue is there?</div><div><br></div>
<div style><snip></div><div style><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic is:<br>
<t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)<br>
which asserts that the address returned is offset bytes above an address with the specified alignment. The attached patch makes some simple changes to several analysis passes (like BasicAA and SE) to allow them to 'look through' the intrinsic. It also adds a transformation pass that propagates the alignment assumptions to loads and stores directly dependent on the intrinsics's return value. Once this is done, the intrinsics are removed so that they don't interfere with the remaining optimizations.<br>
</blockquote><div><br></div><div style>I think this approach has more problems than you're currently dealing with. There are lots of passes that will run before loop unrolling and which will need to look through pointers. I'm extremely uncomfortable adding a *new* way to propagate a pointer from one SSA value to another, there are soooo many places in the compiler we will have to fix. Your patch already has a very large amount of duplicated code scattered about a large collection of passes. I think this is an indication that this design isn't ideal, so I'd really like to see if we can come up with a better in-IR representation for this which will simplify how the rest of the optimizer interacts with it.</div>
<div style><br></div><div style>I see a few options off the top of my head, but I don't yet understand the problem domain well enough to really dig into any of them. Specifically, I understand the transforms you're trying to enable, and the reason they aren't currently possible, but I don't understand the C/C++ code patterns you are imagining, and where in those code patterns this invariant will be introduced. If you can help with more examples maybe?</div>
<div style><br></div><div style>Ok, on with the straw-man proposals:</div><div style><br></div><div style>1) Add support for declaring a pointer parameter's *value* to be aligned:</div><div style><font face="courier new, monospace">void foo([[clang::aligned_pointer(16)]] double *a,</font></div>
<div style><font face="courier new, monospace"> [[clang::aligned_pointer(16)]] double *b) {<br> for (int i = 0; i < N; ++i)<br> a[i] = b[i] + 1;<br>}<br></font></div><div style>the idea here is that these alignment constraints would be substantially different from either the GCC alignment attributes or the C++11 alignment attributes. Note that these are actually very different, and Clang currently gets the C++11 one wrong.</div>
<div style>GCC alignment attribute: makes the *type* be aligned to the specified amount.</div><div style>C++11 alignment attribute: makes the *storage* be aligned to the specified amount. (I think.. I'll check w/ Richard on Monday)</div>
<div style><br></div><div style>Interestingly enough, C++11's definition would almost work for you if the inliner flattens to the point at which storage is declared. My suggestion is just to be able to also mark a pointer itself at an interface boundary as carrying this constraint. It wouldn't change the type at all. You could imagine this working much like GCC's non-null attribute does.</div>
<div style><br></div><div style>The way my suggestion might be implemented in LLVM: an attribute or metadata on the function parameters. Then we teach instcombine and loop passes to detect when this constraint on alignment can be used to promote the alignment of loads and stores. The loop passes can detect when unrolling has allowed the stride to exceed the alignment, and then promote the load and store alignment, etc.</div>
<div style><br></div><div style><br></div><div style>2) Use an invariant based design much as Alex suggested. The syntax for this could be exactly the same as the GCC builtin you mention, or the same as my syntax in #1. This representation is I think a strictly more generalized LLVM IR model from the others under discussion.</div>
<div style><br></div><div style>I have two completely off-the-cuff ideas about how to represent this. Others who've been thinking about it longer may have better ideas:</div><div style>a) Dead code to produce a CmpInst that the invariant asserts is true.</div>
<div style>%ptrtoint = ptrtoint double* %ptr to i64</div><div style>%lowbits = and i64 %ptrtoint, 0xF</div><div style>%invariant_cmp = icmp eq i64 %lowbits, 0</div><div style>call void @llvm.invariant(i1 %invariant_cmp)</div>
<div style><br></div><div style>b) Dead code to produce two values which we assert are equivalent. Then when trying to optimize one, we can leverage things known about the other (such as simplify demanded bits, etc).</div>
<div style>%ptrtoint = ptrtoint double* %ptr to i64</div><div>%maskedint = and i64 %ptrtoint, 0xFFFFFFFFFFFFFFF0</div><div style>%maskedptr = inttoptr i64 %maskedint to double* %ptr</div><div style>call void @llvm.equivalent(double* %ptr, double* %maskedptr)</div>
<div style><br></div><div style>(I realize (b) here isn't really implementable w/ our current intrinsic overloading, but you get the idea)</div><div style><br></div><div style>The difference is -- how do optimizations use them? With (b), the optimization can just query N equivalent values for any property. If it holds for any, it holds for all. With (a), we have to specifically interpret the predicate which is an invariant. While (a) seems like a bit of extra work, I actually prefer it because it seems more general, easier to understand, and to support fairly aggressive caching of invariants, etc.</div>
<div style><br></div><div><br></div><div style>3) Your proposal, but handled more like __builtin_expect. Have a very early pass that converts the intrinsic into metadata attached to a particular pointer SSA value. Then teach passes to preserve the metadata, and have your pass promote alignment based on it.</div>
<div style><br></div><div style><br></div><div style>Of these, my preferences in order are #2, #3, and #1. I like #2 because it would begin building a tremendously powerful and much needed piece of infrastructure in LLVM, and serve many needs in addition to alignment.</div>
<div style><br></div><div style>Thoughts? If you're interested in working on #2, I'd love to talk through the design....</div></div></div></div></div>