[cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Alex Rosenberg alexr at leftfield.org
Fri Nov 30 18:44:31 PST 2012


I'd love a more general assume mechanism that other optimizations could use; e.g., alignment would simply be an available (x & mask) expression for suitable passes to take advantage of.
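Something along these lines, perhaps (purely a sketch; __builtin_assume here is a hypothetical stand-in for whatever the frontend would expose):

#include <stdint.h>

void foo(double *a, int n) {
  /* Hypothetical general assumption: the low four bits of the pointer
     are zero, i.e. a is 16-byte aligned. An alignment-aware pass would
     pattern-match the (x & mask) == 0 predicate; other passes could
     consume other kinds of predicates through the same mechanism. */
  __builtin_assume(((uintptr_t)a & 15) == 0);
  for (int i = 0; i < n; ++i)
    a[i] += 1.0;
}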


On Nov 30, 2012, at 4:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> Hi everyone,
> 
> Many compilers provide a way, through either pragmas or intrinsics, for the user to assert stronger alignment requirements on a pointee than is otherwise implied by the pointer's type. gcc now provides an intrinsic for this purpose, __builtin_assume_aligned, and the attached patches (one for Clang and one for LLVM) implement that intrinsic in terms of a corresponding LLVM intrinsic and provide the infrastructure to take advantage of this new information.
> 
> ** BEGIN justification -- skip this if you don't care ;) **
> First, let me provide some justification. It is currently possible in Clang, using gcc-style (or C++11-style) attributes, to create typedefs with stronger alignment requirements than the original type. This is a useful feature, but it has shortcomings. First, for the purpose of allowing the compiler to create vectorized code with aligned loads and stores, these typedefs are awkward to use, and even more awkward to use correctly. For example, if I have as a base case:
> void foo(double *a, double *b) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> and I want to say that a and b are both 16-byte aligned, I can write instead:
> typedef double __attribute__((aligned(16))) double16;
> void foo(double16 *a, double16 *b) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> and this might work; the loads and stores will be tagged as 16-byte aligned, and we can vectorize the loop into, for example, a loop over <2 x double>. The problem is that the code is now incorrect: it implies that *all* of the loads and stores are 16-byte aligned, and this is not true. Only every other one is 16-byte aligned. It is possible to correct this problem by manually unrolling the loop by a factor of 2:
> void foo(double16 *a, double16 *b) {
>   for (int i = 0; i < N; i += 2) {
>     a[i] = b[i] + 1;
>     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
>   }
> }
> but this is awkward and error-prone.
> 
> With the intrinsic, this is easier:
> void foo(double *a, double *b) {
>   a = __builtin_assume_aligned(a, 16);
>   b = __builtin_assume_aligned(b, 16);
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> This code can be vectorized with aligned loads and stores, and even if it is not vectorized, it will remain correct.
> 
> The second problem with the purely type-based approach is that it requires manual loop unrolling and inlining. The intrinsic does not: because the intrinsics are evaluated after inlining (and after loop unrolling), the optimizer can use the alignment assumptions specified in the caller when generating code for an inlined callee. This is a very important capability.
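> For illustration, a hypothetical caller/callee pair (a sketch; the names are made up):
> 
> static inline void scale(double *v, int n) {
>   for (int i = 0; i < n; ++i)
>     v[i] *= 2.0;
> }
> 
> void caller(double *v, int n) {
>   v = __builtin_assume_aligned(v, 16);
>   /* Once scale() is inlined here, its loads and stores can pick up
>      the 16-byte alignment assumption made in this caller. */
>   scale(v, n);
> }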
> 
> The need to apply the alignment assumptions after inlining and loop unrolling necessitates placing most of the infrastructure for this into LLVM, with Clang only generating the LLVM intrinsics. In addition, to take full advantage of the information provided, it is necessary to look at loop-dependent pointer offsets and strides; ScalarEvolution provides the appropriate framework for doing this.
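> As a sketch of why that matters (assuming the usual 8-byte double):
> 
> void foo(double *a, int n) {
>   a = __builtin_assume_aligned(a, 16);
>   for (int i = 0; i < n; ++i)
>     /* a[2*i] is at byte offset 16*i from a 16-byte-aligned base, so
>        every access in this loop is 16-byte aligned; proving that
>        requires reasoning about the loop-dependent offset and stride,
>        which is exactly what ScalarEvolution provides. */
>     a[2*i] += 1.0;
> }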
> ** END justification **
> 
> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM intrinsic is:
> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
> which asserts that the address returned is offset bytes above an address with the specified alignment. The attached patch makes some simple changes to several analysis passes (like BasicAA and ScalarEvolution) to allow them to 'look through' the intrinsic. It also adds a transformation pass that propagates the alignment assumptions to loads and stores directly dependent on the intrinsic's return value. Once this is done, the intrinsics are removed so that they don't interfere with the remaining optimizations.
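> At the source level, the offset corresponds to the builtin's optional third argument; a sketch of its use (names made up):
> 
> double sum_odd(double *a, int n) {
>   /* Assert that a points 8 bytes past a 16-byte boundary (e.g. &base[1]
>      for a 16-byte-aligned base); a[1], a[3], ... are then themselves
>      16-byte aligned. */
>   a = __builtin_assume_aligned(a, 16, 8);
>   double s = 0.0;
>   for (int i = 1; i < n; i += 2)
>     s += a[i];
>   return s;
> }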
> 
> The patches are attached. I've also uploaded these to llvm-reviews (this is my first time trying this, so please let me know if I should do something differently):
> Clang - http://llvm-reviews.chandlerc.com/D149
> LLVM - http://llvm-reviews.chandlerc.com/D150
> 
> Please review.
> 
> Nadav, one shortcoming of the current patch is that, while it will work to vectorize loops using unroll+bb-vectorize, it will not automatically work with the loop vectorizer. To be really effective, the transformation pass needs to run after loop unrolling, and loop unrolling is (and should be) run after loop vectorization. Even if run prior to loop vectorization, the pass would not directly help the loop vectorizer, because the necessary strided loads and stores don't yet exist at that point. As a second step, I think we should split the current transformation pass into a transformation pass and an analysis pass. The analysis pass could then be used by the loop vectorizer (and any other early passes that want the information) before the final rewriting and intrinsic deletion is done.
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
> <asal-clang-20121130.patch>
> <asal-llvm-20121130.patch>



