[cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Hal Finkel hfinkel at anl.gov
Fri Nov 30 18:35:52 PST 2012


----- Original Message -----
> From: "Shuxin Yang" <shuxin.llvm at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "llvm cfe" <cfe-commits at cs.uiuc.edu>
> Sent: Friday, November 30, 2012 8:26:41 PM
> Subject: Re: [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM
> 
> 
> There is a 3rd option :-)
> 
> typedef double __attribute__((aligned(16))) double16;
> void foo (double16 *a, double16 *b, int N) {
>   double *p = (double *)a;
>   double *q = (double *)b;
>   int i;
>   for (i = 0; i < N; ++i)
>     p[i] = q[i] + 1;
> }

No, this is not an option ;) -- If you run this through clang, you'll see that both the load and store in the loop have 8-byte alignment. The fact that the parameters were declared as pointers to 16-byte-aligned doubles is lost by the time we get to the IR.

 -Hal

> 
> On 11/30/2012 04:14 PM, Hal Finkel wrote:
> 
> 
> Hi everyone,
> 
> Many compilers provide a way, through either pragmas or intrinsics,
> for the user to assert stronger alignment requirements on a pointee
> than is otherwise implied by the pointer's type. gcc now provides an
> intrinsic for this purpose, __builtin_assume_aligned, and the
> attached patches (one for Clang and one for LLVM) implement that
> intrinsic using a corresponding LLVM intrinsic, and provide an
> infrastructure to take advantage of this new information.
> 
> ** BEGIN justification -- skip this if you don't care ;) **
> First, let me provide some justification. It is currently possible in
> Clang, using gcc-style (or C++11-style) attributes, to create
> typedefs with stronger alignment requirements than the original
> type. This is a useful feature, but it has shortcomings. First, for
> the purpose of allowing the compiler to create vectorized code with
> aligned loads and stores, they are awkward to use, and even
> more awkward to use correctly. For example, if I have as a base
> case:
> void foo (double *a, double *b, int N) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> and I want to say that a and b are both 16-byte aligned, I can write
> instead:
> typedef double __attribute__((aligned(16))) double16;
> void foo (double16 *a, double16 *b, int N) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> and this might work; the loads and stores will be tagged as 16-byte
> aligned, and we can vectorize the loop into, for example, a loop
> over <2 x double>. The problem is that the code is now incorrect: it
> implies that *all* of the loads and stores are 16-byte aligned, and
> this is not true; only every other one is 16-byte aligned. It is
> possible to correct this problem by manually unrolling the loop by a
> factor of 2:
> void foo (double16 *a, double16 *b, int N) {
>   for (int i = 0; i < N; i += 2) {
>     a[i] = b[i] + 1;
>     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
>   }
> }
> but this is awkward and error-prone.
> 
> With the intrinsic, this is easier:
> void foo (double *a, double *b, int N) {
>   a = __builtin_assume_aligned(a, 16);
>   b = __builtin_assume_aligned(b, 16);
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> this code can be vectorized with aligned loads and stores, and even
> if it is not vectorized, will remain correct.
> 
> The second problem with the purely type-based approach is that it
> requires manual loop unrolling and inlining. Because the intrinsics
> are evaluated after inlining (and after loop unrolling), the
> optimizer can use the alignment assumptions specified in the caller
> when generating code for an inlined callee. This is a very important
> capability.
> 
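> For example (a sketch with made-up names, assuming nothing beyond the
> __builtin_assume_aligned builtin from the Clang patch): the assumptions
> below appear only in the caller, but once add1 is inlined, its loads and
> stores can pick them up:
> static void add1 (double *a, double *b, int N) {
>   for (int i = 0; i < N; ++i)
>     a[i] = b[i] + 1;
> }
> void caller (double *a, double *b, int N) {
>   a = __builtin_assume_aligned(a, 16);
>   b = __builtin_assume_aligned(b, 16);
>   add1(a, b, N); /* after inlining, the 16-byte assumptions cover this loop */
> }
> 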
> The need to apply the alignment assumptions after inlining and loop
> unrolling necessitates placing most of the infrastructure for this
> in LLVM, with Clang only generating the LLVM intrinsics. In addition,
> to take full advantage of the information provided, it is necessary
> to look at loop-dependent pointer offsets and strides;
> ScalarEvolution provides the appropriate framework for doing this.
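> 
> A hypothetical case of the kind of loop-dependent reasoning meant here
> (a sketch, not code from the patch):
> void zero_even (double *a, int N) {
>   a = __builtin_assume_aligned(a, 16);
>   for (int i = 0; i < N; ++i)
>     a[2*i] = 0.0; /* offset 16*i bytes from a, so 16-byte aligned if a is */
> }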
> ** END justification **
> 
> Mirroring the gcc (and now Clang) intrinsic, the corresponding LLVM
> intrinsic is:
> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32 alignment, <int t2> offset)
> which asserts that the address returned is offset bytes above an
> address with the specified alignment. The attached patch makes some
> simple changes to several analysis passes (like BasicAA and SE) to
> allow them to 'look through' the intrinsic. It also adds a
> transformation pass that propagates the alignment assumptions to
> loads and stores directly dependent on the intrinsic's return
> value. Once this is done, the intrinsics are removed so that they
> don't interfere with the remaining optimizations.
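> 
> For reference, the gcc builtin being mirrored also takes the offset as an
> optional third argument; a minimal sketch of that form (the function name
> is illustrative only):
> /* Asserts that (char *)p - 8 is 16-byte aligned, i.e., p points 8
>    bytes past a 16-byte boundary. */
> void bump (double *p, int N) {
>   p = __builtin_assume_aligned(p, 16, 8);
>   for (int i = 0; i < N; ++i)
>     p[i] += 1.0;
> }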
> 
> The patches are attached. I've also uploaded these to llvm-reviews
> (this is my first time trying this, so please let me know if I
> should do something differently):
> Clang - http://llvm-reviews.chandlerc.com/D149
> LLVM - http://llvm-reviews.chandlerc.com/D150
> Please review.
> 
> Nadav, one shortcoming of the current patch is that, while it will
> work to vectorize loops using unroll+bb-vectorize, it will not
> automatically work with the loop vectorizer. To really be effective,
> the transformation pass needs to run after loop unrolling, and loop
> unrolling is (and should be) run after loop vectorization. Even if
> run prior to loop vectorization, it would not directly help the loop
> vectorizer because the necessary strided loads and stores don't yet
> exist. As a second step, I think we should split the current
> transformation pass into a transformation pass and an analysis pass.
> This analysis pass can then be used by the loop vectorizer (and any
> other early passes that want the information) before the final
> rewriting and intrinsic deletion is done.
> 
> Thanks again,
> Hal
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


