[cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Hal Finkel hfinkel at anl.gov
Fri Nov 30 21:41:56 PST 2012


----- Original Message -----
> From: "Erik Schwiebert" <eriksc at microsoft.com>
> To: "Alex Rosenberg" <alexr at leftfield.org>, "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "llvm cfe" <cfe-commits at cs.uiuc.edu>
> Sent: Friday, November 30, 2012 11:16:18 PM
> Subject: Re: [cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM
> 
> What happens if someone uses the intrinsic to *weaken* the alignment
> of the given type?  i.e.,
> 
> 	double *d;
> 	d = __builtin_assume_aligned(d, 1);
> 
> C99 6.3.2.3p7 and C++11 [expr.reinterpret.cast]p17 say it is
> undefined (an issue with which I am all too familiar these
> days).  Should the compiler reject the code or allow it?  I note
> that the compiler allows __attribute__((aligned(1))), so this should
> probably also be allowed, but anyone using it must proceed with
> caution.

The code added to LLVM ignores this assumption. I did this to follow (what I believe to be) gcc's convention: alignment attributes and intrinsics can only strengthen alignment requirements, while 'packing' attributes can weaken them. Do you think this is the right approach for us to follow? In your example, Clang could certainly issue a warning; that is probably a good thing for it to do.
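
For illustration, a minimal sketch of the two directions under that convention (plain gcc/Clang attribute syntax; the packed struct is my own example, not something from the patch):

	/* aligned() only strengthens: double16 is 16-byte aligned. */
	typedef double __attribute__((aligned(16))) double16;

	/* Weakening is expressed through packing instead: d here may
	   fall on any byte boundary. */
	struct __attribute__((packed)) byte_packed { double d; };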

I forgot to mention this in the documentation; thanks for pointing this out.

 -Hal

> 
> Schwieb
> 
> On Nov 30, 2012, at 6:44 PM, Alex Rosenberg <alexr at leftfield.org>
> wrote:
> 
> > I'd love a more general assume mechanism that other optimizations
> > could use; e.g., alignment would simply be an available (x & mask)
> > expression for the suitable passes to take advantage of.
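> > 
> > For instance (hypothetical syntax, just to illustrate; these
> > patches do not add a general builtin along these lines), the
> > alignment case would become one assumed predicate among many:
> > 
> > #include <stdint.h>
> > 
> > void foo(double *a) {
> >   /* hypothetical general-purpose assume; an alignment-aware pass
> >      would pattern-match the (x & mask) == 0 form: */
> >   __builtin_assume(((uintptr_t)a & 15) == 0);
> >   /* ... */
> > }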
> > 
> > Sent from my iPad
> > 
> > On Nov 30, 2012, at 4:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > 
> >> Hi everyone,
> >> 
> >> Many compilers provide a way, through either pragmas or
> >> intrinsics, for the user to assert stronger alignment
> >> requirements on a pointee than is otherwise implied by the
> >> pointer's type. gcc now provides an intrinsic for this purpose,
> >> __builtin_assume_aligned, and the attached patches (one for Clang
> >> and one for LLVM) implement that intrinsic using a corresponding
> >> LLVM intrinsic, and provide an infrastructure to take advantage
> >> of this new information.
> >> 
> >> ** BEGIN justification -- skip this if you don't care ;) **
> >> First, let me provide some justification. It is currently possible
> >> in Clang, using gcc-style (or C++11-style) attributes, to create
> >> typedefs with stronger alignment requirements than the original
> >> type. This is a useful feature, but it has shortcomings. For the
> >> purpose of allowing the compiler to create vectorized code with
> >> aligned loads and stores, such typedefs are awkward to use, and
> >> even more awkward to use correctly. For example, if I have as a
> >> base case:
> >> void foo(double *a, double *b, int N) {
> >>   for (int i = 0; i < N; ++i)
> >>     a[i] = b[i] + 1;
> >> }
> >> and I want to say that a and b are both 16-byte aligned, I can
> >> write instead:
> >> typedef double __attribute__((aligned(16))) double16;
> >> void foo(double16 *a, double16 *b, int N) {
> >>   for (int i = 0; i < N; ++i)
> >>     a[i] = b[i] + 1;
> >> }
> >> and this might work; the loads and stores will be tagged as
> >> 16-byte aligned, and we can vectorize the loop into, for example,
> >> a loop over <2 x double>. The problem is that the code is now
> >> incorrect: it implies that *all* of the loads and stores are
> >> 16-byte aligned, and this is not true; only every other one is
> >> 16-byte aligned. It is possible to correct this problem by
> >> manually unrolling the loop by a factor of 2:
> >> void foo(double16 *a, double16 *b, int N) {
> >>   for (int i = 0; i < N; i += 2) {
> >>     a[i] = b[i] + 1;
> >>     /* the odd elements are only 8-byte aligned, so access them
> >>        through plain double pointers: */
> >>     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
> >>   }
> >> }
> >> but this is awkward and error-prone.
> >> 
> >> With the intrinsic, this is easier:
> >> void foo(double *a, double *b, int N) {
> >>   a = __builtin_assume_aligned(a, 16);
> >>   b = __builtin_assume_aligned(b, 16);
> >>   for (int i = 0; i < N; ++i)
> >>     a[i] = b[i] + 1;
> >> }
> >> This code can be vectorized with aligned loads and stores, and
> >> even if it is not vectorized, it will remain correct.
> >> 
> >> The second problem with the purely type-based approach is that it
> >> requires manual loop unrolling and inlining. The intrinsic does
> >> not: because the intrinsics are evaluated after inlining (and
> >> after loop unrolling), the optimizer can use the alignment
> >> assumptions specified in the caller when generating code for an
> >> inlined callee. This is a very important capability.
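> >> 
> >> For example (a minimal sketch; scale() is just an illustrative
> >> helper, not something from the patches):
> >> 
> >> static void scale(double *p, int n) {
> >>   for (int i = 0; i < n; ++i)
> >>     p[i] *= 2.0;
> >> }
> >> 
> >> void caller(double *p, int n) {
> >>   p = __builtin_assume_aligned(p, 16);
> >>   /* once scale() is inlined here, the loads and stores in its
> >>      loop can be tagged as 16-byte aligned, even though scale()
> >>      itself says nothing about alignment: */
> >>   scale(p, n);
> >> }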
> >> 
> >> The need to apply the alignment assumptions after inlining and
> >> loop unrolling necessitates placing most of the infrastructure
> >> for this into LLVM, with Clang only generating the LLVM
> >> intrinsics. In addition, to take full advantage of the
> >> information provided, it is necessary to look at loop-dependent
> >> pointer offsets and strides; ScalarEvolution provides the
> >> appropriate framework for doing this.
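> >> 
> >> To see why the strides and offsets matter, consider (again just
> >> an illustrative sketch):
> >> 
> >> void foo(double *a, int n) {
> >>   a = __builtin_assume_aligned(a, 16);
> >>   for (int i = 0; i < n; i += 2) {
> >>     a[i]   = 0.0; /* byte offset 16*k from a: 16-byte aligned */
> >>     a[i+1] = 1.0; /* byte offset 16*k + 8: only 8-byte aligned */
> >>   }
> >> }
> >> 
> >> Proving the first store 16-byte aligned, but not the second,
> >> requires reasoning about the loop's induction variable, and that
> >> is exactly what ScalarEvolution is for.
> >> 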
> >> ** END justification **
> >> 
> >> Mirroring the gcc (and now Clang) intrinsic, the corresponding
> >> LLVM intrinsic is:
> >> 
> >>   <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr,
> >>                                            i32 alignment,
> >>                                            <int t2> offset)
> >> 
> >> which asserts that the address returned is offset bytes above an
> >> address with the specified alignment. The attached patch makes
> >> some simple changes to several analysis passes (like BasicAA and
> >> SE) to allow them to 'look through' the intrinsic. It also adds a
> >> transformation pass that propagates the alignment assumptions to
> >> loads and stores directly dependent on the intrinsic's return
> >> value. Once this is done, the intrinsics are removed so that they
> >> don't interfere with the remaining optimizations.
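> >> 
> >> For example, the offset argument lets the caller describe a
> >> pointer at a known distance past an aligned address (a sketch;
> >> here p is assumed to point at element 1 of a 16-byte-aligned
> >> double array):
> >> 
> >> void bar(double *p) {
> >>   /* asserts that p - 8 bytes is 16-byte aligned, i.e. p itself
> >>      is 8 bytes above a 16-byte boundary: */
> >>   double *b = __builtin_assume_aligned(p, 16, 8);
> >>   b[1] = 0.0; /* b + 8 bytes lands back on a 16-byte boundary */
> >> }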
> >> 
> >> The patches are attached. I've also uploaded these to llvm-reviews
> >> (this is my first time trying this, so please let me know if I
> >> should do something differently):
> >> Clang - http://llvm-reviews.chandlerc.com/D149
> >> LLVM - http://llvm-reviews.chandlerc.com/D150
> >> 
> >> Please review.
> >> 
> >> Nadav, one shortcoming of the current patch is that, while it will
> >> work to vectorize loops using unroll+bb-vectorize, it will not
> >> automatically work with the loop vectorizer. To really be
> >> effective, the transformation pass needs to run after loop
> >> unrolling, and loop unrolling is (and should be) run after loop
> >> vectorization. Even if run prior to loop vectorization, it would
> >> not directly help the loop vectorizer because the necessary
> >> strided loads and stores don't yet exist. As a second step, I
> >> think we should split the current transformation pass into a
> >> transformation pass and an analysis pass. This analysis pass can
> >> then be used by the loop vectorizer (and any other early passes
> >> that want the information) before the final rewriting and
> >> intrinsic deletion is done.
> >> 
> >> Thanks again,
> >> Hal
> >> 
> >> --
> >> Hal Finkel
> >> Postdoctoral Appointee
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >> <asal-clang-20121130.patch>
> >> <asal-llvm-20121130.patch>

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory