[cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned for Clang and LLVM

Hal Finkel hfinkel at anl.gov
Sat Dec 1 17:24:40 PST 2012


----- Original Message -----
> From: "Erik Schwiebert" <eriksc at microsoft.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm cfe" <cfe-commits at cs.uiuc.edu>
> Sent: Saturday, December 1, 2012 6:40:38 PM
> Subject: Re: [cfe-commits] [llvm-commits] [PATCH] __builtin_assume_aligned	for Clang and LLVM
> 
> I'm not very familiar with gcc's practices; I know that
> clang currently allows using the 'aligned' attribute to weaken the
> alignment of the base type of a pointer:
> 
> 	typedef __attribute__((aligned(1))) double weakly_aligned_double;
> 	weakly_aligned_double *wd;
> 	*wd = 1.0; // will not use asm instructions that require
> 	// 8-byte alignment
> 
> Microsoft is (sadly, in my personal opinion) using this to support legacy
> file formats where data is packed very tightly and the code written
> for the era assumes it can weaken pointers safely. It's probably a
> good thing to not add new ways to violate the language specs. :) I
> would definitely prefer a warning rather than silently doing nothing
> if the code requests an alignment weaker than the native type
> requires.
> 
> It would be good to check against alignment of structures being
> passed in as well.
> 
> 	typedef struct
> 	{
> 	    double d;
> 	    int i;
> 	} MyStruct;
> 
> 	MyStruct *m;
> 	m = __builtin_assume_aligned(m, 4); // reject because the struct
> 	// needs to be 8-byte aligned for the internal double

I'm not sure whether or not this should be an error. One could "get around" the error by casting the returned pointer to a different type with weaker alignment. I agree that it should at least be a warning.
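For illustration, the cast-based workaround mentioned above might look like the following minimal sketch (the typedef and function names are invented; note that some compilers ignore alignment-weakening attributes on typedefs):

```c
#include <assert.h>

typedef struct {
    double d;
    int i;
} MyStruct;

/* A weakly-aligned view of the same layout; casting through it
   would sidestep a hypothetical "alignment weaker than the type
   requires" error on __builtin_assume_aligned. */
typedef __attribute__((aligned(4))) MyStruct WeakMyStruct;

MyStruct *assume_4_aligned(MyStruct *m) {
    WeakMyStruct *w = (WeakMyStruct *)m;
    /* The builtin returns its pointer argument unchanged. */
    w = __builtin_assume_aligned(w, 4);
    return (MyStruct *)w;
}
```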

Thanks again,
Hal 

> 
> Thanks,
> Schwieb
> 
> On Nov 30, 2012, at 9:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > ----- Original Message -----
> >> From: "Erik Schwiebert" <eriksc at microsoft.com>
> >> To: "Alex Rosenberg" <alexr at leftfield.org>, "Hal Finkel"
> >> <hfinkel at anl.gov>
> >> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>, "llvm cfe"
> >> <cfe-commits at cs.uiuc.edu>
> >> Sent: Friday, November 30, 2012 11:16:18 PM
> >> Subject: Re: [cfe-commits] [llvm-commits] [PATCH]
> >> __builtin_assume_aligned	for Clang and LLVM
> >> 
> >> What happens if someone uses the intrinsic to *weaken* the
> >> alignment
> >> of the given type?  i.e.,
> >> 
> >> 	double *d;
> >> 	d = __builtin_assume_aligned(d, 1);
> >> 
> >> C99 6.3.2.3p7 and C++11 [expr.reinterpret.cast]p17 say it is
> >> undefined (an issue with which I am all too familiar these
> >> days).  Should the compiler reject the code or allow it?  I note
> >> that the compiler allows __attribute__((aligned(1))), so this
> >> should
> >> probably also be allowed but anyone using it must proceed with
> >> caution.
> > 
> > The code added to LLVM will ignore this assumption. I did this in
> > order to follow (what I believe to be) gcc's convention: alignment
> > attributes and intrinsics can only strengthen alignment
> > requirements while 'packing' attributes can weaken them. Do you
> > think that this is the right approach for us to follow? In your
> > example, Clang could certainly issue a warning. That is probably a
> > good thing for it to do.
> > 
> > I forgot to mention this in the documentation; thanks for pointing
> > this out.
> > 
> > -Hal
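For illustration, the convention described above (strengthening is honored, weakening is accepted but ignored) can be sketched as follows; a minimal example, not taken from the patch:

```c
#include <assert.h>

double first_element(double *d) {
    /* Strengthening: tells the optimizer d is 16-byte aligned. */
    double *strong = __builtin_assume_aligned(d, 16);

    /* "Weakening": accepted syntactically, but ignored by the
       optimizer -- this intrinsic can only add alignment
       assumptions, never remove them. */
    double *weak = __builtin_assume_aligned(d, 1);

    /* Both calls return the pointer they were given. */
    return strong[0] + weak[0] - d[0];
}
```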
> > 
> >> 
> >> Schwieb
> >> 
> >> On Nov 30, 2012, at 6:44 PM, Alex Rosenberg <alexr at leftfield.org>
> >> wrote:
> >> 
> >>> I'd love a more general assume mechanism that other optimizations
> >>> could use. e.g. alignment would simply be an available (x & mask)
> >>> expression for the suitable passes to take advantage of.
> >>> 
> >>> Sent from my iPad
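Alex's more general assume mechanism can be approximated with existing builtins; a hedged sketch using __builtin_unreachable (the ASSUME macro is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* A general "assume": if the condition were false, behavior is
   undefined, so the optimizer may treat it as always true. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

double load_first(double *p) {
    /* Alignment expressed as a mask test, as suggested: the low
       4 bits of a 16-byte-aligned address are zero. */
    ASSUME(((uintptr_t)p & 15) == 0);
    return p[0];
}
```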
> >>> 
> >>> On Nov 30, 2012, at 4:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >>> 
> >>>> Hi everyone,
> >>>> 
> >>>> Many compilers provide a way, through either pragmas or
> >>>> intrinsics, for the user to assert stronger alignment
> >>>> requirements on a pointee than is otherwise implied by the
> >>>> pointer's type. gcc now provides an intrinsic for this purpose,
> >>>> __builtin_assume_aligned, and the attached patches (one for
> >>>> Clang
> >>>> and one for LLVM) implement that intrinsic using a corresponding
> >>>> LLVM intrinsic, and provide an infrastructure to take advantage
> >>>> of this new information.
> >>>> 
> >>>> ** BEGIN justification -- skip this if you don't care ;) **
> >>>> First, let me provide some justification. It is currently
> >>>> possible
> >>>> in Clang, using gcc-style (or C++11-style) attributes, to create
> >>>> typedefs with stronger alignment requirements than the original
> >>>> type. This is a useful feature, but it has shortcomings. First,
> >>>> for the purpose of allowing the compiler to create vectorized
> >>>> code with aligned loads and stores, they are awkward to use, and
> >>>> even more awkward to use correctly. For example, if I have as a
> >>>> base case:
> >>>> void foo(double *a, double *b) {
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>> and I want to say that a and b are both 16-byte aligned, I can
> >>>> write instead:
> >>>> typedef double __attribute__((aligned(16))) double16;
> >>>> void foo(double16 *a, double16 *b) {
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>> and this might work; the loads and stores will be tagged as
> >>>> 16-byte aligned, and we can vectorize the loop into, for
> >>>> example,
> >>>> a loop over <2 x double>. The problem is that the code is now
> >>>> incorrect: it implies that *all* of the loads and stores are
> >>>> 16-byte aligned, and this is not true. Only every other one is
> >>>> 16-byte aligned. It is possible to correct this problem by
> >>>> manually unrolling the loop by a factor of 2:
> >>>> void foo(double16 *a, double16 *b) {
> >>>>   for (int i = 0; i < N; i += 2) {
> >>>>     a[i] = b[i] + 1;
> >>>>     ((double *) a)[i+1] = ((double *) b)[i+1] + 1;
> >>>>   }
> >>>> }
> >>>> but this is awkward and error-prone.
> >>>> 
> >>>> With the intrinsic, this is easier:
> >>>> void foo(double *a, double *b) {
> >>>>   a = __builtin_assume_aligned(a, 16);
> >>>>   b = __builtin_assume_aligned(b, 16);
> >>>>   for (int i = 0; i < N; ++i)
> >>>>     a[i] = b[i] + 1;
> >>>> }
> >>>> this code can be vectorized with aligned loads and stores, and
> >>>> even if it is not vectorized, will remain correct.
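A self-contained variant of the intrinsic-based example above, with N made a parameter (a sketch for experimentation, not the patch's test case):

```c
#include <assert.h>

/* Adds 1 to each element of b, storing into a; both pointers are
   asserted to be 16-byte aligned, enabling aligned vector loads
   and stores without changing the code's correctness. */
void add_one(double *a, double *b, int n) {
    a = __builtin_assume_aligned(a, 16);
    b = __builtin_assume_aligned(b, 16);
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + 1;
}
```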
> >>>> 
> >>>> The second problem with the purely type-based approach is that
> >>>> it requires manual loop unrolling and inlining. Because the
> >>>> intrinsics are evaluated after inlining (and after loop
> >>>> unrolling), the optimizer can use the alignment assumptions
> >>>> specified in the caller when generating code for an inlined
> >>>> callee. This is a very important capability.
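The inlining point can be illustrated with a small sketch (function names invented): the assumption stated in the caller is still visible after the callee is inlined, so the optimizer can use it when vectorizing the callee's loop.

```c
#include <assert.h>

static inline double sum(const double *p, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}

double sum_aligned(const double *p, int n) {
    /* After 'sum' is inlined here, the optimizer can apply the
       16-byte alignment assumption to its loop. */
    p = __builtin_assume_aligned(p, 16);
    return sum(p, n);
}
```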
> >>>> 
> >>>> The need to apply the alignment assumptions after inlining and
> >>>> loop unrolling necessitates placing most of the infrastructure
> >>>> for this into LLVM, with Clang only generating LLVM intrinsics.
> >>>> In addition, to take full advantage of the information provided,
> >>>> it is necessary to look at loop-dependent pointer offsets and
> >>>> is necessary to look at loop-dependent pointer offsets and
> >>>> strides; ScalarEvolution provides the appropriate framework for
> >>>> doing this.
> >>>> ** END justification **
> >>>> 
> >>>> Mirroring the gcc (and now Clang) intrinsic, the corresponding
> >>>> LLVM intrinsic is:
> >>>> <t1>* @llvm.assume.aligned.p<s><t1>.<t2>(<t1>* addr, i32
> >>>> alignment, <int t2> offset)
> >>>> which asserts that the address returned is offset bytes above an
> >>>> address with the specified alignment. The attached patch makes
> >>>> some simple changes to several analysis passes (like BasicAA and
> >>>> SE) to allow them to 'look through' the intrinsic. It also adds
> >>>> a
> >>>> transformation pass that propagates the alignment assumptions to
> >>>> loads and stores directly dependent on the intrinsic's return
> >>>> value. Once this is done, the intrinsics are removed so that
> >>>> they
> >>>> don't interfere with the remaining optimizations.
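The offset operand mirrors gcc's three-argument form of the source-level builtin; a minimal sketch of how the offset is used (assuming the pointer really is 8 bytes past a 16-byte-aligned address):

```c
#include <assert.h>

double load_offset(double *p) {
    /* Asserts that p is 8 bytes past a 16-byte-aligned address,
       i.e. ((uintptr_t)p - 8) is a multiple of 16. */
    p = __builtin_assume_aligned(p, 16, 8);
    return p[0];
}
```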
> >>>> 
> >>>> The patches are attached. I've also uploaded these to
> >>>> llvm-reviews
> >>>> (this is my first time trying this, so please let me know if I
> >>>> should do something differently):
> >>>> Clang - http://llvm-reviews.chandlerc.com/D149
> >>>> LLVM - http://llvm-reviews.chandlerc.com/D150
> >>>> 
> >>>> Please review.
> >>>> 
> >>>> Nadav, one shortcoming of the current patch is that, while it
> >>>> will
> >>>> work to vectorize loops using unroll+bb-vectorize, it will not
> >>>> automatically work with the loop vectorizer. To really be
> >>>> effective, the transformation pass needs to run after loop
> >>>> unrolling; and loop unrolling is (and should be) run after loop
> >>>> vectorization. Even if run prior to loop vectorization, it would
> >>>> not directly help the loop vectorizer because the necessary
> >>>> strided loads and stores don't yet exist. As a second step, I
> >>>> think we should split the current transformation pass into a
> >>>> transformation pass and an analysis pass. This analysis pass can
> >>>> then be used by the loop vectorizer (and any other early passes
> >>>> that want the information) before the final rewriting and
> >>>> intrinsic deletion is done.

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


