[llvm-dev] Aligned vector spills and variably sized stack frames

Fri Aug 28 17:08:28 PDT 2015

----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 28, 2015 7:03:24 PM
> Subject: Re: [llvm-dev] Aligned vector spills and variably sized stack frames
> 
> 
> 
> On 08/28/2015 04:29 PM, Hal Finkel wrote:
> > ----- Original Message -----
> >> From: "Philip Reames via llvm-dev" <llvm-dev at lists.llvm.org>
> >> To: "llvm-dev" <llvm-dev at lists.llvm.org>
> >> Sent: Friday, August 28, 2015 6:21:00 PM
> >> Subject: Re: [llvm-dev] Aligned vector spills and variably sized
> >> stack frames
> >>
> >> On 08/28/2015 04:00 PM, Philip Reames via llvm-dev wrote:
> >>> I've run into a problem that I'm trying to figure out how to
> >>> address
> >>> and would welcome ideas and feedback.
> >>>
> >>> Today, the vectorizer will nicely vectorize loops using the
> >>> widest
> >>> legal vector type for the target.  On a reasonably recent
> >>> machine,
> >>> this will often end up using AVX2 registers which are 32 bytes
> >>> wide.
> >>>
> >>> If, during register allocation, we decide to spill one of these
> >>> registers, we use the vmovaps instruction, which requires the
> >>> memory address it accesses to be 32 byte aligned.  So far, so good.
> >>>
> >>> However, the C ABI generally only provides 16 bytes of alignment
> >>> for
> >>> the stack on entry to the function.  To work around this, the
> >>> backend
> >>> will create a variable sized frame with a dynamic amount of
> >>> padding
> >>> inserted if required to ensure that a 32 byte aligned spill slot
> >>> is
> >>> available.
> >>>
> >>> The problem I have is that my runtime's ABI really doesn't like
> >>> variably sized frames.  In particular, the assumption that stack
> >>> frames are fixed size - except during prolog and epilogue - is
> >>> fairly
> >>> baked in.
> >>>
> >>> I'm weighing a couple of options for addressing this and want to
> >>> gather feedback on the perceived difficulty of each.  If someone
> >>> has
> >>> another approach, I'm also very open to that.
> >>>
> >>> Option 1 - Fix my runtime to not expect mostly fixed size frames.
> >>> This
> >>> isn't a small change to make, but given it's a strictly internal
> >>> ABI,
> >>> I can probably get away with doing it.  Given things like
> >>> shrink-wrapping are coming down the pipe, it might also have
> >>> secondary
> >>> benefits.  However, this is a relatively risky change to make for
> >>> what is a fairly rare corner case.
> >>>
> >>> Option 1a - I could change my ABI to use a 32 byte aligned frame.
> >>> This
> >>> has many of the same problems as (1).
> >>>
> >>> Option 2 - Don't compile things which need to spill vector
> >>> registers.
> >>> This is actually what we do today and has worked out fairly well
> >>> in
> >>> practice.  This is what I'm hoping to move away from.
> >>>
> >>> Option 3 - Add an option in the x86 backend to not require
> >>> aligned
> >>> spill slots for AVX2 registers.  In particular, the VMOVUPS
> >>> instruction can be used to spill vector registers into an 8 or 16
> >>> byte
> >>> aligned spill slot and not require dynamic frame realignment.
> >>> This
> >>> seems like it might be useful in other contexts as well, but I
> >>> can't
> >>> name any at the moment.
> >>>
> >>> One thing that occurs to me is that many spills are on rarely
> >>> taken paths.
> >>> Maybe it would make sense to only do dynamic alignment for hot
> >>> spill/reloads?  We could then simply override the heuristic to
> >>> always
> >>> use unaligned spills.
> >>>
> >>> I don't really have a sense for how hard (3) would be to
> >>> implement.
> >>> Anyone have an intuition?
> >> After sending this, I did another search and promptly discovered
> >> the
> >> existing "no-realign-stack" function attribute which seems to do
> >> exactly
> >> what I need.  Anyone know if this is robust?
> > I believe this works correctly, but is not a targeted fix for the
> > AVX spilling problem. ;) -- and I can certainly imagine such a
> > feature being generally desirable. Specifically, all overaligned
> > locals will simply fail to be overaligned (and, thus, the
> > resulting code will likely be broken). In your case, I can imagine
> > you can simply promise never to create such things, and you'll be
> > fine.
> To restate, you're saying that if I had a load or store with
> alignment
> greater than the native frame size, that using this option might
> cause
> that alignment not to be respected? 

No, what I'm saying is that if you were to create an alloca instruction with an alignment specified to be greater than the ABI stack alignment, and you use no-realign-stack to disable all stack realignment, then the resulting stack slot may simply not have the requested alignment.
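
As a rough illustration of the arithmetic involved, here is a minimal C++ sketch. It is not LLVM's frame-lowering code: the helper names (alignDown, realignPadding, isAligned) and the example addresses are hypothetical, chosen only to show why a stack pointer that satisfies a 16-byte ABI alignment may or may not satisfy a 32-byte request, and how much padding realignment would insert in each case.

```cpp
#include <cstdint>

// Round an address down to a power-of-two alignment. A realigning
// prologue does the equivalent to the stack pointer (the x86 stack
// grows toward lower addresses).
constexpr std::uintptr_t alignDown(std::uintptr_t addr, std::uintptr_t align) {
    return addr & ~(align - 1);
}

// Bytes of padding that realignment would insert for a given incoming
// stack pointer. This is the "dynamic" part: it depends on the caller.
constexpr std::uintptr_t realignPadding(std::uintptr_t sp, std::uintptr_t align) {
    return sp - alignDown(sp, align);
}

// True if `addr` satisfies the requested power-of-two alignment.
constexpr bool isAligned(std::uintptr_t addr, std::uintptr_t align) {
    return (addr & (align - 1)) == 0;
}

// Hypothetical incoming stack pointers, both 16-byte aligned.
static_assert(isAligned(0x7ffc1000, 16) && isAligned(0x7ffc1010, 16),
              "both satisfy a 16-byte ABI alignment");

// The first happens to be 32-byte aligned already, so no padding is needed.
static_assert(realignPadding(0x7ffc1000, 32) == 0, "already 32-byte aligned");

// The second is not. With realignment, 16 bytes of padding are inserted;
// with realignment disabled, a slot placed here keeps only 16-byte alignment.
static_assert(!isAligned(0x7ffc1010, 32), "not 32-byte aligned");
static_assert(realignPadding(0x7ffc1010, 32) == 16, "16 bytes of padding");
```

The second case is the one at issue: if realignment is turned off, nothing adjusts the address, so a slot that asked for 32-byte alignment may end up with only the ABI's 16.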

 -Hal

> That would work in practice, but
> I
> should probably solve this in a more principled way to avoid future
> pain.  However, given your comments and the existing attribute,
> implementing something along the lines of my option (3) above
> shouldn't
> be too hard.  I'll likely post a patch in that direction next week.
> 
> Thanks for the guidance.
> 
> Philip
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory