[llvm-dev] Aligned vector spills and variably sized stack frames

Fri Aug 28 17:03:24 PDT 2015

On 08/28/2015 04:29 PM, Hal Finkel wrote:
> ----- Original Message -----
>> From: "Philip Reames via llvm-dev" <llvm-dev at lists.llvm.org>
>> To: "llvm-dev" <llvm-dev at lists.llvm.org>
>> Sent: Friday, August 28, 2015 6:21:00 PM
>> Subject: Re: [llvm-dev] Aligned vector spills and variably sized stack frames
>>
>> On 08/28/2015 04:00 PM, Philip Reames via llvm-dev wrote:
>>> I've run into a problem that I'm trying to figure out how to
>>> address
>>> and would welcome ideas and feedback.
>>>
>>> Today, the vectorizer will nicely vectorize loops using the widest
>>> legal vector type for the target.  On a reasonably recent machine,
>>> this will often end up using AVX2 registers which are 32 bytes
>>> wide.
>>>
>>> If during register allocation, we decide to spill one of these
>>> registers, we use the vmovaps instruction which requires the
>>> address
>>> in memory accessed to be 32 byte aligned.  So far, so good.
>>>
>>> However, the C ABI generally only provides 16 bytes of alignment
>>> for
>>> the stack on entry to the function.  To work around this, the
>>> backend
>>> will create a variable sized frame with a dynamic amount of padding
>>> inserted if required to ensure that a 32 byte aligned spill slot is
>>> available.
>>>
>>> The problem I have is that my runtime's ABI really doesn't like
>>> variably sized frames.  In particular, the assumption that stack
>>> frames are fixed size - except during prolog and epilogue - is
>>> fairly
>>> baked in.
>>>
>>> I'm weighing a couple of options for addressing this and want to
>>> gather feedback on the perceived difficulty of each.  If someone
>>> has
>>> another approach, I'm also very open to that.
>>>
>>> Option 1 - Fix my runtime to not expect mostly fixed size frames.
>>> This
>>> isn't a small change to make, but given it's a strictly internal
>>> ABI,
>>> I can probably get away with doing it.  Given things like
>>> shrink-wrapping are coming down the pipe, it might also have
>>> secondary
>>> benefits.  However, this is a relatively risky change to make for
>>> what is a fairly narrow corner case.
>>>
>>> Option 1a - I could change my ABI to use a 32 byte aligned frame.
>>> This
>>> has many of the same problems as (1).
>>>
>>> Option 2 - Don't compile things which need to spill vector
>>> registers.
>>> This is actually what we do today and has worked out fairly well in
>>> practice.  This is what I'm hoping to move away from.
>>>
>>> Option 3 - Add an option in the x86 backend to not require aligned
>>> spill slots for AVX2 registers.  In particular, the VMOVUPS
>>> instruction can be used to spill vector registers into an 8 or 16
>>> byte
>>> aligned spill slot and not require dynamic frame realignment. This
>>> seems like it might be useful in other contexts as well, but I can't
>>> name any at the moment.
>>>
>>> One thing that occurs to me is that many spills are down rare
>>> paths.
>>> Maybe it would make sense to only do dynamic alignment for hot
>>> spill/reloads?  We could then simply override the heuristic to always
>>> use unaligned spills.
>>>
>>> I don't really have a sense for how hard (3) would be to implement.
>>> Anyone have an intuition?
>> After sending this, I did another search and promptly discovered the
>> existing "no-realign-stack" function attribute which seems to do
>> exactly
>> what I need.  Anyone know if this is robust?
> I believe this works correctly, but is not a targeted fix for the AVX spilling problem. ;) -- and I can certainly imagine such a feature being generally desirable. Specifically, all overaligned locals will simply fail to be overaligned (and, thus, the resulting code will likely be broken). In your case, I can imagine you can simply promise never to create such things, and you'll be fine.
To restate, you're saying that if I had a load or store with alignment 
greater than the native stack alignment, using this option might cause 
that alignment not to be respected?  That would work in practice, but I 
should probably solve this in a more principled way to avoid future 
pain.  However, given your comments and the existing attribute, 
implementing something along the lines of my option (3) above shouldn't 
be too hard.  I'll likely post a patch in that direction next week.
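
As a rough illustration of why realignment makes the frame size variable, 
here is a minimal C sketch of the padding arithmetic.  The function names 
are hypothetical and this is not LLVM code; it also ignores the return 
address and any saved registers, and only models rounding a stack pointer 
down to a power-of-two boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack pointer down to a power-of-two alignment.
   The stack grows toward lower addresses on x86. */
static uintptr_t align_down(uintptr_t sp, uintptr_t align) {
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two only */
    return sp & ~(align - 1);
}

/* Bytes of padding that rounding down inserts for a given incoming sp.
   (Hypothetical helper, for illustration only.) */
static uintptr_t realign_padding(uintptr_t sp, uintptr_t align) {
    return sp - align_down(sp, align);
}
```

With an incoming stack pointer that is only guaranteed to be 16 byte 
aligned, realign_padding(sp, 32) is either 0 or 16 depending on the 
caller, so the frame size is not known at compile time.  Spilling with an 
unaligned store needs no such padding, which is why the frame can stay 
fixed size.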

Thanks for the guidance.

Philip