[PATCH] D61437: [AArch64] Static (de)allocation of SVE stack objects.

Fri May 10 10:06:25 PDT 2019

efriedma added a comment.

In D61437#1497980 <https://reviews.llvm.org/D61437#1497980>, @sdesmalen wrote:

> Thanks for your suggestions @efriedma!
>
> Just to double check, your suggested layout has the frame-record *after* the callee-saves. The current layout however, puts the frame-record above the callee-saves. Are you suggesting to change that?

Yes, I'm suggesting to rearrange them, to make the fp more useful for accessing SVE spills.

>> I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.
> 
> The current layout for our HPC compiler was a trade-off between getting an efficient implementation for SVE spills/fills on one hand, while keeping in mind a way to limit our downstream debt on the other hand. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured this simplified the code and reduced the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots (with exception of stack arguments).

If the SVE spill area is below the CSRs, you can leverage the existing checks to handle stack realignment, so I don't think it's that complicated to implement.  But maybe your approach requires changing fewer places.

> I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering: 
> 
> - In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses from the SP will be expensive.

I don't think there's ever more than one 8-byte slot; not a great loss.  And if we really wanted to, we could access the slot relative to fp.

> - The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).

We don't normally use fp anyway, unless the function has dynamic allocations; the legal negative offsets from fp are much smaller than the legal positive offsets from sp.  And if there are dynamic allocations, we often emit a base pointer anyway.

But on a related note, we end up forcing a base pointer in all cases with dynamic allocation and SVE spill slots, which I guess is a potential downside.
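To make the offset-range point concrete, here is a rough sketch of which byte offsets a single 8-byte load/store can encode. It assumes the usual A64 forms (`LDR`/`STR` with an unsigned 12-bit immediate scaled by the access size, and `LDUR`/`STUR` with a signed 9-bit unscaled immediate). The helper names are made up for illustration and are not LLVM functions.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of LLVM. Assumes 8-byte accesses by default.

// LDR/STR Xt, [Xn, #imm]: unsigned 12-bit immediate, scaled by the access size.
constexpr bool fitsScaledUnsigned(int64_t Offset, int64_t Size = 8) {
  return Offset >= 0 && Offset % Size == 0 && Offset / Size <= 4095;
}

// LDUR/STUR Xt, [Xn, #imm]: signed 9-bit immediate, not scaled.
constexpr bool fitsUnscaledSigned(int64_t Offset) {
  return Offset >= -256 && Offset <= 255;
}

// True if one load/store can reach Base + Offset without extra arithmetic.
constexpr bool isSingleInstructionAccess(int64_t Offset) {
  return fitsScaledUnsigned(Offset) || fitsUnscaledSigned(Offset);
}
```

Under these assumptions a positive offset from sp can reach up to 32760 bytes, while a negative offset from fp only reaches down to -256 bytes.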

> - Allocation of (non-SVE) stack space will always need to happen in separate steps, because it will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP, because the scalable area is inserted in the middle. Instead, the compiler needs to first allocate stack space for callee-saves, store callee-saves, and finally allocate the remaining stack space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.

On Cortex-A57 etc., the performance of pre/post-increment is basically the same as an extra arithmetic instruction, IIRC. So yes, it's slightly more expensive, but not by a lot.

> - The emergency scavenging will always need to be allocated near the SP (or BP), rather than FP. This is not really a problem, but more something that is different when the stack does not contain any SVE objects.

This is probably a one-line change, since we already do this in cases with stack realignment.

> - We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills or not to keep the layouts similar. Also the distance between FP and locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the FrameRecord within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for iOS and Windows calling conventions?

It doesn't matter on iOS. On Windows, the document describing unwind data actually claims the frame record is supposed to be allocated after the local variables for functions with dynamic stack allocations, but we currently don't implement that, and we haven't seen any issues. Maybe there's some interaction between C++ exceptions and dynamic allocation we don't implement correctly? I haven't really spent any time trying to break it, and dynamic allocations combined with C++ exception handling don't really show up in real-world code.

>> 1. the epilogue is cheaper (you don't need an addvl after restoring sp from fp)
> 
> In most cases however, LLVM chooses to restore the stack by incrementing the stack-pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than 1 add instruction). The exception seems to be when the stack is aligned > 16 bytes and it needs to restore it by using the frame-pointer. Do you know if this behaviour is intentional?

That isn't intentional, I think; probably just nobody noticed. Stack frames large enough to need more than one instruction to restore sp are rare, and frames that need more than two basically never happen.
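As a rough illustration of those frame sizes, the sketch below counts how many `add sp, sp, #imm` instructions a fixed-size adjustment needs, assuming the A64 add-immediate encoding (a 12-bit immediate, optionally shifted left by 12). The function name is hypothetical, and the sketch ignores any scalable SVE part of the frame.

```cpp
#include <cstdint>

// Hypothetical helper, not part of LLVM. Models only fixed-size adjustments.
// Returns the number of ADD (immediate) instructions needed to add Bytes to
// sp; 3 is a stand-in for "more than two" (in practice the constant would
// probably be materialized in a scratch register instead).
constexpr unsigned numAddImmInstrs(uint64_t Bytes) {
  return Bytes == 0 ? 0
         // Fits a plain 12-bit immediate, or a 12-bit immediate with LSL #12.
         : (Bytes <= 0xfff || ((Bytes & 0xfff) == 0 && (Bytes >> 12) <= 0xfff))
             ? 1
         // One shifted add plus one unshifted add covers anything below 16 MiB.
         : Bytes <= 0xffffff ? 2
                             : 3;
}
```

With this model, any frame under 4 KiB takes one instruction, and anything under 16 MiB takes at most two.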

>> 2. it's cheaper to access arguments passed on the stack
> 
> Correct.
> 
>> 3. it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.
> 
> Note that with the layout proposed in this patch, we can overcome that by extending the `16 byte` frame-record to be `n x 16 bytes <=> sizeof(1 SVE-vec spill)`, and access all SVE objects directly from `FP + 1 + Offset`.

Oh, that's clever, and I guess it's not that expensive.

>> 4. code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.
> 
> When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

Sorry, I didn't state this correctly. The key point is that if code isn't using frame pointers, we could emit a frame pointer for every function with SVE spill slots, and then get correct unwinding without an SVE-aware unwinder and without recompiling everything with frame pointers.

>> I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?
> 
> Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them being members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

Yes, that's what I was thinking.

https://reviews.llvm.org/D61437