[PATCH] D61437: [AArch64] Static (de)allocation of SVE stack objects.

Fri May 10 06:33:06 PDT 2019

sdesmalen added a comment.

Thanks for your suggestions @efriedma!

Just to double check: your suggested layout has the frame-record *after* the callee-saves. The current layout, however, puts the frame-record above the callee-saves. Are you suggesting we change that?

> I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.

The current layout for our HPC compiler was a trade-off between an efficient implementation of SVE spills/fills on the one hand and limiting our downstream debt on the other. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured we would simplify the code and reduce the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots.

I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering: 

- In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses to those gaps from the SP would have to cross the scalable SVE area and will therefore be expensive.
- The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).
- Allocation of (non-SVE) stack space will always need to happen in separate steps. It will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP, because the scalable area is inserted in the middle. Instead, the compiler needs to first allocate stack space for the callee-saves, then store the callee-saves, and finally allocate the remaining stack space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.
- The emergency scavenging slot will always need to be allocated near the SP (or BP), rather than the FP. This is not really a problem, but it does differ from what happens when the stack does not contain any SVE objects.
- We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills, to keep the layouts similar. Also, the distance between the FP and the locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the frame-record within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for the iOS and Windows calling conventions?
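
To make the first two points concrete, here is a rough sketch (not code from the patch) that models an offset as a fixed part plus a part that scales with the vector length. The `StackOffset` class, the `suggested_layout_offsets` helper and all sizes are hypothetical. It assumes the suggested layout with the FP pointing at the bottom of the callee-save area (i.e. the top of the SVE area) and the locals directly above the SP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackOffset:
    """Hypothetical model: a fixed byte offset plus a scalable byte offset."""
    fixed: int = 0      # bytes known at compile time
    scalable: int = 0   # bytes per unit of vscale (one SVE vector is 16 * vscale bytes)

    def __add__(self, other):
        return StackOffset(self.fixed + other.fixed, self.scalable + other.scalable)

    def __neg__(self):
        return StackOffset(-self.fixed, -self.scalable)

    def needs_vl_arithmetic(self):
        # A non-zero scalable part cannot be folded into a plain byte immediate.
        return self.scalable != 0

    def resolve(self, vscale):
        return self.fixed + self.scalable * vscale


def suggested_layout_offsets(locals_size, sve_vectors, local_off, csr_gap_off):
    """Assumed layout, high to low addresses: CSRs | SVE area | locals | SP."""
    sve_area = StackOffset(scalable=16 * sve_vectors)
    sp_to_fp = StackOffset(fixed=locals_size) + sve_area

    local_from_sp = StackOffset(fixed=local_off)    # local at SP + local_off
    local_from_fp = local_from_sp + (-sp_to_fp)     # has to cross the SVE area
    gap_from_fp = StackOffset(fixed=csr_gap_off)    # CSR gap at FP + csr_gap_off
    gap_from_sp = gap_from_fp + sp_to_fp            # has to cross the SVE area
    return {
        "local_from_sp": local_from_sp,
        "local_from_fp": local_from_fp,
        "gap_from_sp": gap_from_sp,
        "gap_from_fp": gap_from_fp,
    }
```

In this model a local is cheap to reach from the SP but picks up a scalable component from the FP, and a gap in the CSR area is cheap from the FP but picks up a scalable component from the SP.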

> 1. the epilogue is cheaper (you don't need an addvl after restoring sp from fp)

In most cases, however, LLVM chooses to restore the stack by incrementing the stack-pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than 1 add instruction). The exception seems to be when the stack is aligned to more than 16 bytes, in which case it needs to restore the SP from the frame-pointer. Do you know if this behaviour is intentional?
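
As a rough illustration of when adding sizeof(stack) takes more than one instruction: the AArch64 `ADD` (immediate) form encodes a 12-bit unsigned immediate, optionally shifted left by 12. The helper below is a hypothetical sketch that only models that encoding constraint. It is not what LLVM's epilogue emission actually does.

```python
def adds_needed_to_restore_sp(stack_size):
    """Count the ADD (immediate) instructions needed to add stack_size to SP.

    Returns None when the value does not fit in 24 bits, in which case the
    constant would first have to be materialised in a register.
    """
    if stack_size < 0:
        raise ValueError("stack size must be non-negative")
    if stack_size >= 1 << 24:
        return None
    low = stack_size & 0xFFF    # encodable as 'add sp, sp, #low'
    high = stack_size >> 12     # encodable as 'add sp, sp, #high, lsl #12'
    return (1 if low else 0) + (1 if high else 0)
```

For example, a 4104-byte frame needs two adds under this model, whereas restoring the SP from an available FP is a single instruction whenever the FP-to-SP distance fits one immediate.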

> 2. it's cheaper to access arguments passed on the stack

Correct.

> 3. it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.

Note that with the layout proposed in this patch, we can overcome that by extending the `16 byte` frame-record to be `n x 16 bytes <=> sizeof(1 SVE-vec spill)`, and access all SVE objects directly from `FP + 1 + Offset`.
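
A small sketch of that addressing idea follows. It assumes the frame-record is padded to exactly one SVE vector, that SVE slot `k` then sits `1 + k` vector lengths above the FP, and that the spill/fill uses the SVE `LDR`/`STR` (vector) form with a signed 9-bit immediate in `MUL VL` units. The function name and slot numbering are made up for illustration.

```python
def sve_slot_from_fp(slot_index, vscale):
    """Return (imm_mul_vl, byte_offset, fits_in_one_instruction) for an SVE slot.

    Assumes the frame-record occupies one full vector (n x 16 bytes with
    n == vscale), so slot 0 starts one vector length above the FP.
    """
    vl_bytes = 16 * vscale              # size of one SVE vector in bytes
    imm_mul_vl = 1 + slot_index         # skip over the padded frame-record
    byte_offset = imm_mul_vl * vl_bytes
    # Assumed immediate range of 'ldr z0, [x29, #imm, mul vl]': -256..255.
    fits_in_one_instruction = -256 <= imm_mul_vl <= 255
    return imm_mul_vl, byte_offset, fits_in_one_instruction
```

The immediate is independent of the vector length; only the resolved byte offset changes with `vscale`.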

> 4. code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

When using a frame-pointer, that is still the case with the proposed layout: the FP will always point to the frame-record, so the unwinder can always find the previous FP and LR, and the offsets to the (non-SVE) callee-saves will be unchanged.

> I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them to be members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61437/new/

https://reviews.llvm.org/D61437