[PATCH] D76570: [AArch64] Homogeneous Prolog and Epilog for Size Optimization

Thu Oct 1 09:21:52 PDT 2020

kyulee added a comment.

@efriedma Unlike other work that uses a helper call to runtime library, this work synthesizes outlined functions with a custom-calling convention. This is handy and flexible to extend this logic for other patterns or purposes like instrumentation, etc inside the helper within the compiler.

================
Comment at: llvm/lib/Target/AArch64/AArch64FrameLowering.cpp:240
+  if (Exit && getArgumentPopSize(MF, *Exit))
+    return false;
+  return true;
----------------
efriedma wrote:
> A long list of random conditions like this is asking for trouble: someone is inevitably going to miss some case where it needs to be updated in the future.  Is this really the only way?
This patch is meant to find a regular/homogeneous stack frame as possible that can be easily outlined for the size. In theory, all these random conditions/restrictions can be relaxed with more thorough implementation, but I think this may be good to start with. Let me know how I can improve this.

================
Comment at: llvm/test/CodeGen/AArch64/arm64-homogeneous-prolog-epilog-frame-tail.ll:71
+; CHECK-SAVELR:      mov     x16, x30
+; CHECK-SAVELR-NEXT: ldp     x29, x30, [sp], #16
+; CHECK-SAVELR-NEXT: stp     x20, x19, [sp, #-16]!
----------------
efriedma wrote:
> If we're going to save fp/lr in the caller, can't we just save them to the correct position, instead of copying them to a new location in the outlined function?
> 
> Also, in general, pre-increment instructions are more expensive then non-pre-increment instructions.  It would be better to rearrange the operations so you can emit exactly one pre-increment instruction, instead of making every instruction pre-increment.
I agree this is not ideal for the Linux case which has the different (opposite) order of CSR than Darwin/iOS which does not require this pattern (see the above without SAVELR case). The initial version disallowed this while supporting compact unwind case (Darwin) only but from @dmgreen suggestion, I added this Linux support, but tries to keep as the same pattern as possible instead of introducing different shape at the call-site. This is an important aspect when we want to extend this homogeneous function entry for other purpose (like instrumentation, etc) which actually I'm working on in another context.

I understand pre/post-inc/dec uses are more expensive, and this code can be optimized with additional complexity.
The goal is to minimize the code by exposing the homogeneous patterns and outlining them at the cost of performance -- in fact, outlining itself may hurt perf significantly on a CPU bound case although page-fault or working set is improved for the large binary with this aggressive size optimization.
@efriedma Can I follow up the specialized/optimized outlined function after this patch? Or I can work on the improved version in place now.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76570/new/

https://reviews.llvm.org/D76570