[PATCH] D76570: [AArch64] Homogeneous Prolog and Epilog for Size Optimization

Thu Oct 1 10:30:29 PDT 2020

efriedma added a comment.

Should this be controlled by the clang -moutline/-mno-outline?

================
Comment at: llvm/lib/Target/AArch64/AArch64FrameLowering.cpp:240
+  if (Exit && getArgumentPopSize(MF, *Exit))
+    return false;
+  return true;
----------------
kyulee wrote:
> efriedma wrote:
> > A long list of random conditions like this is asking for trouble: someone is inevitably going to miss some case where it needs to be updated in the future.  Is this really the only way?
> This patch is meant to find a regular/homogeneous stack frame as possible that can be easily outlined for the size. In theory, all these random conditions/restrictions can be relaxed with more thorough implementation, but I think this may be good to start with. Let me know how I can improve this.
I'm not sure how hasVarSizedObjects() or needsStackRealignment() are relevant here.

For the other bits, I guess they're directly related to what you're doing.  Maybe leave comments in the relevant parts of prologue and epilogue lowering noting that they need to stay in sync with homogeneousPrologEpilog.

================
Comment at: llvm/test/CodeGen/AArch64/arm64-homogeneous-prolog-epilog-frame-tail.ll:71
+; CHECK-SAVELR:      mov     x16, x30
+; CHECK-SAVELR-NEXT: ldp     x29, x30, [sp], #16
+; CHECK-SAVELR-NEXT: stp     x20, x19, [sp, #-16]!
----------------
kyulee wrote:
> efriedma wrote:
> > If we're going to save fp/lr in the caller, can't we just save them to the correct position, instead of copying them to a new location in the outlined function?
> > 
> > Also, in general, pre-increment instructions are more expensive then non-pre-increment instructions.  It would be better to rearrange the operations so you can emit exactly one pre-increment instruction, instead of making every instruction pre-increment.
> I agree this is not ideal for the Linux case which has the different (opposite) order of CSR than Darwin/iOS which does not require this pattern (see the above without SAVELR case). The initial version disallowed this while supporting compact unwind case (Darwin) only but from @dmgreen suggestion, I added this Linux support, but tries to keep as the same pattern as possible instead of introducing different shape at the call-site. This is an important aspect when we want to extend this homogeneous function entry for other purpose (like instrumentation, etc) which actually I'm working on in another context.
> 
> I understand pre/post-inc/dec uses are more expensive, and this code can be optimized with additional complexity.
> The goal is to minimize the code by exposing the homogeneous patterns and outlining them at the cost of performance -- in fact, outlining itself may hurt perf significantly on a CPU bound case although page-fault or working set is improved for the large binary with this aggressive size optimization.
> @efriedma Can I follow up the specialized/optimized outlined function after this patch? Or I can work on the improved version in place now.
> 
If fixing this is going to substantially change the way the implementation looks, I'd rather not review it twice.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76570/new/

https://reviews.llvm.org/D76570