[PATCH] D76570: [AArch64] Homogeneous Prolog and Epilog for Size Optimization

Wed Nov 11 05:46:44 PST 2020

t.p.northover added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:215-217
+      .addMemOperand(
+          MF.getMachineMemOperand(MachinePointerInfo::getUnknownStack(MF),
+                                  MachineMemOperand::MOStore, 8, Align(8)))
----------------
What does adding two identical mem-operands achieve? I assumed it would be treated identically to just one.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:303
+    // Compute the remaining SP adjust beyond FP/LR.
+    auto LRIter = std::find(Regs.begin(), Regs.end(), AArch64::LR);
+    assert(LRIter != Regs.end());
----------------
I think it would be clearer if the `LRIter` became `LRIdx`, folding the std::distance thing into the same line as the `find` (or by another means). We don't really use it as an iterator and spreading the calculation out confuses me.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:305
+    assert(LRIter != Regs.end());
+    int SPAdjust = Size - std::distance(Regs.begin(), LRIter) - 2;
+    // Save the the first CSR with pre-increment of SP.
----------------
This calculation is quite dense and the dynamic sequence of stores we're aiming for isn't really explicit in any one location (comment at the top of the file?). I had to copy/paste it into a main.cpp and run some test cases before I really got what it was trying to do.

Also, and I think this is part of the same problem, it's not obvious that we even save all registers we need to. The loop looks like it's going to skip the last two elements unless the `if` is executed, and I had to really think to convince myself that matched the case where the last two are LR/FP and so already saved.

Sitcking to the separate `emitStore` calls it might be better as

    // If the register stored to the lowest address is not LR, we must subtract more from SP here.
    if (LRIdx != Size - 2)
      emitStore(...);

though there's an argument for merging them

    // If LR is not stored at the lowest address (i.e. the final pair in the list) then
    // the caller will only have subtracted part of the callee save area from SP. 
    // Our first store must do the rest.
    int RemainingSPAdjustment = Size - 2 - LRIdx;

    for (int I = Size - 2; I >= 0; I -= 2) {
      if (Regs[I] == AArch64::LR)
        continue;
      emitStore(..., Regs[I], Regs[I+1], Size - 2 - I - RemainingSPAdjustment, RemainingSPAdjustment != 0);
      RemainingSPAdjustment = 0;
    }

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:353
+      // Branch on X16 not to trash LR.
+      BuildMI(MBB, MBB.end(), DebugLoc(), TII.get(AArch64::BR))
+          .addUse(AArch64::X16);
----------------
I know we're not hugely concerned about performance (the indexing is minor, for example), but this seems to misalign the BL/RET chain that CPUs often hard-code into their branch predictors. I seem to remember reading that's an absolutely horrible thing to do to a CPU.

Fortunately, the native `AArch64::RET` instruction does actually take a register operand (it's only difference from `BR` is precisely that it's a hint to the CPU that you're popping from any microarchitectural return-address stack it may be keeping). So probably best to use that.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:367
+/// @return a valid non-negative adjustment. Or -1 for any other case.
+int getFpAdjustmentFromSp(MachineBasicBlock::iterator &MBBI) {
+  MachineInstr &MI = *MBBI;
----------------
Is this really a dynamic property? The frame layout is pretty solidly dictated by the ABI so the offset is known from the list of registers we're saving.

Even if it is dynamic, perhaps it's best to encode into `HOM_prolog` as an operand rather than re-deriving it by grubbing around the basic block.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:406
+  // # of instructions that will be outlined.
+  int InstCount = RegCount >> 1;
+
----------------
Let the compiler convert a divide to a shift, it's not the 80s.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76570/new/

https://reviews.llvm.org/D76570