[PATCH] D76570: [AArch64] Homogeneous Prolog and Epilog for Size Optimization

Wed Jan 6 14:40:11 PST 2021

kyulee marked 7 inline comments as done.
kyulee added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:215-217
+      .addMemOperand(
+          MF.getMachineMemOperand(MachinePointerInfo::getUnknownStack(MF),
+                                  MachineMemOperand::MOStore, 8, Align(8)))
----------------
t.p.northover wrote:
> What does adding two identical mem-operands achieve? I assumed it would be treated identically to just one.
In fact, these mem-operands are not needed so I just delete them.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:305
+    assert(LRIter != Regs.end());
+    int SPAdjust = Size - std::distance(Regs.begin(), LRIter) - 2;
+    // Save the the first CSR with pre-increment of SP.
----------------
t.p.northover wrote:
> This calculation is quite dense and the dynamic sequence of stores we're aiming for isn't really explicit in any one location (comment at the top of the file?). I had to copy/paste it into a main.cpp and run some test cases before I really got what it was trying to do.
> 
> Also, and I think this is part of the same problem, it's not obvious that we even save all registers we need to. The loop looks like it's going to skip the last two elements unless the `if` is executed, and I had to really think to convince myself that matched the case where the last two are LR/FP and so already saved.
> 
> Sitcking to the separate `emitStore` calls it might be better as
> 
>     // If the register stored to the lowest address is not LR, we must subtract more from SP here.
>     if (LRIdx != Size - 2)
>       emitStore(...);
> 
> though there's an argument for merging them
> 
>     // If LR is not stored at the lowest address (i.e. the final pair in the list) then
>     // the caller will only have subtracted part of the callee save area from SP. 
>     // Our first store must do the rest.
>     int RemainingSPAdjustment = Size - 2 - LRIdx;
> 
>     for (int I = Size - 2; I >= 0; I -= 2) {
>       if (Regs[I] == AArch64::LR)
>         continue;
>       emitStore(..., Regs[I], Regs[I+1], Size - 2 - I - RemainingSPAdjustment, RemainingSPAdjustment != 0);
>       RemainingSPAdjustment = 0;
>     }
I took the former approach for the readability even though the latter merges all these code.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:353
+      // Branch on X16 not to trash LR.
+      BuildMI(MBB, MBB.end(), DebugLoc(), TII.get(AArch64::BR))
+          .addUse(AArch64::X16);
----------------
t.p.northover wrote:
> I know we're not hugely concerned about performance (the indexing is minor, for example), but this seems to misalign the BL/RET chain that CPUs often hard-code into their branch predictors. I seem to remember reading that's an absolutely horrible thing to do to a CPU.
> 
> Fortunately, the native `AArch64::RET` instruction does actually take a register operand (it's only difference from `BR` is precisely that it's a hint to the CPU that you're popping from any microarchitectural return-address stack it may be keeping). So probably best to use that.
Nice to know this! I used RET instead of BR.

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:367
+/// @return a valid non-negative adjustment. Or -1 for any other case.
+int getFpAdjustmentFromSp(MachineBasicBlock::iterator &MBBI) {
+  MachineInstr &MI = *MBBI;
----------------
t.p.northover wrote:
> Is this really a dynamic property? The frame layout is pretty solidly dictated by the ABI so the offset is known from the list of registers we're saving.
> 
> Even if it is dynamic, perhaps it's best to encode into `HOM_prolog` as an operand rather than re-deriving it by grubbing around the basic block.
I just passed FPOffset as additional argument to HOM_Prolog. Indeed your suggestion cleans up the code quite a bit. Thanks!

================
Comment at: llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp:406
+  // # of instructions that will be outlined.
+  int InstCount = RegCount >> 1;
+
----------------
smeenai wrote:
> t.p.northover wrote:
> > Let the compiler convert a divide to a shift, it's not the 80s.
> If you're compiling without assertions, the compiler wouldn't be able to just generate a shift directly (because it has to account for negative numbers), right? You'd need something like a `__builtin_assume` to get just the shift to be generated. (See https://godbolt.org/z/vzx9hx)
Thanks for the point. As shown in https://godbolt.org/z/Wo5r4W, I use a division with unsigned type for readability, but it becomes a shift.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76570/new/

https://reviews.llvm.org/D76570