[PATCH] D152828: [MachineSink][AArch64] Sink instruction copies when they can replace copy into hard register or folded into addressing mode

Fri Sep 8 08:22:43 PDT 2023

chill marked 2 inline comments as done.
chill added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64InstrInfo.cpp:2926
+      return false;
+    // Shift 1 (scale 2) in address is one extra cycle and one extra unit on
+    // some CPUs.
----------------
dmgreen wrote:
> I believe it is shifts of 1 or 4 that would be more expensive for OoO cores, but the other shift types are also cheap. AddrLSLFast means any addressing mode with a LSL with shift <= 3 are cheap. ALULSLFast means adds/subs with LSL<=4 are fast.
> 
> I think the logic should be similar to that in DAGCombine (ignoring register pressure for a moment). If we are optimizing for size or there are no other uses the fold should be beneficial. Otherwise we treat it as cheap if we have AddrLSLFast and the shift is <= 3. An ADDXrs could take 2 cycles anyway so could be more aggressive?
> 
> Does this take into account the number of uses, and should it? Should it fold more under Optsize?
Yeah, that makes sense.  We already perform the transformation only if the original instruction is removed (i.e. can be folded into all of its users), if we are also optimizing for size, then we can ignore here the (minor) increase on the cycle count, if that would increase the chances for the transformation happening.

================
Comment at: llvm/lib/Target/AArch64/AArch64InstrInfo.cpp:2863
+    if (Shift == 1 && !Subtarget.hasLSLFast())
+      return false;
+    return canFoldAddRegIntoAddrMode(1 << Shift);
----------------
dmgreen wrote:
> I think it's worth clarifying how these should work with LSLFast and addressing operands. I had an explanation of how I thought LSLFast should work in https://reviews.llvm.org/D155470#4527270. Let me know if you think doesn't sound right. There is a patch to split LSLFast into multiple parts as a first step in https://reviews.llvm.org/D157982.
I guess we can be more precise by distinguishing between cores where shifts 0, 1, 2, and 3 are fast
and cores where just shifts 0, 2, and 3 are fast. Not sure if it's worth adding an extra feature. Lacking such a feature, for now the code considers shift 1 to be slow by default, but it could just as well consider shift 1 to be fast by default.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152828/new/

https://reviews.llvm.org/D152828