[PATCH] D155470: [AArch64] LSLFast to fold onto base address by default

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 24 03:01:28 PDT 2023


dmgreen added a comment.

Sorry for the delay. I've been looking at something very related recently (folding extends into address operands), so I may have become overly opinionated.

I think that it makes sense to split LSLFast into an address part and an ALU part, but not as part of this patch. From the optimization guides it then looks like we have four cases for whether it is or isn't better to fold into the addressing operands (with multiple uses); a rough sketch follows the list:

- None. The current default: if it has multiple uses, don't fold it.
- LSLFast. The current LSLFast, which folds LSLs (not extends), and is used for the original cases of LSLFast on Kryo and Falkor.
- Ext23Fast. Shifts of 2 and 3 (scales of 4 and 8) are cheap, the other two are not; all extends are cheap. Used for all Arm OoO cores.
- AllFast. Everything is cheap: scales of 2 and 16 along with those above. Used for in-order cores, I think.
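
A rough C++ sketch of those four policies (the enum, helper name, and parameters are all hypothetical, not LLVM's actual API; this just encodes the table above):

```
// Hypothetical illustration of the four folding policies; these names do
// not exist in LLVM, they just encode the cases listed above.
enum class AddrFoldKind { None, LSLFast, Ext23Fast, AllFast };

// Scale is the byte scale of the addressing mode (1/2/4/8/16); IsExtend is
// true for a sign/zero-extended offset, false for a plain LSL.
static bool isCheapToFoldIntoAddr(AddrFoldKind K, unsigned Scale,
                                  bool IsExtend) {
  switch (K) {
  case AddrFoldKind::None:
    return false;     // multiple uses: never fold
  case AddrFoldKind::LSLFast:
    return !IsExtend; // LSLs are cheap, extends are not
  case AddrFoldKind::Ext23Fast:
    // All extends, plus shifts of 2 and 3 (scales 4 and 8); scale 1 is
    // trivially fine too.
    return IsExtend || Scale == 1 || Scale == 4 || Scale == 8;
  case AddrFoldKind::AllFast:
    return true;      // everything is cheap
  }
  return false;
}
```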

It looks like Ext23Fast should be the default for -mcpu=generic, as a conservative compromise between the current default and AllFast. (There is a chance that AllFast is better on all cores, but it looks like it takes the load/store pipe for an extra cycle, and the ones I tested had an extra cycle of latency.) This patch seems to essentially switch the default to AllFast, with a target feature that makes LSL1 slow again (but not LSL4)? I like changing the default, but I'm not sure we can change it to AllFast for all CPUs.



================
Comment at: llvm/lib/Target/AArch64/AArch64.td:386
 
+def FeatureShiftBy2Slow : SubtargetFeature<
+    "shift-2-slow", "HasSlowShift2", "true",
----------------
Can we add Addr to the name of this feature, to make clear that it is about address operands, not add+lsl's? Should we also use Scale2 or Shift1?
From looking at the optimization guides and what we model in the scheduling model (https://github.com/llvm/llvm-project/blob/f6bdfb0b92690403ceef8c1d58adf7a3a733b543/llvm/lib/Target/AArch64/AArch64SchedNeoverseN1.td#L490), it looks like this should be slower for scale2 and scale16; scale4 and scale8 (and scale1, that one's easy) are fast.
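
For illustration, a minimal sketch of that rule (the predicate name is made up, not part of this patch):

```
// Hypothetical predicate encoding the scheduling model linked above:
// scaled register-offset addressing is slow for scale 2 and scale 16;
// scales 1, 4 and 8 are fast.
static bool isSlowAddrScale(unsigned ScaleBytes) {
  return ScaleBytes == 2 || ScaleBytes == 16;
}
```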


================
Comment at: llvm/lib/Target/AArch64/AArch64.td:774
                                    FeaturePostRAScheduler,
+                                   FeatureShiftBy2Slow,
                                    FeatureFuseAddress]>;
----------------
Should the A55 really get FeatureShiftBy2Slow? I don't see from its optimization guide that this case is slower.


================
Comment at: llvm/lib/Target/AArch64/AArch64.td:786
                                    "Cortex-A57 ARM processors", [
                                    FeatureFuseAES,
                                    FeatureBalanceFPOps,
----------------
From what I can see in the optimization guides, almost all AArch64 Arm CPUs (except for the Cortex-A510) say that the latency of `Load vector reg, register offset, scale, S/D-form` is a cycle lower than for `Load vector reg, register offset, scale, H/Q-form`.
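
To make the form/scale mapping concrete, a hedged sketch (the helper name and exact mnemonics are illustrative; only the one-cycle delta comes from the guides):

```
// Illustrative helper (made-up name): the extra latency of a scaled
// register-offset vector load relative to the S/D-form, per the
// optimization-guide wording quoted above.
static unsigned extraVecLoadLatency(unsigned ScaleBytes) {
  switch (ScaleBytes) {
  case 4:  // S-form, e.g. ldr s0, [x0, x1, lsl #2]
  case 8:  // D-form, e.g. ldr d0, [x0, x1, lsl #3]
    return 0;
  case 2:  // H-form, e.g. ldr h0, [x0, x1, lsl #1]
  case 16: // Q-form, e.g. ldr q0, [x0, x1, lsl #4]
    return 1; // one cycle slower on most Arm OoO cores (not Cortex-A510)
  default:
    return 0; // B-form / unscaled: no penalty modelled here
  }
}
```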


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp:1230
                                            MVT::i32);
     if (isWorthFolding(LHS))
       return true;
----------------
Should this and the one below be true too?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155470/new/

https://reviews.llvm.org/D155470


