[PATCH] D157279: [CodeGen] Disable LD1RX instructions generation for Neoverse-V1

Paul Walker via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 7 07:11:46 PDT 2023

paulwalker-arm added inline comments.

Comment at: llvm/lib/Target/AArch64/AArch64.td:151
+def FeatureAvoidLD1R : SubtargetFeature<"avoid-ld1r",
+  "AvoidLD1R", "true", "Prefer LDR(LDP)+MOV ove LD1RX">;
This doesn't accurately represent the "feature" you want to model.

Firstly, the effect is specific to SVE rather than a desire to avoid all uses of ld1r (and, as I reason below, we might also want to keep the integer variants as they are).

Secondly, the issue is that on Neoverse V1 there are fewer LS pipelines for SVE than for NEON.  This is not normally a problem because the SVE registers are twice the size of the NEON ones, so the overall bandwidth is greater.  However, when loading 128-bit or smaller datatypes the bandwidth switches in favour of NEON with its extra LS pipe.  (Note that on V1 the latency of LD1R is the same as LDR+DUP.)

The choice is yours but, as a minimum and assuming we agree about the integer variants, I'd be happy with "sve-avoid-fp-ld1r"; if there's a nice way to sum up the second point above then that would be perfect.
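For illustration, a sketch of how the renamed feature could look, reusing the shape of the def from the patch (the record name, field name and description string below are placeholders, not a concrete proposal):

```tablegen
def FeatureAvoidLD1RFP : SubtargetFeature<"sve-avoid-fp-ld1r",
  "AvoidLD1RFP", "true",
  "Prefer LDR+DUP over LD1R for SVE floating-point splat loads">;
```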

Comment at: llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td:2334
-  // LDR1 of 8-bit data
-  defm : LD1RPat<nxv16i8, extloadi8,  LD1RB_IMM,    PTRUE_B, i32, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv8i16, zextloadi8, LD1RB_H_IMM,  PTRUE_H, i32, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv4i32, zextloadi8, LD1RB_S_IMM,  PTRUE_S, i32, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv2i64, zextloadi8, LD1RB_D_IMM,  PTRUE_D, i64, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv8i16, sextloadi8, LD1RSB_H_IMM, PTRUE_H, i32, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv4i32, sextloadi8, LD1RSB_S_IMM, PTRUE_S, i32, am_indexed8_6b, uimm6s1>;
-  defm : LD1RPat<nxv2i64, sextloadi8, LD1RSB_D_IMM, PTRUE_D, i64, am_indexed8_6b, uimm6s1>;
-  // LDR1 of 16-bit data
-  defm : LD1RPat<nxv8i16, extloadi16,  LD1RH_IMM,    PTRUE_H, i32, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv4i32, zextloadi16, LD1RH_S_IMM,  PTRUE_S, i32, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv2i64, zextloadi16, LD1RH_D_IMM,  PTRUE_D, i64, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv4i32, sextloadi16, LD1RSH_S_IMM, PTRUE_S, i32, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv2i64, sextloadi16, LD1RSH_D_IMM, PTRUE_D, i64, am_indexed16_6b, uimm6s2>;
-  // LDR1 of 32-bit data
-  defm : LD1RPat<nxv4i32, load,        LD1RW_IMM,   PTRUE_S, i32, am_indexed32_6b, uimm6s4>;
-  defm : LD1RPat<nxv2i64, zextloadi32, LD1RW_D_IMM, PTRUE_D, i64, am_indexed32_6b, uimm6s4>;
-  defm : LD1RPat<nxv2i64, sextloadi32, LD1RSW_IMM,  PTRUE_D, i64, am_indexed32_6b, uimm6s4>;
-  // LDR1 of 64-bit data
-  defm : LD1RPat<nxv2i64, load, LD1RD_IMM, PTRUE_D, i64, am_indexed64_6b, uimm6s8>;
-  // LD1R of FP data
-  defm : LD1RPat<nxv8f16, load, LD1RH_IMM,   PTRUE_H, f16, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv4f16, load, LD1RH_S_IMM, PTRUE_S, f16, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv2f16, load, LD1RH_D_IMM, PTRUE_D, f16, am_indexed16_6b, uimm6s2>;
-  defm : LD1RPat<nxv4f32, load, LD1RW_IMM,   PTRUE_S, f32, am_indexed32_6b, uimm6s4>;
-  defm : LD1RPat<nxv2f32, load, LD1RW_D_IMM, PTRUE_D, f32, am_indexed32_6b, uimm6s4>;
-  defm : LD1RPat<nxv2f64, load, LD1RD_IMM,   PTRUE_D, f64, am_indexed64_6b, uimm6s8>;
+  let Predicates = [UseLD1R] in {
+    // LDR1 of 8-bit data
I believe this should be restricted to only the floating-point patterns.  The issue relates to instruction bandwidth, and the GPR variants of DUP on V1 go down a single pipe and so are likely worse than the LS bandwidth issue you're trying to prevent.
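To sketch what I mean, using the pattern defs from the diff above (the predicate name is the one from the patch; whether it guards these exact defs is up to you):

```tablegen
  // Integer LD1RPat patterns stay unconditional, e.g.:
  defm : LD1RPat<nxv16i8, extloadi8, LD1RB_IMM, PTRUE_B, i32, am_indexed8_6b, uimm6s1>;

  // Only the FP splat-load patterns are gated on the new predicate.
  let Predicates = [UseLD1R] in {
    defm : LD1RPat<nxv8f16, load, LD1RH_IMM, PTRUE_H, f16, am_indexed16_6b, uimm6s2>;
    defm : LD1RPat<nxv4f32, load, LD1RW_IMM, PTRUE_S, f32, am_indexed32_6b, uimm6s4>;
    defm : LD1RPat<nxv2f64, load, LD1RD_IMM, PTRUE_D, f64, am_indexed64_6b, uimm6s8>;
  }
```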

  rG LLVM Github Monorepo
