[PATCH] D151894: [AArch64] Neoverse V2 scheduling model

Mon Jun 5 07:35:04 PDT 2023

rjj marked 2 inline comments as done and an inline comment as not done.
rjj added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64InstrFormats.td:10865

-  def i16_indexed : BaseSIMDIndexedTied<1, U, 1, 0b01, opc,
-                                        FPR16Op, FPR16Op, V128_lo,
-                                        VectorIndexH, asm, ".h", "", "", ".h",
-                                        []> {
+  def v1i16_indexed : BaseSIMDIndexedTied<1, U, 1, 0b01, opc,
+                                          FPR16Op, FPR16Op, V128_lo,
----------------
dmgreen wrote:
> Can you do this in a separate patch, in case it causes problems.
Yep of course, done (D152161).

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td:202
+  let NumMicroOps = 2;
+  let ResourceCycles = [1, 3];  // LDPSW
+}
----------------
dmgreen wrote:
> Should this use the load unit for 3 ResourceCycles, as opposed to being pipelined?
You are right, changed to `SchedWriteRes<[V2UnitI, V2UnitL, V2UnitL, V2UnitL]>`.

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td:896
+
+def V2Write_LdrHQ : SchedWriteVariant<[
+                      SchedVar<NeoverseHQForm,  [V2Write_7cyc_1I_1L]>,
----------------
dmgreen wrote:
> Can you explain where the differences between h/q and the other sizes come from?
It comes from the software optimisation guide (SOG), https://developer.arm.com/documentation/PJDOC-466751330-593177/r0p2, p. 24.

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td:985
+// ALU, basic, flagset
+def : SchedAlias<WriteI,     V2Write_1cyc_1I>;
+
----------------
huntergr wrote:
> The flag setting variants use the 'F' pipelines rather than 'I'. The others do use 'I' though, so perhaps a predicate would work here.
Thanks, I've updated the model to use the 'F' pipelines in the cases you pointed out. I do have a question, though: according to the SOG, the throughput of these instructions is 3 per cycle rather than 4, even though there are 4 pipelines available. Do you have any idea why, or how we could model this accurately?

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td:1026-1027
+// SDIV, UDIV
+def : SchedAlias<WriteID32,  V2Write_12cyc_1M0>;
+def : SchedAlias<WriteID64,  V2Write_20cyc_1M0>;
+
----------------
dmgreen wrote:
> 12 and 20 are worst-case times. Would a value more in the middle of the range be better?
Sure, how about latencies of 8 and 12 cycles respectively? Do you have a better suggestion? And what about the throughput: should it become 1/8 and 1/12 to match?
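
To make the question concrete, a mid-range variant might look roughly like the sketch below. The record names ending in `_sketch` are hypothetical, `V2UnitM0` is assumed to be the M0 pipeline resource already defined in this model, and the fragment is assumed to sit inside the model's `let SchedModel = ... in` scope. The values 8 and 12 are only the tentative numbers from this reply, not figures taken from the SOG. Holding the unit for as many cycles as the latency assumes the divider is not pipelined, which is what would give throughputs of 1/8 and 1/12.

```
// Sketch only: hypothetical names, tentative mid-range numbers.
def V2Write_8cyc_1M0_sketch : SchedWriteRes<[V2UnitM0]> {
  let Latency        = 8;     // result available after 8 cycles
  let ResourceCycles = [8];   // M0 busy for 8 cycles => throughput 1/8
}
def V2Write_12cyc_1M0_sketch : SchedWriteRes<[V2UnitM0]> {
  let Latency        = 12;    // result available after 12 cycles
  let ResourceCycles = [12];  // M0 busy for 12 cycles => throughput 1/12
}

// SDIV, UDIV
def : SchedAlias<WriteID32, V2Write_8cyc_1M0_sketch>;
def : SchedAlias<WriteID64, V2Write_12cyc_1M0_sketch>;
```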

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedNeoverseV2.td:1035
+// Multiply long
+// NOTE: SOG p. 16, n. 2: How to specify late-forwarding between similar ops?
+def : InstRW<[V2Write_Mul], (instregex "^M(ADD|SUB)[WX]rrr$")>;
----------------
dmgreen wrote:
> It is usually done with read advances.
Thanks, I'll have a look. If you have any pointers to existing models that use read advances to model forwarding between instructions such as `madd`, that would be greatly appreciated!
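
For reference, my current understanding of the general shape of a read advance is sketched below; please correct me if this is not what you meant. The names `V2Wr_IMA_sketch` and `V2Rd_IMA_sketch` are hypothetical, the 2-cycle latency and 1-cycle advance are placeholders rather than SOG numbers, and `V2UnitM0` is assumed to be a resource defined elsewhere in the model. `ReadIM` is the existing generic AArch64 multiply-operand read.

```
// Sketch only: hypothetical names and placeholder cycle counts.
def V2Wr_IMA_sketch : SchedWriteRes<[V2UnitM0]> { let Latency = 2; }

// A read that sees results written by V2Wr_IMA_sketch one cycle early,
// i.e. the producer's effective latency is reduced by 1 for this operand.
def V2Rd_IMA_sketch : SchedReadAdvance<1, [V2Wr_IMA_sketch]>;

// Operand order: one write for the def (Rd), then one read per use
// (Rn, Rm, Ra). Only the accumulator read (Ra) gets the advance.
def : InstRW<[V2Wr_IMA_sketch, ReadIM, ReadIM, V2Rd_IMA_sketch],
             (instregex "^M(ADD|SUB)[WX]rrr$")>;
```

Restricting the advance to the listed write means only a chain of these multiply-accumulates benefits from it; a producer using any other write would still see its full latency.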

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Neoverse/V2-neon-instructions.s:399
+fsub v0.2s, v0.2s, v0.2s
+ld1 { v0.16b }, [x0]
+ld1 { v0.2d, v1.2d, v2.2d }, [x0], #48
----------------
dmgreen wrote:
> Add more ldr tests perhaps.
I added a few more for H-form LDRs, but if you're referring to the FP loads they should be here already (you can grep for `ldr\s[hwxq]`).

https://reviews.llvm.org/D151894