[PATCH] D152688: [Aarch64] Add Cortex-A510 specific scheduling

Mon Jun 12 15:17:36 PDT 2023

evandro added a comment.

Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:20
+  let IssueWidth = 3;         // It dual-issues under most circumstances
+  let LoadLatency = 3;        // Cycles for loads to access the cache. The
+                              // optimisation guide shows that most loads have
----------------
Integer loads take 2 cycles, which would be the more sensible value to use here.  The comment should be improved too.
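
As a rough sketch of the suggestion, the field inside the machine model definition could read as follows. The comment wording is only an illustration, not text from the patch:

```
  // Value suggested in this review: integer loads take 2 cycles.
  let LoadLatency = 2;        // Cycles for integer loads to access the cache
```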

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:44
+
+def CortexA510UnitALU0   : ProcResource<1> { let BufferSize = 0; } // Int ALU0
+def CortexA510UnitALU12  : ProcResource<2> { let BufferSize = 0; } // Int ALU1 & ALU2
----------------
`BufferSize` should be in a block outside all these `ProcResource` lines.
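
A minimal sketch of what this could look like, using only the two resources quoted above (the remaining `ProcResource` definitions would move into the same block):

```
// Set BufferSize once for every resource defined inside the block.
let BufferSize = 0 in {
  def CortexA510UnitALU0   : ProcResource<1>; // Int ALU0
  def CortexA510UnitALU12  : ProcResource<2>; // Int ALU1 & ALU2
}
```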

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:56
+// instructions, which can mostly be dual-issued; that's why for now we model
+// them with 2 resources.
+def CortexA510UnitVALU0  : ProcResource<1> { let BufferSize = 0; } // SIMD/FP/SVE ALU0
----------------
You modeled the optional 128-bit wide VPU.  In my opinion, it would be better to model the worst case, the narrower 64-bit wide VPU.  That way, code scheduled for the narrow VPU will typically run even better on the wide VPU.

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:67
+// (the software optimisation guide lists latencies taking into account
+// typical forwarding paths).
+def : WriteRes<WriteImm, [CortexA510UnitALU]> { let Latency = 1; }    // MOVN, MOVZ
----------------
I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.
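
Forwarding is commonly expressed in LLVM scheduling models with `ReadAdvance`.  The following is only a sketch: the choice of `ReadIMA` and the 2-cycle advance are illustrative assumptions, not figures from the optimisation guide.

```
// Hypothetical example: the accumulator operand of a multiply-accumulate
// is read 2 cycles after issue when it is produced by another multiply,
// which shortens the effective latency between the two instructions.
def : ReadAdvance<ReadIMA, 2, [WriteIM32, WriteIM64]>;
```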

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:77
+def : WriteRes<WriteIM32, [CortexA510UnitMAC]> { let Latency = 3; }   // 32-bit Multiply
+def : WriteRes<WriteIM64, [CortexA510UnitMAC]> { let Latency = 4; }   // 64-bit Multiply
+
----------------
64-bit multiplication takes 4 to 5 cycles.  It's usually better to be early than late, so 5 would be a more sensible value here.  It also occupies the resource for 2 or 3 cycles, with 2 being a sensible value for `ResourceCycles`.
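
Applied to the line quoted above, the suggestion might look like this sketch:

```
def : WriteRes<WriteIM64, [CortexA510UnitMAC]> {
  let Latency = 5;          // Upper end of the 4 to 5 cycle range
  let ResourceCycles = [2]; // MAC unit is occupied for 2 cycles
}   // 64-bit Multiply
```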

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152688/new/

https://reviews.llvm.org/D152688