[PATCH] D152688: [Aarch64] Add Cortex-A510 specific scheduling
Evandro Menezes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Jun 12 15:17:36 PDT 2023
evandro added a comment.
Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?
================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:20
+ let IssueWidth = 3; // It dual-issues under most circumstances
+ let LoadLatency = 3; // Cycles for loads to access the cache. The
+ // optimisation guide shows that most loads have
----------------
Integer loads take 2 cycles, which would be a more sensible value to use here. The comment should be improved too.
================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:44
+
+def CortexA510UnitALU0 : ProcResource<1> { let BufferSize = 0; } // Int ALU0
+def CortexA510UnitALU12 : ProcResource<2> { let BufferSize = 0; } // Int ALU1 & ALU2
----------------
`BufferSize` should be set once, in a `let` block enclosing all these `ProcResource` definitions, rather than repeated on each line.
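Something along these lines, as an untested sketch. It reuses the two definitions quoted above and assumes the remaining `ProcResource` definitions in the file would move into the same block:

```
// Sketch only: BufferSize = 0 is hoisted into one enclosing let block,
// so each ProcResource definition no longer repeats it.
let BufferSize = 0 in {
  def CortexA510UnitALU0  : ProcResource<1>; // Int ALU0
  def CortexA510UnitALU12 : ProcResource<2>; // Int ALU1 & ALU2
}
```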
================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:56
+// instructions, which can mostly be dual-issued; that's why for now we model
+// them with 2 resources.
+def CortexA510UnitVALU0 : ProcResource<1> { let BufferSize = 0; } // SIMD/FP/SVE ALU0
----------------
You modeled the optional 128-bit wide VPU. In my opinion, it would be better to model the worst case, the narrower 64-bit wide VPU. That way, code scheduled for the narrow VPU will typically still run well on the wide VPU.
================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:67
+// (the software optimisation guide lists latencies taking into account
+// typical forwarding paths).
+def : WriteRes<WriteImm, [CortexA510UnitALU]> { let Latency = 1; } // MOVN, MOVZ
----------------
I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.
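As a rough illustration of what that could look like, forwarding is usually expressed with `ReadAdvance`. The sketch below uses the `ReadIMA`, `WriteIM32` and `WriteIM64` definitions from `AArch64Schedule.td`. The advance of 1 cycle is a placeholder I picked for illustration, not a number from the optimisation guide, and it would replace any existing `ReadAdvance` for `ReadIMA` in the model:

```
// Sketch only, with an assumed (not measured) cycle count.
// The accumulator operand of a multiply-accumulate is read 1 cycle after
// issue, so a dependent MADD sees the result of an earlier multiply
// 1 cycle sooner than the multiply's nominal latency.
def : ReadAdvance<ReadIMA, 1, [WriteIM32, WriteIM64]>;
```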
================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:77
+def : WriteRes<WriteIM32, [CortexA510UnitMAC]> { let Latency = 3; } // 32-bit Multiply
+def : WriteRes<WriteIM64, [CortexA510UnitMAC]> { let Latency = 4; } // 64-bit Multiply
+
----------------
64-bit multiplication takes from 4 to 5 cycles. It's usually better to be early than late, so 5 would be a more sensible value here. It also takes up the resource for 2 or 3 cycles, with 2 being a sensible value for `ResourceCycles`.
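A sketch of the suggested values, reusing the `CortexA510UnitMAC` resource from the quoted line and the `ResourceCycles` field named above:

```
// Sketch only: 64-bit multiply with the latency and resource occupancy
// suggested in this comment.
def : WriteRes<WriteIM64, [CortexA510UnitMAC]> {
  let Latency = 5;          // upper end of the 4 to 5 cycle range
  let ResourceCycles = [2]; // MAC unit stays busy for 2 cycles
}
```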
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D152688/new/
https://reviews.llvm.org/D152688