[PATCH] D152688: [AArch64] Add Cortex-A510 specific scheduling

Evandro Menezes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 21 08:54:58 PDT 2023


evandro added a comment.

How about some test results, please?



================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:20
+  let IssueWidth = 3;         // It dual-issues under most circumstances
+  let LoadLatency = 3;        // Cycles for loads to access the cache. The
+                              // optimisation guide shows that most loads have
----------------
harviniriawan wrote:
> evandro wrote:
> > Integer loads take 2 cycles and that's the value that would be more sensible to use here.  The comment should be improved too.
> I think 3 is a good compromise, as in the future I'd like this to be the default scheduling model for -mcpu=generic.
Methinks that 3 would be too short in general.
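For reference, `LoadLatency` in the model header is only the default for loads that have no explicit `WriteRes`; individual load classes can still carry their own numbers.  A rough sketch of the two knobs (the resource name `CortexA510UnitLoad` and the latencies below are placeholders, not taken from the patch):

  def CortexA510Model : SchedMachineModel {
    let IssueWidth = 3;         // dual-issue in most cases
    let MicroOpBufferSize = 0;  // in-order core
    let LoadLatency = 2;        // default only, e.g. integer loads hitting the L1
    let CompleteModel = 0;
  }

  // Load classes that genuinely take longer can still say so explicitly:
  def : WriteRes<WriteLD, [CortexA510UnitLoad]> { let Latency = 3; }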


================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:56
+// instructions, which can mostly be dual-issued; that's why for now we model
+// them with 2 resources.
+def CortexA510UnitVALU0  : ProcResource<1> { let BufferSize = 0; } // SIMD/FP/SVE ALU0
----------------
harviniriawan wrote:
> evandro wrote:
> > You modeled the optional 128-bit wide VPU.  In my opinion, it would be better to model for the worst case, the narrower 64-bit wide VPU.  Thus code scheduled for the narrow VPU will typically run better on the wide VPU.
> I think we'd like the preferred configuration (2x128 bits) to be the default, as the SVE VL is a minimum of 128 bits.
The VL may be at least 128 bits, but the implementation could still be 64 bits wide.  Again, it's better to be early than late when scheduling, so it'd be better to model for 64 bits, which would run well on all A510 configurations.
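To illustrate the worst-case approach, one way to express it (the resource name and latency below are illustrative, not the patch's): keep a single VPU resource and have 128-bit operations hold it for two cycles, so code scheduled this way still runs well on the 2x128-bit configuration.

  def CortexA510UnitVALU : ProcResource<1> { let BufferSize = 0; } // 64-bit wide SIMD/FP/SVE ALU

  // 64-bit vector writes use the pipe once; 128-bit writes occupy it twice.
  def : WriteRes<WriteVd, [CortexA510UnitVALU]> { let Latency = 4; }
  def : WriteRes<WriteVq, [CortexA510UnitVALU]> {
    let Latency = 4;
    let ResourceCycles = [2];
  }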


================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:67
+// (the software optimisation guide lists latencies taking into account
+// typical forwarding paths).
+def : WriteRes<WriteImm, [CortexA510UnitALU]> { let Latency = 1; }    // MOVN, MOVZ
----------------
dmgreen wrote:
> harviniriawan wrote:
> > evandro wrote:
> > > I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.
> > I think it's best to let the hardware take care of this, as it is internally able to do so.
> We haven't found the forwarding paths to be super useful in the past. It is hard for them to model what is really going on in the core, and we have often found that they caused more scheduling inaccuracies than they helped with. The scheduler doesn't have a great way of modelling skewing.
That can be said of wide superscalar pipelines.  When the throughput is just 1, it's important to make it easier for the core to digest instructions as quickly as possible.  After all, that's why there are scheduling models.


================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:67
+// (the software optimisation guide lists latencies taking into account
+// typical forwarding paths).
+def : WriteRes<WriteImm, [CortexA510UnitALU]> { let Latency = 1; }    // MOVN, MOVZ
----------------
evandro wrote:
> dmgreen wrote:
> > harviniriawan wrote:
> > > evandro wrote:
> > > > I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.
> > > I think it's best to let the hardware take care of this, as it is internally able to do so.
> > We haven't found the forwarding paths to be super useful in the past. It is hard for them to model what is really going on in the core, and we have often found that they caused more scheduling inaccuracies than they helped with. The scheduler doesn't have a great way of modelling skewing.
> That can be said of wide superscalar pipelines.  When the throughput is just 1, it's important to make it easier for the core to digest instructions as quickly as possible.  After all, that's why there are scheduling models.
I agree that there are cases when they can hurt throughput, but only when it's greater than 1.  The case that I have in mind is that it's better to have, say, a `MUL` execute in parallel with an unrelated `MADD` rather than sequentially with a dependent one.  But that only applies when there is, in this case, more than one multiplier.  If there's only one multiplier, then it's important to schedule close together the instructions whose combined latency is smaller when they are adjacent.
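For what it's worth, the usual way to express that in the model is a `ReadAdvance` on the accumulator operand.  A sketch with made-up latencies and a made-up resource name (`CortexA510UnitMAC`); only `WriteIM32`, `WriteIM64`, and `ReadIMA` are the standard AArch64 scheduling classes:

  def CortexA510UnitMAC : ProcResource<1> { let BufferSize = 0; }  // single multiplier

  def : WriteRes<WriteIM32, [CortexA510UnitMAC]> { let Latency = 3; }
  def : WriteRes<WriteIM64, [CortexA510UnitMAC]> { let Latency = 3; }

  // The accumulator of MADD can be read 2 cycles late when it comes from a
  // preceding multiply, so a dependent MUL -> MADD chain costs roughly
  // 3 + (3 - 2) = 4 cycles instead of 6, and the scheduler will then keep such
  // pairs together when there is only one multiplier to go around.
  def : ReadAdvance<ReadIMA, 2, [WriteIM32, WriteIM64]>;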


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152688/new/

https://reviews.llvm.org/D152688


