[PATCH] D152688: [Aarch64] Add Cortex-A510 specific scheduling

Wed Jun 14 04:33:15 PDT 2023

harviniriawan added a comment.

In D152688#4415494 <https://reviews.llvm.org/D152688#4415494>, @evandro wrote:

> Can you please share some performance numbers showing that this scheduling model is beneficial most of the time?

There's no performance degradation on SPEC2k17, and we get significant uplift on some SVE-specific workloads (about 4-5%) on the LITTLE core

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:20
+  let IssueWidth = 3;         // It dual-issues under most circumstances
+  let LoadLatency = 3;        // Cycles for loads to access the cache. The
+                              // optimisation guide shows that most loads have
----------------
evandro wrote:
> Integer loads take 2 cycles and that's the value that would be more sensible to use here.  The comment should be improved too.
I think 3 is a good compromise, as in the future I'd like this to be the default scheduling for -mcpu=generic

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:56
+// instructions, which can mostly be dual-issued; that's why for now we model
+// them with 2 resources.
+def CortexA510UnitVALU0  : ProcResource<1> { let BufferSize = 0; } // SIMD/FP/SVE ALU0
----------------
evandro wrote:
> You modeled the optional 128-bit wide VPU.  In my opinion, it would be better to model for the worst case, the narrower 64-bit wide VPU.  Thus code scheduled for the narrow VPU will typically run better on the wide VPU.
I think we'd like to make the config that is preferred (2x128 bits) to be the default as SVE VL is a minimum of 128 bits.

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:67
+// (the software optimisation guide lists latencies taking into account
+// typical forwarding paths).
+def : WriteRes<WriteImm, [CortexA510UnitALU]> { let Latency = 1; }    // MOVN, MOVZ
----------------
evandro wrote:
> I encourage you to model the forwarding paths as well, at least for those instructions whose throughput is 1 or less.
I think it's best to let the hardware takes care of this, as it is internally able to do so

================
Comment at: llvm/lib/Target/AArch64/AArch64SchedA510.td:663
+                        "^(SRH|SUQ|UQ|USQ|URH)ADD_ZPmZ_(UNDEF_)?[BHSD]$",
+                        "^(UQSUB|UQSUBR)_ZPmZ_(UNDEF_)?[BHSD]$")>;
+
----------------
rjj wrote:
> Same as above.
> 
> There are a few other places where the UNDEF patterns don't seem to exist, but before going through them, any reason why you've included these extra UNDEFs here?
I added them because I see from the debug output of machine-scheduler that sometimes it will generate pseudoinstructions with UNDEF* on some benchmarks, which then will pick up the wrong latency information without adding them into the regex. Maybe in the future we should make UNDEF and PSEUDO to always be at the end of the instruction for easier regex matching.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152688/new/

https://reviews.llvm.org/D152688