[llvm] [AArch64] Fix throughput of 64-bit SVE gather loads (PR #168572)
Cullen Rhodes via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 27 01:59:25 PST 2025
c-rhodes wrote:
I had a closer look at this and understand better now. I checked with the CPU folks and these instructions are 4 mops: a pair of loads and a pair of FMOVs. The throughput (4/5) in the SWOG is correct, but the vector pipes are missing from the utilized pipelines. If you look at the SWOGs of the other Neoverse cores, they all utilize the vector pipes.
Throughput is calculated here:
https://github.com/llvm/llvm-project/blob/f8eca64a2820553ffc22c58ac39c2e5c14888e61/llvm/lib/MC/MCSchedule.cpp#L98
so, taking the write you added as an example:
```
def N3Write_6c_2GL : SchedWriteRes<[N3UnitL, N3UnitGL]> {
let Latency = 6;
let NumMicroOps = 4;
let ReleaseAtCycles = [3, 5];
}
```
it's roughly doing the following to get the right throughput:
```
rthroughput=1.0 / min(throughput=NumUnits / ReleaseAtCycle for proc_res in [N3UnitL, N3UnitGL])
rthroughput=1.0 / min(3/3=1, 4/5=0.8)
=1.0 / min(1, 0.8)
=1.0 / 0.8
=1.25
```
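The calculation above can be sketched as a small Python function mirroring the `min`-over-resources logic in `MCSchedModel::getReciprocalThroughput` (the unit counts 3 and 4 are taken from the worked numbers above, not read out of the real N3 model):

```python
# Sketch of the reciprocal-throughput calculation in MCSchedule.cpp:
# per-resource throughput is NumUnits / ReleaseAtCycle, the write's
# throughput is the minimum over its resources, and rthroughput is
# the reciprocal of that.
def reciprocal_throughput(resources):
    """resources: list of (num_units, release_at_cycle) pairs."""
    throughput = min(units / cycles for units, cycles in resources)
    return 1.0 / throughput

# N3UnitL: 3 units busy for 3 cycles; N3UnitGL: 4 units busy for 5 cycles
# (numbers from the worked example above).
print(reciprocal_throughput([(3, 3), (4, 5)]))  # -> 1.25
```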
I can see it's not possible to get the correct throughput with the existing resources: the maximum number of units across all resources is 2, so getting rthroughput=1.25 would mean 1.0 / (2/2.5), i.e. a fractional ReleaseAtCycles, which isn't possible.
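A quick brute-force check makes the constraint concrete (plain Python, purely illustrative): a per-resource throughput of exactly 0.8 means ReleaseAtCycle = 1.25 * NumUnits, so NumUnits must be a multiple of 4 for ReleaseAtCycle to stay an integer.

```python
# Find integer (NumUnits, ReleaseAtCycle) pairs whose throughput
# NumUnits / ReleaseAtCycle equals exactly 4/5, i.e. rthroughput = 1.25.
def solutions(max_units, max_cycles):
    return [(u, c)
            for u in range(1, max_units + 1)
            for c in range(1, max_cycles + 1)
            if u * 5 == c * 4]  # u / c == 4 / 5, checked without floats

print(solutions(2, 20))  # no solution with at most 2 units: []
print(solutions(4, 20))  # smallest solution: [(4, 5)]
```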
So ultimately a resource with 4 units is required. I did have a look and there is an alternative that would work with the existing resources:
```
diff --git a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
index 0b65a5f6b1e2..2006e69271ab 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedNeoverseN3.td
@@ -49,6 +49,11 @@ def N3UnitM : ProcResGroup<[N3UnitM0, N3UnitM1]>;
 def N3UnitL : ProcResGroup<[N3UnitL01, N3UnitL2]>;
 def N3UnitI : ProcResGroup<[N3UnitS, N3UnitM0, N3UnitM1]>;
+def N3UnitVL : ProcResGroup<[N3UnitL01, N3UnitV0, N3UnitV1]>;
+// Unused group to fix: "error: proc resource group overlaps with N3UnitVL but
+// no supergroup contains both."
+def : ProcResGroup<[N3UnitL01, N3UnitL2, N3UnitV0, N3UnitV1]>;
+
 //===----------------------------------------------------------------------===//
 def : ReadAdvance<ReadI, 0>;
@@ -366,6 +371,12 @@ def N3Write_8c_4V : SchedWriteRes<[N3UnitV, N3UnitV, N3UnitV, N3UnitV]> {
   let NumMicroOps = 4;
 }
+def N3Write_6c_2L01_2V : SchedWriteRes<[N3UnitVL]> {
+  let Latency = 6;
+  let NumMicroOps = 4;
+  let ReleaseAtCycles = [5];
+}
+
 //===----------------------------------------------------------------------===//
 // Define generic 6 micro-op types
@@ -2270,8 +2281,8 @@ def : InstRW<[N3Write_7c_4L], (instregex "^LDNT1[BHW]_ZZR_S$",
                                          "^LDNT1S[BH]_ZZR_S$")>;
 // Non temporal gather load, vector + scalar 64-bit element size
-def : InstRW<[N3Write_6c_2L], (instregex "^LDNT1S?[BHW]_ZZR_D$")>;
-def : InstRW<[N3Write_6c_2L], (instrs LDNT1D_ZZR_D)>;
+def : InstRW<[N3Write_6c_2L01_2V], (instregex "^LDNT1S?[BHW]_ZZR_D$")>;
+def : InstRW<[N3Write_6c_2L01_2V], (instrs LDNT1D_ZZR_D)>;
 // Contiguous first faulting load, scalar + scalar
 def : InstRW<[N3Write_6c_1L], (instregex "^LDFF1[BHWD]$",
```
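Assuming N3UnitL01 itself expands to two underlying load units (which is what makes N3UnitVL a 4-unit group, consistent with the 4-unit requirement above), the alternative reaches the same reciprocal throughput. A quick check, again illustrative Python rather than LLVM code:

```python
# N3UnitVL = ProcResGroup<[N3UnitL01, N3UnitV0, N3UnitV1]>.
# Assumption: N3UnitL01 covers two load units, so the group has 4 units.
num_units = 4          # assumed: 2x L01 + V0 + V1
release_at_cycle = 5   # from the ReleaseAtCycles = [5] above
rthroughput = 1.0 / (num_units / release_at_cycle)
print(rthroughput)  # -> 1.25
```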
It's not perfect, but I think it's a bit more constrained and will at least cause less churn in the tests. Not sure if this has been considered before or what others think.
As an aside, it would be good if we could just explicitly set the throughput where we can't realistically model it, instead of having to hack our way to it and potentially confusing people looking at this in the future into thinking it's rooted in reality. Not sure if that's even possible, but I think the least we could do today is make it clear in such cases and strip the write back to the absolute bare minimum required to get the right value. I did see V#UnitFlg when reviewing the Neoverse V3 model recently and was a bit confused trying to work out where it came from by looking at the SWOG, until I looked at previous PRs.
https://github.com/llvm/llvm-project/pull/168572