[llvm] [SystemZ] Add a SystemZ specific pre-RA scheduling strategy. (PR #135076)

Fri Aug 29 01:08:34 PDT 2025

JonPsson1 wrote:

> As to latency - common code uses logic in shouldReduceLatency based on critical path length to decide whether or not latency is important right now. What is the difference between that common code method and the custom check you're implementing here?

**shouldReduceLatency()** can give a decision before each picked SU based on the remaining critical path in the given moment. This is based on the current cycle and remaining latency. The **HasDataSequences** I implemented looks at a region before scheduling and then triggers latency reduction across the entire region - before each picked node.

I have generally tried to move away from the cycle-based scheduling heuristics used by GenericScheduler as they mostly make exact sense for an in-order target during pre-RA scheduling, I'd say. I could however not really make the argument for one or the other except by trying it out, which I did:

First, I examined how many times in SystemZPreRASchedStrategy::tryCandidate() at the point of interest (below liveness reduction) these would yield "true":

```
Total                 : 4.8M queries
shouldReduceLatency() : 3M   
HasDataSequence       : 200K 
HasDataSequence && shouldReduceLatency: 42K  
```

In other words, 4.8M times (note that this number includes all the iterations across the Available queue) would be "always true", barred the initial check to exclude some very "wide" DAGs. This is what was done before improving this recently. shouldReduceLatency() cuts that almost in half, while HasDataSequence is relatively rare.

When I noticed a regression after returning to this project on lbm, I tried disabling this latency heuristic alltogether, and saw these results:

```
patch <> patch without latency heuristic
Improvements:
0.965: f519.lbm_r 
0.976: i505.mcf_r 
0.981: f544.nab_r 
Regressions:
1.052: f508.namd_r 
1.051: f507.cactuBSSN_r 
1.026: f526.blender_r 
```

The lbm problem went away, but other benchmarks suffered.

After working with this and adding the (HasDataSequences || Rem.IsAcyclicLatencyLimited) guard (per the patch posted here currently), I see these numbers:

```
main <> patch
Improvements:
0.885: f507.cactuBSSN_r 
0.963: f544.nab_r 
0.974: f526.blender_r 
0.987: f508.namd_r 
0.990: i531.deepsjeng_r 
0.998: i505.mcf_r 
Regressions:
1.007: i523.xalancbmk_r 
1.004: f519.lbm_r

```
Now, back to the experiment with shouldReduceLatency(). I tried two changes: using shouldReduceLatency() instead of HasDataSequences, or both combined like (HasDataSequences && shouldReduceLatency()).

There is a slight (+2000 spills/reloads) increase in spilling with shouldReduceLatency(), which is sort of as expected.

Performancewise, it seems to e.g. handle cactus equally well, but there are slight losses on two benchmarks:

```
patch <> patch using shouldReduceLatecy()
Regressions:
1.019: f526.blender_r 
1.010: i500.perlbench_r 
1.008: i505.mcf_r 
```

```
patch <> patch using (HasDataSequences && shouldReduceLatency())
Regressions:
1.028: f526.blender_r 
1.019: i500.perlbench_r 
```

Given the results, it seems better to schedule more agressively for latency throughout the region than to do it per the cycle based critical path heuristic - if done only on the regions where this matters. I get some confidence in HasDataSequence as it gives better performance with a lot less involvement, and also better than the combination of the two.


```

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index a0ef475..c21bb33 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1224,7 +1224,7 @@ protected:
   void traceCandidate(const SchedCandidate &Cand);
 #endif
 
-private:
+protected:
   bool shouldReduceLatency(const CandPolicy &Policy, SchedBoundary &CurrZone,
                            bool ComputeRemLatency, unsigned &RemLatency) const;
 };
diff --git a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
index a3df7fc..dff2700 100644
--- a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
@@ -344,7 +344,14 @@ bool SystemZPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     // Don't extend the scheduled latency in regions with many nodes in
     // simple data sequences, or for (single block loop) regions that are
     // acyclically (within a single loop iteration) latency limited.
-    if ((HasDataSequences || Rem.IsAcyclicLatencyLimited) &&
+
+    // EXPERIMENT: Use GenericSchedulerBase::shouldReduceLatency() instead of HasDataSequences
+    CandPolicy P;
+    getRemLat(Zone);
+    bool GenericShouldReduceLat = shouldReduceLatency(P, *Zone, false, RemLat);
+    if (((HasDataSequences && GenericShouldReduceLat) || Rem.IsAcyclicLatencyLimited) &&
+ // if ((GenericShouldReduceLat || Rem.IsAcyclicLatencyLimited) &&
+ // if ((HasDataSequences || Rem.IsAcyclicLatencyLimited) &&
         TryCand.SU->getHeight() != Cand.SU->getHeight() &&
         (std::max(TryCand.SU->getHeight(), Cand.SU->getHeight()) >
          Zone->getScheduledLatency())) {
```


https://github.com/llvm/llvm-project/pull/135076