[llvm] [SystemZ] Add a SystemZ specific pre-RA scheduling strategy. (PR #135076)

Wed Nov 19 15:46:55 PST 2025

JonPsson1 wrote:

Some experiments with liveness reduction (not committed to the branch), exploring when to apply it and when to skip it in the case where all uses are live:

- Trying different values of "topcycles", which affects both the HighSUs set and the HasDistToTop margin: increasing it from the default of 2 to 3 and 4 gave worse perf results, 5 and 6 were better again, and 7 was worse again. 5 was the best of these, with a +0.74% average change across all benchmarks compared to 2. Conclusion: 2 seems best.

- Replacing HasDistToTop with "SubsumedByRemLat". The idea is to be more precise and to treat regions containing long-latency instructions differently, instead of just counting the number of instructions remaining as a crude margin to the top. Looking at SU and its (closest) data successor as a unit, there is the remaining latency of the (other) unscheduled nodes to consider.
```
           |   D  |
  Succ --> SU
  Succ ---------> SU
       ------------> RemLat
Placing SU closer to Succ means D more (decoding) cycles are added to SU.
```  

This is computed as "the decoding cycles for scheduling SU next, plus its latency, are less than the remaining latency of the successor":

`NumLeft / IssueWidth + SU->Latency < remaining latency of (closest) data successor`
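
A minimal sketch of that check, to make the inequality concrete. It is not the code from the experiment: `SchedNode`, `subsumedByRemLat` and the parameter names are hypothetical stand-ins, and it assumes that `NumLeft` counts the instructions still to be scheduled and that the division is integer division.

```cpp
#include <cassert>

// Hypothetical stand-in for the one scheduling-unit field the check needs.
// This is not LLVM's SUnit.
struct SchedNode {
  unsigned Latency; // latency of the node itself, in cycles
};

// Sketch of the "SubsumedByRemLat" test under the assumptions stated above.
//   NumLeft    - instructions still to be scheduled in the region (assumed)
//   IssueWidth - instructions decoded per cycle
//   SuccRemLat - remaining latency of the (closest) data successor
// Returns true if the decoding cycles for scheduling SU next, plus the
// latency of SU, are strictly less than the successor's remaining latency.
bool subsumedByRemLat(const SchedNode &SU, unsigned NumLeft,
                      unsigned IssueWidth, unsigned SuccRemLat) {
  assert(IssueWidth > 0 && "issue width must be positive");
  unsigned DecodeCycles = NumLeft / IssueWidth; // integer division
  return DecodeCycles + SU.Latency < SuccRemLat;
}
```

For example, with 6 instructions left, an issue width of 3 and a successor remaining latency of 4, an SU with latency 1 passes the check (2 + 1 < 4), while an SU with latency 2 does not (2 + 2 is not less than 4).
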
```
Counting the number of spill/reload (and copy) instructions in SPEC output:

                                 main   "With tiny regions limit"
Spill|Reload   :               532477               528520    -3957        // -0.74%
Copies         :               886962               886644     -318

                                 main   "HighSUs instead of TinyRegion"    (performance ref below)
Spill|Reload   :               532477               530071    -2406
Copies         :               886962               886732     -230

                                 main   "HighSUs, but with Pres/Subsumes"  (similar performance)
Spill|Reload   :               532477               528575    -3902
Copies         :               886962               887234     +272

                                 main   "HighSUs, but with Subsumes only"  (slightly worse perf)
Spill|Reload   :               532477               528468    -4009
Copies         :               886962               886944      -18

                                 main   "No liveness reduction"            (slightly worse perf)
Spill|Reload   :               532477               532461      -16
Copies         :               886962               886785     -177

```
Conclusion: the liveness reduction heuristic reduces spilling a bit, but performance does not track the spill count alone, which shows that it is important to also consider other things, such as latencies, while helping liveness. SubsumedByRemLat is more involved but doesn't give any performance improvement.

- Another idea was to skip HasDistToTop and rely only on HighSUs for the top margin. I tried various values of TopCycles (2 - 11) and found that values around 6 or 7 worked fairly well, with similar perf results (within 0.1% on average). Conclusion: using a TopCycles of 6 as the default could work, which would eliminate the computation of HasDistToTop. It also shows that the heuristic is in fact mostly useful if used only in regions with at least a few dozen instructions.

```
                                 main   "No HasDistToTop, TopCycles=6"     (similar performance)
Spill|Reload   :               532477               531009    -1468
Copies         :               886962               886822     -140
```
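
The absolute deltas in the tables above can be put in relative terms with a small helper. This is only an illustrative sketch; `relChangePct` is a hypothetical name and not part of the patch.

```cpp
// Relative change of Patched versus Base, in percent.
// A negative result means fewer instructions than on main.
double relChangePct(long Base, long Patched) {
  return 100.0 * static_cast<double>(Patched - Base) /
         static_cast<double>(Base);
}
```

For instance, `relChangePct(532477, 531009)` gives the relative spill/reload change for the "No HasDistToTop, TopCycles=6" row.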

https://github.com/llvm/llvm-project/pull/135076