[llvm-dev] mischeduler (pre-RA) experiments

Thu Nov 23 02:53:19 PST 2017

Hi,

I have been experimenting for a while with the tryCandidate() method of
the pre-RA mischeduler. I have by chance found some parameters that give
quite good results on benchmarks on SystemZ (on average 1% improvement,
some improvements of several percent and very few regressions).
Basically, I add a "latency heuristic boost" just above the processor
resources checking:

tryCandidate() {
    ...

    // Avoid increasing the max pressure of the entire region.
    if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.CurrentMax,
         Cand.RPDelta.CurrentMax, TryCand, Cand, RegMax, TRI, DAG->MF))
      return;

    /// INSERTION POINT

    ...
}

I had started to experiment with adding tryLatency() in various places,
and found this to be the best spot for SystemZ/SPEC-2006. This gave
noticeable improvements immediately that were too good to ignore, so I
started investigating the regressions that of course also showed up.
Eventually, after many iterations, I came up with a combined heuristic
that reads:

if (((TryCand.Latency >= 7 && "Longest latency of any SU in DAG" < 15) ||
      "Number of SUnits in DAG" > 180)
      &&
      tryLatency(TryCand, Cand, *Zone))
         return;

In English: do tryLatency either if the latency of the candidate is >= 7
and the DAG has no really long latency SUs (lat >= 15), or alternatively
always if the DAG is really big (> 180 SUnits).
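
The same condition written as a standalone predicate may be easier to
reason about. This is only a sketch: DagStats, shouldTryLatency and the
field names are hypothetical stand-ins for the quoted placeholders above,
not existing LLVM API. The constants are the ones from the heuristic.

```cpp
// Hypothetical per-region summary; not part of LLVM's scheduler API.
struct DagStats {
  unsigned MaxSULatency; // longest latency of any SU in the DAG
  unsigned NumSUnits;    // number of SUnits in the DAG
};

// Returns true when tryLatency() should be attempted for a candidate
// with the given latency.
inline bool shouldTryLatency(unsigned CandLatency, const DagStats &Stats) {
  // Case 1: a fairly long candidate in a DAG without very long latencies.
  const bool LongCandInShortDag = CandLatency >= 7 && Stats.MaxSULatency < 15;
  // Case 2: the DAG is big, regardless of latencies.
  const bool BigDag = Stats.NumSUnits > 180;
  return LongCandInShortDag || BigDag;
}
```

In tryCandidate() the guard would then read roughly
"if (shouldTryLatency(...) && tryLatency(TryCand, Cand, *Zone)) return;",
with the statistics computed once per scheduling region.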

I am now looking for opinions on what to do next with this.

pros:

- Clearly beneficial on benchmarks.

- All the register pressure heuristics have been run, so there *should* 
not be increased spilling.

- This is mainly giving latency priority over resource balancing, which
seems harmless if it proves to be beneficial.

- This gives higher ILP (register usage) and gives the SystemZ post-RA 
scheduler more freedom to improve decoder grouping etc.

cons:

- I am not sure whether it is acceptable to have limits like this to
control scheduling. Code generation can change in the years to come, and
who knows if those limits will still be safe then...

On the other hand, as I have been rebasing just recently, results have
varied a bit but remained consistently beneficial. I have also managed to
remove the sharp limit and improve on this concern a bit by having a zone
from 170-190 that makes the change more gradual as the DAG becomes "big".
The values of 7 and 15 could as well be 6/8 or 30, so it's not really
hypersensitive either at the moment, I'd say.
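
How the 170-190 zone is realized is not spelled out here, so the
following is purely an assumed illustration of one way to soften a sharp
size limit: the candidate latency required before tryLatency() is tried
ramps linearly from 7 down to 0 across the zone. The function name and
the linear ramp are assumptions, and the sketch covers only the
candidate-latency threshold, not the DAG-wide latency cap.

```cpp
// Hypothetical helper: minimum candidate latency required before
// tryLatency() is attempted, as a function of DAG size.
inline unsigned minCandLatency(unsigned NumSUnits) {
  const unsigned ZoneBegin = 170;   // below this, the full threshold applies
  const unsigned ZoneEnd = 190;     // at or above this, always try latency
  const unsigned BaseThreshold = 7; // threshold for small DAGs

  if (NumSUnits <= ZoneBegin)
    return BaseThreshold;
  if (NumSUnits >= ZoneEnd)
    return 0;
  // Linear ramp (integer division) from BaseThreshold down to 0.
  return BaseThreshold * (ZoneEnd - NumSUnits) / (ZoneEnd - ZoneBegin);
}
```

With such a ramp, a DAG of 180 SUnits would need a candidate latency of
at least 3 rather than flipping abruptly between 7 and "always".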

I don't know of any better way of getting these experimental results.
Of course it would be nice to know more and be able to say "why" this
works, but that is hard given the highly out-of-order nature of SystemZ
in addition to the regalloc effects etc.

Then there is the matter of implementation - could this become some kind 
of "latencyBoost" hook in the generic scheduler (after all, other 
targets might benefit also), or would SystemZ have to derive its own 
SchedStrategy (which isn't very nice if you just want to change one 
thing and still get future improvements of common code)?
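
To make the hook option concrete, here is a sketch on mock classes
rather than on the real GenericScheduler. All names (GenericStrategyMock,
SystemZStrategyMock, latencyBoost and its parameters) are hypothetical.
The point is only that a default-off virtual hook lets a target opt in
while still inheriting the rest of the common strategy.

```cpp
// Mock standing in for the common scheduling strategy.
class GenericStrategyMock {
public:
  virtual ~GenericStrategyMock() = default;

  // Hypothetical hook: return true to run the latency check early,
  // right after the region max-pressure check.
  virtual bool latencyBoost(unsigned /*CandLatency*/,
                            unsigned /*MaxSULatency*/,
                            unsigned /*NumSUnits*/) const {
    return false; // default: keep the existing ordering of heuristics
  }
};

// Mock target strategy that overrides only the hook.
class SystemZStrategyMock : public GenericStrategyMock {
public:
  bool latencyBoost(unsigned CandLatency, unsigned MaxSULatency,
                    unsigned NumSUnits) const override {
    return (CandLatency >= 7 && MaxSULatency < 15) || NumSUnits > 180;
  }
};
```

The common code would consult latencyBoost() at the insertion point and
call tryLatency() only when it returns true, so targets that do not
override the hook see no change.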

I would appreciate any help and suggestions!

Jonas