[llvm] [MachineScheduler] Experimental option to partially disable pre-ra scheduling. (PR #90181)

Fri Jun 14 00:01:40 PDT 2024

JonPsson1 wrote:

I have now tried to boil down the more important parts of my experimental results that were kind of successful on SystemZ to something more generic that would hopefully be of interest to other (OOO) targets.

This is still experimental, and the major features are therefore controlled by the CL options listed below, in order. My question now is whether there is any possibility of collaboration across targets on this. Could this (if used on SystemZ) become a SchedStrategy in MachineScheduler, or does it need to be a SystemZ-specific strategy? See previous comments for the motivation. And as always, any feedback is most welcome.

**-misched-nohazards** (default true)

  Disables cycle hazards so that even if a predecessor is not ready on the cycle, the SU is still put in the Available queue instead of in Pending. The idea is to be able to reduce live ranges by e.g. putting a load immediately before its user.

**-misched-regpress** (default true)

  Attempt to reduce live range overlapping / spilling. My conclusion here (on SystemZ / SPEC) is that the input (unscheduled) order is often already quite good from this perspective, and that the scheduler is most likely to mess this up by moving instructions around. I say this because "general" improvements, like *always* scheduling an (imm-)load before its user, do not seem to help much. What does seem to be beneficial, though, is to compare two SUs: if one decreases pressure while the other increases it (same pressure set), then put the load below.
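The pairwise comparison described above could be sketched roughly as follows. This is only an illustration and not the code in the patch: `CandidateSketch`, its `pressureDelta` map, and `tryCandWinsOnPressure` are hypothetical names, and the sketch assumes bottom-up scheduling where each candidate already knows how it would change the number of live registers per pressure set.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for an SU under consideration (not an LLVM type).
// pressureDelta maps a pressure set id to the change in live registers
// if this unit is scheduled next (negative = closes live ranges).
struct CandidateSketch {
  int id;
  std::map<int, int> pressureDelta;
};

// Returns true if tryCand should be preferred over cand because, in some
// shared pressure set, tryCand decreases pressure while cand increases it.
bool tryCandWinsOnPressure(const CandidateSketch &cand,
                           const CandidateSketch &tryCand) {
  assert(cand.id != tryCand.id && "expected two distinct candidates");
  for (const auto &entry : tryCand.pressureDelta) {
    auto it = cand.pressureDelta.find(entry.first);
    if (it == cand.pressureDelta.end())
      continue; // the two candidates do not touch the same pressure set
    if (entry.second < 0 && it->second > 0)
      return true;
  }
  return false;
}
```

Note that the sketch deliberately returns false when the candidates affect different pressure sets, mirroring the "same pressure set" condition.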

  The DFS trees can also help additionally here by giving a bit of "lookahead". Each SU is (with this patch) mapped to a set of registers used and defined above it in its DFS subtree. If the SU can be scheduled as a unit with its subtree predecessors without causing any new registers to become live, it is done if this will close a live range by scheduling a def of a live register. I have found that subtrees of sizes 3-5 seem to work best.
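As a rough sketch of the subtree check described above (again with hypothetical names, not the patch's actual data structures): `SubtreeSketch` summarizes the registers a DFS subtree reads from outside itself and the registers it defines, and `live` is the current set of live registers while scheduling bottom-up.

```cpp
#include <cassert>
#include <set>

// Hypothetical summary of an SU together with its DFS subtree predecessors.
struct SubtreeSketch {
  unsigned size;           // number of SUs in the subtree
  std::set<unsigned> uses; // registers read from outside the subtree
  std::set<unsigned> defs; // registers defined inside the subtree
};

// True if the subtree may be scheduled as a unit: it is small enough,
// it makes no new register live, and it closes at least one live range
// by scheduling a def of a currently live register.
bool canScheduleSubtreeAsUnit(const SubtreeSketch &tree,
                              const std::set<unsigned> &live,
                              unsigned maxSize) {
  assert(maxSize > 0 && "size limit must be positive");
  if (tree.size == 0 || tree.size > maxSize)
    return false;
  for (unsigned reg : tree.uses)
    if (!live.count(reg))
      return false; // would open a new live range
  for (unsigned reg : tree.defs)
    if (live.count(reg))
      return true; // closes an existing live range
  return false;
}
```

With `maxSize` set to 4 (the default of `-dfs-size`), a three-SU subtree that only reads already-live registers and defines a live one would qualify, while a six-SU subtree would not.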

  Regardless of whether the comparison between Cand and TryCand is done with just the SUnits or also using the DFS subtrees, a set of live registers is used to get the actual current effects. I have not yet tried to merge this with preexisting code.

  I have tried bi-directional scheduling with the idea of scheduling e.g. a store immediately after its defining instruction going top-down, kind of the reverse of the bottom-up "load immediately before use". Interestingly, this was not as successful as expected, so I have reverted to only doing this bottom-up. The stores are now handled in a simpler way: given that source order is typically good for register pressure, and that a store can't just be pushed up in the list due to latency considerations of its predecessors, I have found a less aggressive approach: put the store after its predecessor as seen in the input order. This has the effect of moving up a store if it was for some reason put further down in the input. This is done not just for a memory store but for any instruction that kills a register without defining one.

  Todo: This should probably only kick in when useful, but currently this is done regardless of the current register pressure.

**-misched-dfs** (default true)

  Compute the DFSResult for use per above.

**-dfs-size** (default 4)

  The max size of DFS subtrees that may be scheduled as a unit.

**-misched-heightheur** (default true)

  After the register pressure heuristics, an attempt is made to increase ILP by considering e.g. the height/depth of SUs. The idea is to do this without causing increased spilling, but the balance between these two still has room for improvement.

**-misched-heightifwfac** (default 3)

  A rough heuristic to only do the height heuristic (ILP) on DAGs with a "width factor" less than this value. The idea is to not become too aggressive too easily with this on regions with many SUs that are mostly in parallel. Only if the DAG is quite "high and narrow" should this be done. This could likely be improved.
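The exact definition of the "width factor" is not spelled out here, so the following is only one possible reading, labeled as an assumption: the number of SUs in the region divided by the maximum height of the DAG. The function name and parameters are hypothetical.

```cpp
#include <cassert>

// ASSUMPTION: width factor = numSUs / maxHeight. The patch may compute
// it differently; this only illustrates the "high and narrow" gate.
bool useHeightHeuristic(unsigned numSUs, unsigned maxHeight,
                        unsigned widthFactorLimit) {
  assert(widthFactorLimit > 0 && "limit must be positive");
  if (maxHeight == 0)
    return false; // empty or degenerate DAG
  // numSUs / maxHeight < limit, rewritten to avoid integer division.
  return numSUs < widthFactorLimit * maxHeight;
}
```

Under this reading, with the default limit of 3, a region of 10 SUs and height 8 would get the height heuristic, while a region of 100 mostly parallel SUs and height 5 would not.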

**-misched-ooo**

  Enables this experimental scheduling (if not passed, the patch is NFC).

**-no-ooosched-below**

  If a non-zero value is passed, normal scheduling is done for regions with fewer instructions than this value. This can be interesting as a strategy can have different levels of success on small/huge regions.
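Taken together, the last two options amount to a per-region gate, which could be sketched like this (a hypothetical helper, not code from the patch; the parameters stand for the values of `-misched-ooo` and `-no-ooosched-below`):

```cpp
#include <cassert>

// Decide per region whether the experimental strategy applies.
// noOooBelow == 0 means no size threshold is in effect.
bool useExperimentalStrategy(bool oooEnabled, unsigned regionSize,
                             unsigned noOooBelow) {
  assert(regionSize > 0 && "empty regions are not scheduled");
  if (!oooEnabled)
    return false; // without -misched-ooo the patch is NFC
  if (noOooBelow != 0 && regionSize < noOooBelow)
    return false; // small region: fall back to normal scheduling
  return true;
}
```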

---

Originally, before the machine scheduler got enabled, SystemZ used to run the "Bottom-up register reduction list scheduling" for ISel scheduling. I tried this setup again now but found that doing so does *not* help cactus. Generally, however, across benchmarks, this gives the least spilling for regions of fewer than 100 instructions. On some bigger regions, though (>7500 instructions), "list-burr" actually causes more spilling than "source". So it is not only GenericScheduler that messes up this huge region.

In the range 100 to 1000 instructions, the current behavior ("source" + mischeduler) seems to be beneficial, but it cannot handle huge regions.

This version is not a great improvement on SystemZ compared to just disabling the pre-ra MachineScheduler entirely. I know there is a little room for further performance improvements, but I am not sure that would be enough to motivate a SystemZ-specific scheduling strategy. Ideally, other targets would find this interesting and begin to use/develop it.

@michaelmaitland Did you try to disable machine-scheduler and compare spilling? What happens if you try this (with -misched-ooo -misched-heightheur=false)?

@atrick Any comments on my use of the DFS subtrees?

https://github.com/llvm/llvm-project/pull/90181