[llvm] [MachineScheduler] Experimental option to partially disable pre-ra scheduling. (PR #90181)

Jonas Paulsson via llvm-commits llvm-commits at lists.llvm.org
Tue May 21 03:14:52 PDT 2024


JonPsson1 wrote:

@michaelmaitland Thank you for taking the time to look at this! Interesting that on RISC-V this does not seem to matter much, at least in general. It would then be interesting to see if any individual file got better on RISC-V by disabling the scheduler (-mllvm -enable-misched=false). It would be most interesting to see this per scheduled region, or even per function, but any big regression should also be visible in a per-file comparison.

I fully agree that this option is only temporary, if even that. Instead of pushing this patch as it is, I will soon add some experimental pre-RA heuristics here that should, at least in theory, be better for an out-of-order target. I am hoping that these will be evaluated on other OOO targets, and if there indeed is common ground, some new OOO strategy could be developed.

> Have you looked at the instructions that are leading to the performance differences? Are there certain kinds of instructions or instruction sequences that are getting impacted?

That's a good question, but I don't have a simple answer. One observation is that before scheduling the instructions are in a fairly reasonable order, with long chains of def-use-def-use..., but the scheduler then mixes those chains up so that many more registers are live in parallel than before. The GenericScheduler only ever looks at individual SUnits when making these decisions and does not consider anything like sub-trees or groups of instructions. It will never do any "lookahead" to finish off a sequence of instructions where it could reduce liveness; instead it may keep adding ILP even though the heuristic is nominally about register pressure. So I would say that in this worst case (cactus) it is simply adding too much ILP in the huge block, while the unscheduled instructions were in a pretty good order to begin with.
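
To make the liveness point concrete, here is a small standalone toy (not LLVM code; the instruction representation and the chains are made up for illustration) that counts how many values are simultaneously live for the same two def-use chains in chained vs. interleaved order:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy instruction: an optional defined value and a list of used values.
struct Inst { std::string Def; std::vector<std::string> Uses; };

// Maximum number of simultaneously live values for a given linear order,
// assuming a value is live from its def to its last use.
static unsigned maxLive(const std::vector<Inst> &Order) {
  std::map<std::string, unsigned> LastUse;
  for (unsigned i = 0; i < Order.size(); ++i)
    for (const std::string &U : Order[i].Uses)
      LastUse[U] = i;
  unsigned Live = 0, Max = 0;
  for (unsigned i = 0; i < Order.size(); ++i) {
    if (!Order[i].Def.empty())
      ++Live;                        // value defined here becomes live
    Max = std::max(Max, Live);
    for (const std::string &U : Order[i].Uses)
      if (LastUse[U] == i)
        --Live;                      // last use: value dies
  }
  return Max;
}

int main() {
  // Two independent chains: load; compute; store (empty Def = no result).
  std::vector<Inst> Chained = {
      {"a0", {}}, {"a1", {"a0"}}, {"", {"a1"}},   // chain A
      {"b0", {}}, {"b1", {"b0"}}, {"", {"b1"}}};  // chain B
  std::vector<Inst> Interleaved = {
      {"a0", {}}, {"b0", {}}, {"a1", {"a0"}},
      {"b1", {"b0"}}, {"", {"a1"}}, {"", {"b1"}}};
  std::cout << "chained max live:     " << maxLive(Chained) << "\n";     // 2
  std::cout << "interleaved max live: " << maxLive(Interleaved) << "\n"; // 3
}
```

Scale that up to a huge block with many such chains and the interleaved order is what the register allocator ends up having to spill for.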

> Have you looked at what heuristics are causing the ordering that leads to the spills?

For one thing, it looks at the *input* order of the instructions to determine which register pressure set is most important to keep track of during scheduling. In this FP-intensive region, the GPRs are incorrectly deemed most important, and so the scheduler causes a lot of FP spills. But as said earlier, fixing this alone did not help much.
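
As a rough model of what I mean (simplified; this is not the actual RegPressureTracker code, and the sets, limits and numbers below are invented), the "critical" pressure set ends up being picked from the pressure observed while walking the *input* order of the region:

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <vector>

// Two made-up pressure sets with made-up limits.
enum PSet { GPR, FPR, NumPSets };
constexpr std::array<int, NumPSets> Limit = {16, 32};

// Net pressure change per instruction, per set (illustrative numbers only).
using Delta = std::array<int, NumPSets>;

// Walk the region in *input* order, track the maximum pressure per set, and
// call the set with the largest excess over its limit the "critical" one.
static PSet pickCriticalSet(const std::vector<Delta> &Region) {
  std::array<int, NumPSets> Cur = {}, Max = {};
  for (const Delta &D : Region)
    for (int S = 0; S < NumPSets; ++S) {
      Cur[S] += D[S];
      Max[S] = std::max(Max[S], Cur[S]);
    }
  int Best = 0;
  for (int S = 1; S < NumPSets; ++S)
    if (Max[S] - Limit[S] > Max[Best] - Limit[Best])
      Best = S;
  return static_cast<PSet>(Best);
}

int main() {
  // An FP-heavy region whose input order also keeps many GPRs (addresses,
  // induction variables) live: the GPR excess ends up larger than the FPR
  // excess, so GPR is picked even though the spills that hurt are FP.
  std::vector<Delta> Region(40, Delta{+1, +1});
  std::cout << (pickCriticalSet(Region) == GPR ? "GPR" : "FPR")
            << " deemed critical\n";
}
```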

Maybe the real problem is that GenericScheduler is built around cycle-exact scheduling (with the Available/Pending queues) and resource balancing. The register pressure heuristics seem to be nice add-ons on top of that, but they do not do the full job. Therefore, I am thinking of a different scheduler aimed more at OOO targets, where register pressure is the first priority, followed by some ILP / latency optimization as long as it does not cause any spilling.
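
Just to sketch the priority order I have in mind (invented names; this is not a patch and not the real GenericScheduler::SchedCandidate):

```cpp
#include <iostream>
#include <tuple>

// Hypothetical per-candidate summary for this sketch.
struct Candidate {
  int PressureExcess;   // how much scheduling this now raises the critical set
  int LiveRangesClosed; // how many live ranges it would close
  int Height;           // latency height, only used as a late tie-breaker
};

// Register pressure first, then closing live ranges, and only then latency/ILP.
static bool isBetter(const Candidate &A, const Candidate &B) {
  return std::make_tuple(A.PressureExcess, -A.LiveRangesClosed, -A.Height) <
         std::make_tuple(B.PressureExcess, -B.LiveRangesClosed, -B.Height);
}

int main() {
  Candidate CloseLiveRange{/*PressureExcess=*/0, /*LiveRangesClosed=*/1, /*Height=*/2};
  Candidate CriticalPath{/*PressureExcess=*/1, /*LiveRangesClosed=*/0, /*Height=*/10};
  // The pressure-friendly candidate wins even though the other one sits on
  // the critical path; latency only matters once pressure is under control.
  std::cout << (isBetter(CloseLiveRange, CriticalPath) ? "close live range"
                                                       : "critical path")
            << "\n";
}
```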

> How many instructions are in the Available queue during scheduling of the larger regions? Is there opportunity to refine your SchedModel so that the Available and Pending are better populated?

As I will post here soon, one part of my experiments has been to not use the Pending queue at all (unless Available grows huge). The motivation is that for a target that is not cycle-exact it doesn't matter if e.g. a load has a latency that makes it not fit in the current cycle, so it can be scheduled directly instead, closing the live range sooner.
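
Roughly along these lines (a self-contained sketch with stand-in types, not the actual SchedBoundary::releaseNode, and the threshold is arbitrary):

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for the real LLVM types, just to make the sketch compile.
struct SUnit { unsigned NodeNum = 0; };
struct ReadyQueue {
  std::vector<SUnit *> Q;
  void push(SUnit *SU) { Q.push_back(SU); }
  std::size_t size() const { return Q.size(); }
};

// Release every node straight into Available regardless of its ready cycle,
// and only use Pending as an overflow when Available grows huge.
struct Boundary {
  ReadyQueue Available, Pending;
  static constexpr std::size_t MaxAvailable = 32; // arbitrary threshold

  void releaseNode(SUnit *SU, unsigned /*ReadyCycle*/) {
    // On an OOO core the exact ready cycle matters less, so don't hold the
    // node back just because it would not issue in the current cycle.
    if (Available.size() < MaxAvailable)
      Available.push(SU);
    else
      Pending.push(SU); // overflow only, to keep the heuristics cheap
  }
};

int main() {
  Boundary Zone;
  SUnit Load;
  Zone.releaseNode(&Load, /*ReadyCycle=*/7); // goes straight to Available
  return Zone.Available.size() == 1 ? 0 : 1;
}
```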

https://github.com/llvm/llvm-project/pull/90181
