[LLVMdev] MI Scheduler Update (was Experimental Evaluation of the Schedulers in LLVM 3.3)

Mon Sep 23 23:10:26 PDT 2013

On Sep 17, 2013, at 11:04 AM, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:

> 1. The SD schedulers significantly impact the spill counts and the execution times for many benchmarks, but the machine instruction (MI) scheduler in 3.3 has very limited impact on both spill counts and execution times. Is this because most of your work on MI did not make it into the 3.3 release?

Ghassan, and anyone else interested in the scheduler:

This is a good time for me to give a thorough update of the MI scheduler. Hopefully that will answer many of your questions.

Some important things changed between the time I introduced the MI scheduler a year ago, and the release of 3.3. The biggest change was loop vectorization, which reduces register pressure and somewhat preschedules loops. Since 3.3 was released, the generic MI scheduler's heuristics were reevaluated in preparation for making it the default for targets without a custom scheduling strategy--more on that later. The source order scheduler was also fixed so that it actually preserves IR order, which is at least closer to source order.

For many benchmarks we've looked at, source order scheduling approaches the lower bound on register pressure--heuristics can only hurt--making it difficult to distinguish between a lucky scheduler and a good scheduler.

It's not surprising that SelectionDAG scheduling with BURR reduces spill code on average. It is fully register pressure aggressive. It gives highest priority to Sethi-Ullman number, which is typically nonsense, but does prevent some of the worst register pressure situations. It then does an expensive check to determine the shortest live range. This is also inaccurate, but on average reduces pressure.
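As a point of reference, here is a minimal sketch of the textbook Sethi-Ullman labeling, assuming a load/store machine where every leaf needs one register. The `Node` type and function name are hypothetical and are not taken from the LLVM sources. The label is exact only for expression trees; on a scheduling DAG with shared values it is just a heuristic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical expression-tree node (not an LLVM type)."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sethi_ullman(node: Node) -> int:
    """Registers needed to evaluate the tree rooted at `node` without spilling."""
    if node.left is None and node.right is None:
        return 1  # assumption: every leaf is loaded into one register
    if node.right is None:
        return sethi_ullman(node.left)   # unary op reuses its operand's register
    if node.left is None:
        return sethi_ullman(node.right)
    l, r = sethi_ullman(node.left), sethi_ullman(node.right)
    # Unequal needs: evaluate the hungrier subtree first, then the other fits.
    # Equal needs: one extra register holds the first result.
    return max(l, r) if l != r else l + 1


# (a + b) * (c + d) versus (a + b) * c
balanced = Node("*", Node("+", Node("a"), Node("b")), Node("+", Node("c"), Node("d")))
skewed = Node("*", Node("+", Node("a"), Node("b")), Node("c"))
```

Under these assumptions the balanced tree needs three registers and the skewed one needs two.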

The reason we switched from BURR to ILP a couple years ago was that although BURR is very aggressive, it is not very smart. Giving highest priority to inaccurate heuristics means generating pathologically bad schedules for some class of code. Regardless of how the programmer wrote the code, or what earlier passes have done, it will reschedule everything, fully serializing dependence chains. At that time, we noticed horrible performance on some crypto benchmarks. We decided to pay a small price in spill code for avoiding worst-case performance. We also realized, after performance analysis, that incrementally tuning these heuristics to avoid test-suite regressions was not leading toward an overall better scheduler for real programs. We decided that, since some targets need an MI-level scheduler anyway, we should redirect efforts into that project.

The high-level design goal of the MI scheduler is to allow subtargets to plug in custom scheduling strategies, while providing a "safe" generic scheduler. The generic scheduler is safe in that it preserves instruction order until it detects a performance problem according to the subtarget's machine model. This is a nice feature. It means that the scheduler should not often introduce a performance problem that did not already exist, and it makes the scheduled code much easier to understand and debug. So the close correlation between source order and MI scheduler is natural. In fact, you'll find that, when scheduling for SandyBridge, the scheduler seldom perturbs the instruction sequence. This is a fundamental departure from the conventional approach of scheduling for out-of-order processors as if they execute in-order.
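The "preserve order unless there is a problem" idea can be illustrated with a toy list scheduler. This is only a sketch under strong assumptions: a single-issue machine, a made-up latency table, and a stall check standing in for "a performance problem according to the machine model". It is not the generic scheduler's actual logic, and all names are hypothetical.

```python
def schedule(instrs, latency):
    """Toy list scheduler that departs from source order only to avoid a stall.

    instrs:  list of (name, deps) in source order; deps name earlier instrs.
    latency: dict mapping name -> cycles until the result is available.
    Returns the instruction names in scheduled order.
    """
    done = {}      # name -> cycle at which its result becomes available
    order = []
    pending = list(instrs)
    cycle = 0

    def earliest(instr):
        # First cycle at which all operands of `instr` are available.
        return max((done[d] for d in instr[1]), default=0)

    while pending:
        # Candidates have all dependencies scheduled; list stays in source order.
        cands = [i for i in pending if all(d in done for d in i[1])]
        # Default choice: first candidate in source order that issues without stalling.
        pick = next((c for c in cands if earliest(c) <= cycle), None)
        if pick is None:
            # Every candidate stalls; take the one ready soonest (ties: source order).
            pick = min(cands, key=earliest)
        start = max(cycle, earliest(pick))
        done[pick[0]] = start + latency[pick[0]]
        order.append(pick[0])
        pending.remove(pick)
        cycle = start + 1
    return order


# Two independent load/use pairs, written in source order.
program = [("x", []), ("y", ["x"]), ("z", []), ("w", ["z"])]
unit_latency = {"x": 1, "y": 1, "z": 1, "w": 1}
slow_loads = {"x": 3, "y": 1, "z": 3, "w": 1}
```

With unit latencies nothing stalls, so the source order comes back unchanged. With three-cycle loads the second load is hoisted above the first use, giving `x, z, y, w`.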

This does raise a difficult challenge of how the scheduler can know when the out-of-order processor is likely to stall. The new machine model has enough information to roughly estimate stalls if a long enough execution trace can be fed through it. However, for very heavily out-of-order processors (Nehalem+) it is extremely rare for acyclic code to saturate any resources. As a cheap, partial solution, the MI scheduler now computes the cyclic critical path, allowing it to estimate.

One major advantage of the MI scheduler is that it models register pressure with almost perfect precision. This is great for analyzing register pressure, but by itself isn't a solution, and greedy heuristics are often unable to solve the problem without backtracking. The difficulty hasn't been thinking of new heuristics and solving individual cases. Rather, it has been finding a strong justification to add cost and complexity to the scheduler.
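To make the notion of pressure concrete, here is a minimal sketch that measures the peak number of simultaneously live values in a straight-line sequence. It assumes a single register class, that every value is defined inside the sequence, and that an operand dying at an instruction frees its register before the result is defined. LLVM's real tracking is considerably more detailed; the function and example names are hypothetical.

```python
def peak_pressure(instrs):
    """Peak number of live values over a schedule.

    instrs: list of (dest, sources) in schedule order; dest may be None
            for instructions that produce no value (e.g. a store).
    """
    # Record the index of each value's last use.
    last_use = {}
    for i, (_, srcs) in enumerate(instrs):
        for s in srcs:
            last_use[s] = i

    live = set()
    peak = 0
    for i, (dest, srcs) in enumerate(instrs):
        # Operands whose last use is this instruction die here.
        for s in srcs:
            if last_use[s] == i:
                live.discard(s)
        # A result is live only if something later uses it.
        if dest is not None and dest in last_use:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


# The same computation, (a + b) * (c + d), in two different orders.
loads_first = [
    ("a", []), ("b", []), ("c", []), ("d", []),
    ("t1", ["a", "b"]), ("t2", ["c", "d"]),
    ("t3", ["t1", "t2"]), (None, ["t3"]),
]
interleaved = [
    ("a", []), ("b", []), ("t1", ["a", "b"]),
    ("c", []), ("d", []), ("t2", ["c", "d"]),
    ("t3", ["t1", "t2"]), (None, ["t3"]),
]
```

Under this model, hoisting all four loads gives a peak of four live values, while the interleaved order peaks at three. Measuring a given order is the easy part; choosing an order that minimizes the peak is the part greedy heuristics struggle with.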

A month ago, Arnold Schwaighofer and I investigated this issue. We didn't do this because spilling was a serious performance problem, but because the performance of the scheduler is annoyingly random when governed by greedy heuristics. If the scheduler always did the right thing, that would simplify performance tracking. We were able to solve each individual case with some combination of heuristics. The most efficient approach I've found so far involves partitioning the DAG into subtrees (see computeDFSResult--I think the implementation of subtree is still somewhat flawed though). We've tried biased scheduling by subtree, computing Sethi-Ullman numbers according to the subtree partition, and tracking live-ins that are reachable from DAG nodes, among other things.

Ultimately, we decided not to enable any of these techniques in the generic scheduler--targets are still free to do what they like. The problem is that there are always cases in which these cheap heuristics do the wrong thing. So, while we could engineer good results for SPEC, we would not be solving the underlying problem of unstable scheduling heuristics. Given the primary goals of reducing compile time and maintaining instruction order unless performance is at stake, the bar for adding heuristics is high. Complicating the heuristics now also means making them harder to understand and improve in the future.

I would like to see a general solution to scheduling for register pressure. I had plenty of ideas for more ad-hoc heuristics within the bounds of list scheduling, but given that we haven't demonstrated the value of simple heuristics, I don't want to pursue anything more complicated. I think better solutions will have to transcend list scheduling. I do like the idea of constraining the DAG prior to scheduling [Touati, "Register Saturation in Superscalar and VLIW Codes", CC 2001], because that entirely separates the problem from list scheduler heuristics. However, I won't be able to justify adding more complexity, beyond list scheduling heuristics, to the LLVM codebase to solve this problem. Work in this area would need to be done as a side project. I don't expect to do any more work on it.

In my next message I'll explain the near-term plans for the scheduler.

-Andy