[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Thu Oct 4 22:31:08 PDT 2012

On Sep 29, 2012, at 11:37 AM, Evan Cheng <evan.cheng at apple.com> wrote:

> 
> On Sep 29, 2012, at 2:43 AM, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> 
>> Hi,
>> 
>> We are currently working on revising a journal article that describes our work on pre-allocation scheduling using LLVM and have some questions about LLVM's pre-allocation scheduler. The answers to these question will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm.
>> 
>> First, here is a brief description of our work:
>> 
>> We have developed a combinatorial algorithm for balancing instruction-level parallelism (ILP) and register pressure (RP) during pre-allocation scheduling. The algorithm is based on a branch-and-bound (B&B) approach, wherein the objective function is a linear combination of schedule length and register pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as an alternate pre-allocation scheduler. Then we compared the performance of our (B&B) scheduler with that of LLVM's default scheduler on x86 (BURR scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our B&B scheduler significantly improves the performance of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner (http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's rough latencies and Agner's precise latencies:
>> 
>> This work makes two points:
>> 
>> -A B&B algorithm can discover significantly better schedules than a heuristic can do for some larger hard-to-schedule blocks, and if such blocks happen to occur in hot code, their scheduling will have a substantial impact on performance.
>> - A B&B algorithm is generally slower than a heuristic, but it is not a slow as most people think. By applying such an algorithm selectively to the hot blocks that are likely to benefit from it and setting some compile-time budget, a significant performance gain may be achieved with a relatively small increase in compile time.
>> 
>> My questions are:
>> 
>> 1. Our current experimental results are based on LLVM 2.9. We definitely plan on upgrading to the latest LLVM version in our future work, but is there a fundamentally compelling reason for us to upgrade now to 3.1 for the sake of making the above points in the publication? 
> 
> Yes there is. While the pre-allocation scheduler has not had algorithmic changes during the past year it has received minor tweaks which can impact performance. Also note the scheduler is on its way out. I don't know when the article will be published. But it's possible by the time the paper is published, you would be essentially comparing against deprecated technology.

Ghassan,

Evan is right. Your speedups from enhancing the scheduler's latency are probably different with 3.1. But that doesn't really change your two main points that (1) you can further reduce register pressure and (2) your compile time isn't horrible. I think the major heuristic changes in the PreRA scheduler went in before llvm 2.9. OTOH, Prasantha makes a good point that the new register allocator may diminish the impact of a bad schedule.
At the very least, I would check the code being generated by 3.1 for the few benchmarks that  interest you to see how much the baseline changed.

Another thing that you may want to experiment with...

I implemented a register pressure tracking scheduler earlier this year. You can run it yourself on x86-64 with the flag -enable-misched. The current implementation is a platform for experimentation only. I haven't invested any time developing heuristics, which will need to work across a range of interesting benchmarks and subtargets. Others have done some performance analysis with the framework, and may be able to chime in on it's good points or weak points.

This is not based on Sethi-Ullman in any way. There are a number of interesting heuristics that I can think of implementing, but will not be doing so until I can justify adding the cost and complexity.

The current (experimental) MI scheduler implements a 3-level "back-off":

1) Respect the target's register limits at all times.

2) Indentify critical register classes (pressure sets) before scheduling.
   Track pressure within the currently scheduled region.
   Avoid increasing scheduled pressure for critical registers.

3) Avoid exceeding the max pressure of the region prior to scheduling (don't make things locally worse).

Think of these crude heuristics as baseline. This is what can be done easily and cheaply. Anything more sophisticated has to surpass this low bar.

All of the heuristics that I have planned are greedy, and none are sophisticated, but some require precomputing register lineages
(dependence chains that reuse a single register). An interesting twist is that the MI scheduler can alternate between top-up and bottom-down,
which doesn't fundamentally change the problem, but avoids the common cases in which greedy schedulers "get stuck".

Here's my plan for the near future:

- SpillCost: Map register units onto a spill cost that is more meaningful for heuristics.

- Pressure Query: (compile time) Redesign the pressure tracker to summarize information at the instruction level for fast queries during scheduling.

- Pressure Range: Before scheduling, compute the high pressure region as a range of instructions. If the scheduler is not currently under pressure, prioritize instructions from within the range.

- Register Lineages: Before scheduling, use a heuristic to select desirable lineages. Select the longest lineage from the queue. After scheduling an instruction, look at the next instruction in the lineage. If it has an unscheduled operand, mark that operand's lineage as pending, and priortize the head of that lineage. This solves some interesting cases where a greedy scheduler is normally unable to choose among a set of identical looking instructions by knowing how their dependence chain relates to any already scheduled instructions.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20121004/8ed68dd4/attachment.html>