[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler
Ghassan Shobaki
ghassan_shobaki at yahoo.com
Sat Sep 29 02:43:04 PDT 2012
Hi,
We are currently working on revising a journal article that describes
our work on pre-allocation scheduling using LLVM and have some questions about LLVM's pre-allocation scheduler. The answers to these question will help us better document and analyze the results of our benchmark tests that compare our algorithm with LLVM's pre-allocation scheduling algorithm.
First, here is a brief description of our work:
We have developed a combinatorial algorithm for balancing instruction-level parallelism (ILP) and register pressure (RP) during pre-allocation scheduling. The algorithm is based on a branch-and-bound (B&B) approach, wherein the objective function is a linear combination of schedule length and register pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as an alternate pre-allocation scheduler. Then we compared the performance of our (B&B) scheduler with that of LLVM's default scheduler on x86 (BURR scheduler on x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our B&B scheduler significantly improves the performance of some benchmarks relative to LLVM's default scheduler by up to 21%. The geometric-mean speedup on FP2006 is about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler on x86-64 uses "rough" latency values. So, we added the precise latency values published by Agner
(http://www.agner.org/optimize/) and that led to more speedup relative to LLVM's ILP scheduler on some benchmarks. The most significant gain from adding precise latencies was on the gromacs benchmark, which has a high degree of ILP. I am attaching the benchmarking results on x86-64 using both LLVM's rough latencies and Agner's precise latencies:
This work makes two points:
-A B&B
algorithm can discover significantly better schedules than a heuristic
can do for some larger hard-to-schedule blocks, and if such blocks
happen to occur in hot code, their scheduling will have a substantial
impact on
performance.
- A B&B algorithm is generally slower than a heuristic, but
it is not a slow as most people think. By applying such an algorithm
selectively to the hot blocks that are likely to benefit from it and
setting some compile-time budget, a significant performance gain may be
achieved with a relatively small increase in compile time.
My questions are:
1. Our current experimental results are based on LLVM 2.9. We
definitely plan on upgrading to the latest LLVM version in our future
work, but is there a fundamentally compelling reason for us to upgrade now to 3.1 for the sake of making the above points in the publication?
2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes it a pure RR scheduler with no ILP), while the ILP scheduler on
x86-64 appears to set all latencies to 10 expect for a few long-latency
instructions. For the sake of documenting this in the paper, does anyone know (or can point
me to) a precise description of how the scheduler sets latency values?
In the revised paper, I will add experimental results based on precise
latency values (see the attached spreadsheet) and would like to clearly
document how LLVM's rough latencies for x86 are determined.
3. Was the choice to use rough latency values in the ILP scheduler based
on the fact that using precise latencies makes it much harder for a
heuristic non-backtracking scheduler to balance ILP and RP or the choice was made simply because nobody bothered to write an x86 itinerary?
4. Does the ILP scheduler ever consider scheduling a stall (leaving a
cycle empty) when there are ready instructions? Here is a small hypothetical example that explains what I mean:
Suppose
that at Cycle C the register pressure (RP) is equal to the physical
limit and all ready instructions in that cycle start new live ranges,
thus increasing the RP above the physical register limit. However, in a
later cycle C+Delta some instruction X that closes a currently open live
range will become ready. If the objective is minimizing RP, the right
choice to make in this case is leaving Cycles C through C+Delta-1 empty
and scheduling Instruction X in Cycle C+Delta. Otherwise, we will be
increasing the RP. Does the ILP scheduler ever make such a choice or it will
always schedule an instruction when the ready list is not empty?
Thank you in advance!
-Ghassan
Ghassan Shobaki
Assistant Professor
Department of Computer Science
Princess Sumaya University for Technology
Amman, Jordan
Attachments inlined:
Rough Latencies
Benchmark Branch-and-Bound LLVM
SPEC Score SPEC Score % Score Difference
400.perlbench 21.2 20.2 4.95%
401.bzip2 13.9 13.6 2.21%
403.gcc 19.5 19.8 -1.52%
429.mcf 20.5 20.5 0.00%
445.gobmk 18.6 18.6 0.00%
456.hmmer 11.1 11.1 0.00%
458.sjeng 19.3 19.3 0.00%
462.libquantum 39.5 39.5 0.00%
464.h264ref 28.5 28.5 0.00%
471.omnetpp 15.6 15.6 0.00%
473.astar 13 13 0.00%
483.xalancbmk 21.9 21.9 0.00%
GEOMEAN 19.0929865 19.00588287 0.46%
410.bwaves 15.2 15.2 0.00%
416.gamess CE CE #VALUE!
433.milc 19 18.6 2.15%
434.zeusmp 14.2 14.2 0.00%
435.gromacs 11.6 11.3 2.65%
436.cactusADM 8.31 7.89 5.32%
437.leslie3d 11 11 0.00%
444.namd 16 16 0.00%
447.dealII 25.4 25.4 0.00%
450.soplex 26.1 26.1 0.00%
453.povray 20.5 20.5 0.00%
454.calculix 8.44 8.3 1.69%
459.GemsFDTD 10.7 10.7 0.00%
465.tonto CE CE #VALUE!
470.lbm 38.1 31.5 20.95%
481.wrf 11.6 11.6 0.00%
482.sphinx3 28.2 26.9 4.83%
GEOMEAN 15.91486307 15.54419555 2.38%
Precise Latencies
Benchmark Branch-and-Bound LLVM
SPEC Score SPEC Score % Score Difference
400.perlbench 21.2 20.2 4.95%
401.bzip2 13.9 13.6 2.21%
403.gcc 19.6 19.8 -1.01%
429.mcf 20.8 20.5 1.46%
445.gobmk 18.8 18.6 1.08%
456.hmmer 11.1 11.1 0.00%
458.sjeng 19.3 19.3 0.00%
462.libquantum 39.5 39.5 0.00%
464.h264ref 28.5 28.5 0.00%
471.omnetpp 15.6 15.6 0.00%
473.astar 13 13 0.00%
483.xalancbmk 21.9 21.9 0.00%
GEOMEAN 19.14131861 19.00588287
0.71%
410.bwaves 15.5 15.2 1.97%
416.gamess CE CE #VALUE!
433.milc 19.3 18.6 3.76%
434.zeusmp 14.2 14.2 0.00%
435.gromacs 12.4 11.3 9.73%
436.cactusADM 7.7 7.89 -2.41%
437.leslie3d 11 11 0.00%
444.namd 16.2 16 1.25%
447.dealII 25.4 25.4 0.00%
450.soplex 26.1 26.1 0.00%
453.povray 20.5 20.5 0.00%
454.calculix 8.55 8.3 3.01%
459.GemsFDTD 10.5 10.7 -1.87%
465.tonto CE CE #VALUE!
470.lbm 38.8 31.5 23.17%
481.wrf 11.6 11.6 0.00%
482.sphinx3 28 26.9 4.09%
GEOMEAN 15.96082174 15.54419555 2.68%
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-64_BB_vs_LLVM_roughLatencies.xls
Type: application/vnd.ms-excel
Size: 10240 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment.xls>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-64_BB_vs_LLVM_preciseLatencies.xls
Type: application/vnd.ms-excel
Size: 10240 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120929/517814e8/attachment-0001.xls>
More information about the llvm-dev
mailing list