[LLVMdev] LLVM's Pre-allocation Scheduler Tested against a Branch-and-Bound Scheduler

Sat Sep 29 03:36:42 PDT 2012

Hi Ghassan, this is very interesting, however...

> We are currently working on revising a journal article that describes our work
> on pre-allocation scheduling using LLVM and have some questions about LLVM's
> pre-allocation scheduler. The answers to these question will help us better
> document and analyze the results of our benchmark tests that compare our
> algorithm with LLVM's pre-allocation scheduling algorithm.
>
> First, here is a brief description of our work:
>
> We have developed a combinatorial algorithm for balancing instruction-level
> parallelism (ILP) and register pressure (RP) during pre-allocation scheduling.
> The algorithm is based on a branch-and-bound (B&B) approach, wherein the
> objective function is a linear combination of schedule length and register
> pressure. We have implemented this algorithm and integrated it into LLVM 2.9 as
> an alternate pre-allocation scheduler. Then we compared the performance of our
> (B&B) scheduler with that of LLVM's default scheduler on x86 (BURR scheduler on
> x86-32 and ILP on x86-64) using SPEC CPU2006. The results show that our B&B
> scheduler significantly improves the performance of some benchmarks relative to
> LLVM's default scheduler by up to 21%.

... are these differences statistically significant?  In my experience SPEC
scores can vary considerably on different runs, so how many runs did you do
and what was the estimated standard deviation?

Best wishes, Duncan.

  The geometric-mean speedup on FP2006 is
> about 2.4% across the entire suite. We then observed that LLVM's ILP scheduler
> on x86-64 uses "rough" latency values. So, we added the precise latency values
> published by Agner (http://www.agner.org/optimize/) and that led to more speedup
> relative to LLVM's ILP scheduler on some benchmarks. The most significant gain
> from adding precise latencies was on the gromacs benchmark, which has a high
> degree of ILP. I am attaching the benchmarking results on x86-64 using both
> LLVM's rough latencies and Agner's precise latencies:
>
> This work makes two points:
>
> -A B&B algorithm can discover significantly better schedules than a heuristic
> can do for some larger hard-to-schedule blocks, and if such blocks happen to
> occur in hot code, their scheduling will have a substantial impact on performance.
> - A B&B algorithm is generally slower than a heuristic, but it is not a slow as
> most people think. By applying such an algorithm selectively to the hot blocks
> that are likely to benefit from it and setting some compile-time budget, a
> significant performance gain may be achieved with a relatively small increase in
> compile time.
>
> My questions are:
>
> 1. Our current experimental results are based on LLVM 2.9. We definitely plan on
> upgrading to the latest LLVM version in our future work, but is there a
> fundamentally compelling reason for us to upgrade now to 3.1 for the sake of
> making the above points in the publication?
>
> 2. The BURR scheduler on x86-32 appears to set all latencies to one (which makes
> it a pure RR scheduler with no ILP), while the ILP scheduler on x86-64 appears
> to set all latencies to 10 expect for a few long-latency instructions. For the
> sake of documenting this in the paper, does anyone know (or can point me to) a
> precise description of how the scheduler sets latency values? In the revised
> paper, I will add experimental results based on precise latency values (see the
> attached spreadsheet) and would like to clearly document how LLVM's rough
> latencies for x86 are determined.
>
> 3. Was the choice to use rough latency values in the ILP scheduler based on the
> fact that using precise latencies makes it much harder for a heuristic
> non-backtracking scheduler to balance ILP and RP or the choice was made simply
> because nobody bothered to write an x86 itinerary?
>
> 4. Does the ILP scheduler ever consider scheduling a stall (leaving a cycle
> empty) when there are ready instructions? Here is a small hypothetical example
> that explains what I mean:
>
> Suppose that at Cycle C the register pressure (RP) is equal to the physical
> limit and all ready instructions in that cycle start new live ranges, thus
> increasing the RP above the physical register limit. However, in a later cycle
> C+Delta some instruction X that closes a currently open live range will become
> ready. If the objective is minimizing RP, the right choice to make in this case
> is leaving Cycles C through C+Delta-1 empty and scheduling Instruction X in
> Cycle C+Delta. Otherwise, we will be increasing the RP. Does the ILP scheduler
> ever make such a choice or it will always schedule an instruction when the ready
> list is not empty?
>
> Thank you in advance!
> -Ghassan
>
> Ghassan Shobaki
> Assistant Professor
> Department of Computer Science
> Princess Sumaya University for Technology
> Amman, Jordan
>
> Attachments inlined:
>
> Rough Latencies
>
> Benchmark 	Branch-and-Bound 	LLVM 	
>
> 	SPEC Score 	SPEC Score 	% Score Difference
> 400.perlbench 	21.2 	20.2 	4.95%
> 401.bzip2 	13.9 	13.6 	2.21%
> 403.gcc 	19.5 	19.8 	-1.52%
> 429.mcf 	20.5 	20.5 	0.00%
> 445.gobmk 	18.6 	18.6 	0.00%
> 456.hmmer 	11.1 	11.1 	0.00%
> 458.sjeng 	19.3 	19.3 	0.00%
> 462.libquantum 	39.5 	39.5 	0.00%
> 464.h264ref 	28.5 	28.5 	0.00%
> 471.omnetpp 	15.6 	15.6 	0.00%
> 473.astar 	13 	13 	0.00%
> 483.xalancbmk 	21.9 	21.9 	0.00%
> GEOMEAN 	19.0929865 	19.00588287 	    0.46%
> 410.bwaves	15.2 	15.2 	0.00%
> 416.gamess 	CE 	CE 	#VALUE!
> 433.milc	19 	18.6 	2.15%
> 434.zeusmp	14.2 	14.2 	0.00%
> 435.gromacs	11.6 	11.3 	2.65%
> 436.cactusADM 	8.31 	7.89 	5.32%
> 437.leslie3d 	11 	11 	0.00%
> 444.namd	16 	16 	0.00%
> 447.dealII 	25.4 	25.4 	0.00%
> 450.soplex 	26.1 	26.1 	0.00%
> 453.povray 	20.5 	20.5 	0.00%
> 454.calculix 	8.44 	8.3 	1.69%
> 459.GemsFDTD	10.7 	10.7 	0.00%
> 465.tonto 	CE 	CE 	#VALUE!
> 470.lbm 	38.1 	31.5 	20.95%
> 481.wrf 	11.6 	11.6 	0.00%
> 482.sphinx3 	28.2 	26.9 	4.83%
> GEOMEAN 	15.91486307 	15.54419555 	   2.38%
>
>
> Precise Latencies
>
> Benchmark 	Branch-and-Bound 	LLVM 	
>
> 	SPEC Score 	SPEC Score 	% Score Difference
> 400.perlbench 	21.2 	20.2 	4.95%
> 401.bzip2 	13.9 	13.6 	2.21%
> 403.gcc 	19.6 	19.8 	-1.01%
> 429.mcf 	20.8 	20.5 	1.46%
> 445.gobmk 	18.8 	18.6 	1.08%
> 456.hmmer 	11.1 	11.1 	0.00%
> 458.sjeng 	19.3 	19.3 	0.00%
> 462.libquantum 	39.5 	39.5 	0.00%
> 464.h264ref 	28.5 	28.5 	0.00%
> 471.omnetpp 	15.6 	15.6 	0.00%
> 473.astar 	13 	13 	0.00%
> 483.xalancbmk 	21.9 	21.9 	0.00%
> GEOMEAN 	19.14131861 	19.00588287
> 	0.71%
> 410.bwaves	15.5 	15.2 	1.97%
> 416.gamess 	CE 	CE 	#VALUE!
> 433.milc	19.3 	18.6 	3.76%
> 434.zeusmp	14.2 	14.2 	0.00%
> 435.gromacs	12.4 	11.3 	9.73%
> 436.cactusADM 	7.7 	7.89 	-2.41%
> 437.leslie3d 	11 	11 	0.00%
> 444.namd	16.2 	16 	1.25%
> 447.dealII 	25.4 	25.4 	0.00%
> 450.soplex 	26.1 	26.1 	0.00%
> 453.povray 	20.5 	20.5 	0.00%
> 454.calculix 	8.55 	8.3 	3.01%
> 459.GemsFDTD	10.5 	10.7 	-1.87%
> 465.tonto 	CE 	CE 	#VALUE!
> 470.lbm 	38.8 	31.5 	23.17%
> 481.wrf 	11.6 	11.6 	0.00%
> 482.sphinx3 	28 	26.9 	4.09%
> GEOMEAN 	15.96082174 	15.54419555 	    2.68%
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>