[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

Ghassan Shobaki ghassan_shobaki at yahoo.com
Thu Sep 19 07:18:59 PDT 2013


Our test machine has two Intel
Xeon E5540 processors running at 2.53 GHz with 24 GB of memory. Each CPU supports 8
hardware threads (16 threads in total), but all our tests were single-threaded. Which result is particularly surprising to you: the low impact of the MI scheduler, the relatively good performance of the source scheduler, or the relatively poor performance of the ILP scheduler?

Thanks
-Ghassan  


________________________________
 From: Benjamin Kramer <benny.kra at gmail.com>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com> 
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu> 
Sent: Thursday, September 19, 2013 4:53 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
 


On 17.09.2013, at 20:04, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:

> Hi Andy,
> 
> We have done some experimental evaluation of the different schedulers in LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64 using SPEC CPU2006. We have measured both the amount of spill code as well as the execution time as detailed below.
> 
> Here are our main findings:
> 
> 1. The SD schedulers significantly impact the spill counts and the execution times for many benchmarks, but the machine instruction (MI) scheduler in 3.3 has very limited impact on both. Is this because most of your work on MI did not make it into the 3.3 release? We don't have a strong motivation to test the trunk at this point (we'll wait for 3.4), because we are working on a publication and prefer to base it on an official release. However, if you tell me that you expect things to be significantly different in the trunk, we'll try to find the time to give that a shot (unfortunately, we have only one test machine, and SPEC tests take a lot of time, as detailed below).
> 
> 2. The BURR scheduler gives the minimum amount of spill code and the best overall execution time (SPEC geo-mean).
> 
> 3. The source scheduler is the second best in terms of both spill code and execution time, and its performance is very close to BURR's on both metrics. This result is surprising to me because, as far as I understand, this is a conservative scheduler that tries to preserve the original program order, isn't it? Does this result surprise you?
> 
> 4. The ILP scheduler has the worst execution times on FP2006 and the second worst spill counts, although it is the default on x86-64. Is this surprising? BTW, Dragon Egg sets the scheduler to source. On Line 368 in Backend.cpp, we find:
> if (!flag_schedule_insns)
>     Args.push_back("--pre-RA-sched=source");  
> 
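For readers who want to reproduce the per-scheduler runs, a minimal driver could look like the sketch below. The `-pre-RA-sched` flag is the one quoted above; the value spellings `list-burr`, `list-ilp`, and `fast` are the usual LLVM names for the BURR, ILP, and fast schedulers, but they should be checked against `llc -help` for your release.

```python
# Hedged sketch: build llc invocations that compile one bitcode file
# under each pre-RA scheduler. The scheduler value names other than
# "source" are assumptions; verify them with `llc -help`.

SCHEDULERS = ["source", "list-burr", "list-ilp", "fast"]

def llc_command(bitcode, sched):
    """Return the llc command line for one scheduler run."""
    return [
        "llc",
        "-pre-RA-sched=" + sched,          # select the SD scheduler
        "-o", bitcode + "." + sched + ".s",  # one assembly file per scheduler
        bitcode,
    ]

# On a real setup one would run each command, e.g. with
# subprocess.run(llc_command("lbm.bc", sched), check=True).
```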
> Here are the details of our results:
> 
> Spill Counts
> ---------------
> CPU2006 has a total of 47448 functions, out of which 10363 functions (22%) have spills. If we break this down by FP and INT, we’ll see that 42% of the functions in FP2006 have spills, while 10% of the functions in INT2006 have spills. The amount of spill code was measured by printing the number of ranges spilled by the default (greedy) register allocator (printing the variable NumSpilledRanges in InlineSpiller.cpp). This is not a perfectly accurate metric, but, given the large sample size (> 10K functions), the total number of spills across such a statistically significant sample is believed to give a very strong indication about each scheduler's performance at reducing register pressure. The differences in the table below are calculated relative to the source scheduler.
>  
> Heuristic      Total Spills   Source Spills   Spill Difference   % Spill Difference
> Source         294471         294471                       0              0.00%
> ILP            298222         294471                    3751              1.27%
> BURR           287932         294471                   -6539             -2.22%
> Fast           312787         294471                   18316              6.22%
> Source + MI    294979         294471                     508              0.17%
> ILP + MI       296681         294471                    2210              0.75%
> BURR + MI      289328         294471                   -5143             -1.75%
> Fast + MI      302131         294471                    7660              2.60%
> 
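The spill-count aggregation described above could be sketched as follows, assuming each per-benchmark build log contains the statistic printed for NumSpilledRanges in LLVM's standard `-stats` format (the exact description string, "Number of spilled live ranges", is an assumption; adjust the pattern to whatever your build prints):

```python
import re

# Hedged sketch: sum the spilled-ranges statistic across many build logs.
# Assumes stat lines in the usual LLVM -stats layout, e.g.
#     3751 regalloc - Number of spilled live ranges
STAT_RE = re.compile(r"^\s*(\d+)\s+\S+\s+-\s+Number of spilled", re.MULTILINE)

def total_spills(logs):
    """Sum the spill statistic over an iterable of log texts."""
    return sum(int(m.group(1)) for log in logs for m in STAT_RE.finditer(log))

# Fabricated log fragments for illustration:
logs = [
    "  3751 regalloc - Number of spilled live ranges\n",
    "   508 regalloc - Number of spilled live ranges\n",
]
```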
> So, the best register pressure reduction scheduler is BURR. Note that enabling the MI scheduler makes things better when the SD scheduler is relatively weak at reducing register pressure (fast or ILP), while it makes things worse when the SD scheduler is relatively good at reducing register pressure (BURR or source).
> 
> Execution Times
> ---------------------
> Execution times were measured by running the benchmarks on an x86-64 machine with 5 or 9 iterations per benchmark, as indicated below (in most cases, no significant difference was observed between 9 iterations, which take about two days, and 5 iterations, which take about one day). The differences in the table below are calculated relative to the source scheduler. The %Diff Max (Min) is the maximum (minimum) percentage difference on a single benchmark between each scheduler and the source scheduler. These numbers show that the differences on individual FP benchmarks can be quite significant.
> 
> Heuristic      FP %Diff   FP %Diff   FP %Diff   INT %Diff   INT %Diff   INT %Diff   Iterations
>                Geo-mean     Max        Min       Geo-mean      Max        Min
> source           0.00%      0.00%      0.00%       0.00%      0.00%      0.00%
> ILP             -2.02%      2.30%    -22.04%       0.42%      3.61%     -2.16%     9
> BURR             0.70%      8.62%     -5.56%       0.66%      3.09%     -1.40%     9
> fast            -1.34%      9.48%     -6.72%       0.12%      3.09%     -2.34%     5
> source + MI      0.21%      3.42%     -1.26%      -0.01%      0.83%     -0.94%     5
>  
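The geo-mean columns above can be reproduced with a short calculation, assuming %Diff is the negated relative change of the geometric-mean runtime versus the source scheduler (so a slowdown appears as a negative number, matching the -22.04% lbm entry); the runtime ratios in the example are made up for illustration:

```python
import math

def geomean_pct_diff(ratios):
    """Percentage difference of the geometric-mean runtime vs. source.

    `ratios` are per-benchmark runtimes divided by the source-scheduler
    runtime (ratio > 1 means slower than source). A positive return value
    means faster than source overall; negative means slower.
    """
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (1.0 - math.exp(log_mean)) * 100.0

# Fabricated example: one benchmark 25% slower than source.
slowdown = geomean_pct_diff([1.25])   # roughly -25%
```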
> Most of the above performance differences have been correlated with significant changes in spill counts in hot functions. Note that the ILP scheduler causes a degradation of 22% on one benchmark (lbm) relative to the source scheduler. We have verified that this happens because of poor scheduling that increases the register pressure and thus leads to generating excessive spills in this benchmark’s hottest loop. We should probably report this as a performance bug if ILP stays the default scheduler on x86-64.

I find the results surprising, too. What CPU did you perform your tests on? Scheduler performance can vary a lot depending on the microarchitecture of your chip.

- Ben