[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3

Ghassan Shobaki ghassan_shobaki at yahoo.com
Thu Sep 19 09:25:32 PDT 2013


Hi Renato,

Please see my answers below.

Thanks
-Ghassan




________________________________
 From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com> 
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu> 
Sent: Thursday, September 19, 2013 5:30 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
 


On 17 September 2013 19:04, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:

We have done some experimental evaluation of the different schedulers in LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done
on x86-64 using SPEC CPU2006. We measured both the amount of spill code and
the execution time, as detailed below.

Hi Ghassan,

This is an amazing piece of work, thanks for doing this. We need more benchmarks like yours, and more often, too.


3. The source scheduler is the second-best scheduler in terms of spill code and
execution time, and its performance is very close to that of BURR in both
metrics. This result was surprising to me because, as far as I understand,
this is a conservative scheduler that tries to preserve the original
program order. Does this result surprise you?

Well, SPEC is an old benchmark, from a time when code was written to accommodate the hardware's requirements, so preserving the code order might not be as big a deal on SPEC as it is on other types of code. So far, I haven't found SPEC to be a good judge of overall compiler performance, only of specific micro-optimized features.

Besides, hardware and software are designed nowadays based on some version of Dhrystone, EEMBC, SPEC or CoreMark, so it's not impossible to see a 50% increase in performance with little change in either.

Ghassan: You have made me curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC; you can even find recent publications that are still using SPEC CPU2000! When I was at AMD in 2009, performance optimization and benchmarking were all about SPEC CPU2006. Have things changed so much in the past 4 years? The more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure-reduction behavior?
 

4. The ILP scheduler has the worst execution times on FP2006 and the second-worst
spill counts, although it is the default on x86-64. Is this surprising?
BTW, DragonEgg sets the scheduler to source. On Line 368 in Backend.cpp, we
find:

if (!flag_schedule_insns)
    Args.push_back("--pre-RA-sched=source");

This looks like someone ran a similar test and did the sensible thing. How that carries over to Clang, or how important it is to be the default, I don't know. This is the same discussion as the one about optimization levels and which passes should be included at which level. It also depends on which scheduler will evolve faster or further over time, and on what kind of code you're compiling...
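
For anyone who wants to try the same comparison without going through DragonEgg, the scheduler can be selected with the same option directly on llc, or through Clang's -mllvm pass-through. Something like the following should work (written from memory, so double-check the registered scheduler names; "source", "list-burr", "list-ilp" and "fast" should be among them):

  llc -pre-RA-sched=source test.ll -o test.s
  clang -O3 -mllvm -pre-RA-sched=list-ilp test.c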


This is not a perfectly accurate
metric, but, given the large sample size (> 10K functions), the total number
of spills across such a statistically significant sample is believed to give a
very strong indication about each scheduler's performance at reducing register
pressure.

I agree this is a good enough metric, but I'd be cautious about stating that there is a "very strong indication about each scheduler's performance". SPEC is, after all, a special case in the compiler/hardware world, and anything you see here might not happen anywhere else.

Real-world, modern code (such as the LAMP stack, browsers, office suites, etc.) is written expecting the compiler to do magic, while old-school benchmarks weren't, and they have been optimized for decades by both compiler and hardware engineers.

Ghassan: Can you please give more specific features of these modern workloads that affect spill-code reduction? Note that our study included over ten thousand functions with spills. Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.
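
To make the metric concrete: for every function compiled, we record the number of spills the register allocator emitted under each scheduler and then sum those counts per scheduler. Roughly like this (a simplified sketch, not our actual tooling; the struct and field names are made up for illustration):

  #include <map>
  #include <string>
  #include <vector>

  // One measurement: how many spills one scheduler produced for one function.
  struct FunctionSpills {
    std::string Scheduler; // e.g. "source", "list-burr", "list-ilp", "fast", "mi"
    std::string Function;  // name of the compiled function
    unsigned Spills;       // spills the register allocator emitted for it
  };

  // Sum the per-function spill counts separately for each scheduler.
  std::map<std::string, unsigned long long>
  totalSpills(const std::vector<FunctionSpills> &Samples) {
    std::map<std::string, unsigned long long> Totals;
    for (size_t i = 0, e = Samples.size(); i != e; ++i)
      Totals[Samples[i].Scheduler] += Samples[i].Spills;
    return Totals;
  }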


The %Diff Max (Min) is the maximum (minimum) percentage difference on a single
benchmark between each scheduler and the source scheduler. These numbers show that the differences on individual FP benchmarks can be quite significant.

I'm surprised that you didn't run "source" 5/9 times, too. Did you get the exact same performance numbers every time? It would be good to have a more realistic geo-mean for source as well, so we could estimate how much the other geo-means vary in comparison to source's.

Ghassan: Sorry if I did not include a clear enough description of what the numbers mean. Let me explain more precisely:
First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant. There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to allow a meaningful comparison, and therefore we decided to exclude it.

The "% Diff Max" and "% Diff Min" numbers reported in our table are NOT random variations on an individual benchmark. Rather, the "% Diff Max" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *gain* relative to source. Similarly, the "% Diff Min" for a given heuristic is the percentage difference (in median scores) between that heuristic and the source heuristic on the benchmark where that heuristic gave its biggest *degradation* relative to source. So the two numbers generally come from two different benchmarks. The point of giving these numbers is to show that, although the geometric-mean differences may look small, the differences on individual benchmarks were quite significant. I can provide more detailed numbers for all benchmarks if people are interested; I can post those on our web site or on any benchmarking page that LLVM may have.
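
In case it helps, this is roughly how those two columns are computed (a quick sketch, not our actual scripts; it assumes higher median scores are better, as with SPEC ratios):

  #include <algorithm>
  #include <limits>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Scores[benchmark] = the scores measured over the 9 runs of one heuristic.
  typedef std::map<std::string, std::vector<double> > ScoreTable;

  static double median(std::vector<double> Runs) {
    std::sort(Runs.begin(), Runs.end());
    size_t N = Runs.size();
    return N % 2 ? Runs[N / 2] : (Runs[N / 2 - 1] + Runs[N / 2]) / 2.0;
  }

  // Returns the biggest gain (%Diff Max) and the biggest degradation
  // (%Diff Min) of Heuristic relative to the source baseline; the two values
  // generally come from two different benchmarks. Assumes both tables cover
  // the same set of benchmarks.
  static std::pair<double, double> diffMaxMin(const ScoreTable &Heuristic,
                                              const ScoreTable &Source) {
    double DiffMax = -std::numeric_limits<double>::infinity();
    double DiffMin = std::numeric_limits<double>::infinity();
    for (ScoreTable::const_iterator I = Heuristic.begin(), E = Heuristic.end();
         I != E; ++I) {
      double Base = median(Source.find(I->first)->second);
      double Diff = 100.0 * (median(I->second) - Base) / Base; // % vs. source
      DiffMax = std::max(DiffMax, Diff);
      DiffMin = std::min(DiffMin, Diff);
    }
    return std::make_pair(DiffMax, DiffMin);
  }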

Most of the above performance differences have been correlated with significant changes in spill counts in hot functions.

Which is a beautiful correlation between spill rate and performance, showing that your metrics are reasonably accurate for all practical purposes.


We should
probably report this as a performance bug if ILP stays the default scheduler on
x86-64.

You should, regardless of what the default choice is.

cheers,

--renato