[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
ghassan_shobaki at yahoo.com
Thu Sep 19 11:13:42 PDT 2013
I should note here that although SPEC provided us with a sufficiently
large sample for our spill-count experiment, I don't think that SPEC has
enough hot functions with spills to make our execution-time results
statistically significant. That's because SPEC has many benchmarks with
peaky profiles, where one or two functions dominate the execution time.
So, if one heuristic gets very lucky (or unlucky) on a few hot
functions, it may get a deceptively high (or low) score.
That's why I think that if someone runs the same kind of test on a
different benchmark suite of comparable size, he may get different
execution-time results, but he will most likely get the same spill-count
results that we got (of course, I mean the relative results).
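To make the peaky-profile concern concrete, here is a minimal sketch. The profile shares and per-function changes are invented for illustration and are not taken from the SPEC measurements; `program_time_change` is a hypothetical helper. It shows how one unlucky result on a single hot function can outweigh improvements everywhere else.

```python
# Illustrative only: the profile shares and per-function changes are invented.
def program_time_change(profile, change):
    """Percentage change in total run time.

    profile: fraction of baseline run time spent in each function (sums to 1).
    change:  percentage change in each function's own run time.
    """
    return sum(share * change[name] for name, share in profile.items())

# Hypothetical peaky profile: one hot function takes 90% of the run time.
profile = {"hot": 0.90, "cold_a": 0.05, "cold_b": 0.05}

# The heuristic speeds up both cold functions by 10% but is
# slightly unlucky (2% slower) on the hot one.
change = {"hot": 2.0, "cold_a": -10.0, "cold_b": -10.0}

overall = program_time_change(profile, change)
print(f"overall change: {overall:+.1f}%")
```

With these made-up numbers the program as a whole gets slower even though two of the three functions improved, which is why a handful of hot functions can dominate an execution-time score.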
From: Renato Golin <renato.golin at linaro.org>
To: Ghassan Shobaki <ghassan_shobaki at yahoo.com>
Cc: Andrew Trick <atrick at apple.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Sent: Thursday, September 19, 2013 8:27 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
On 19 September 2013 17:25, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
Ghassan: You have made me so curious to try other benchmarks in our future work. Most academic publications on CPU performance though use SPEC. You can even find some recent publications that are still using SPEC CPU2000! When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past 4 years?
Unfortunately, no. Most manufacturers still use SPEC (and others) to design, test and certify their products.
This is not a problem per se, as SPEC is very good and reasonably generic, but no single benchmark can cover the wide range of applications a CPU is likely to run over its lifetime. So, my grudge is that not much effort goes into understanding how to benchmark the different uses of a CPU; it is not necessarily against SPEC. I think SPEC is a good match for your project.
Ghassan: And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register pressure reduction behavior?
No idea. ;) Mind you, I don't know any decent benchmark that will give you the "general user" case, but there are a number of specific benchmarks (browsers, codecs, databases, web servers all have benchmark features enabled).
Also, for your project, you're only interested in a very specific behaviour of a very specific part of the compiler (spills), so any benchmark will give you a way to test it, but every one will have some form of bias.
What I recommend is not to spend much time running a plethora of benchmarks, only to find out that they all tell you the same story, but try to find a benchmark that is completely different from SPEC (say, Browsermark or the MySQL benchmark suite) and see if the spill correlation is similar.
If it is, ignore it. If not, just mention that this correlation may not be seen with other benchmarks. ;)
Ghassan: Can you please give more specific features in these modern benchmarks that affect spill code reduction? Note that our study included over ten thousand functions with spills. Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.
I was being a bit pedantic in pointing out that 10K data points are only statistically relevant if they're independent, which they might not be if each individual test was created / crafted with the same intent in mind (similar function size, number of functions, number of temporaries, etc).
Most programmers don't pay that much attention to good code and end up writing horrible code that stresses specific parts of the compiler. If you have access to the PlumHall suite, I encourage you to compile the chapter 7.22 test as an example.
Also, related to register pressure, different kinds of bad code will stress different algorithms, so you also have to be careful about stating that one algorithm is much better than the others based only on one badly written program.
Ghassan: Sorry if I did not include a clear enough description of the numbers' meanings. Let me explain that more precisely:
First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant.
I see, my mistake.
Ghassan: There was one particular benchmark though (libquantum), on which we thought that the random variation is too large to make a meaningful comparison, and therefore we decided to exclude that.
Quite amusing, having the libquantum behaving erratically. ;)
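For reference, the median-of-runs comparison against the "source" baseline described above can be sketched as follows. This is only an illustration under stated assumptions: the sample values are invented, `percent_diff_vs_baseline` is a hypothetical helper, and the sign convention (candidate minus baseline) is assumed rather than taken from the original tables.

```python
from statistics import median

def percent_diff_vs_baseline(baseline_runs, candidate_runs):
    """Percentage difference of the candidate's median relative to the
    baseline's median. Assumed convention: (candidate - baseline) / baseline."""
    base = median(baseline_runs)
    cand = median(candidate_runs)
    return 100.0 * (cand - base) / base

# Hypothetical measurements from 9 iterations each (arbitrary units).
source_runs = [100.0, 101.0, 99.0, 100.5, 99.5, 100.0, 102.0, 98.0, 100.0]
other_runs = [97.0, 98.0, 96.0, 97.5, 96.5, 97.0, 99.0, 95.0, 97.0]

diff = percent_diff_vs_baseline(source_runs, other_runs)
print(f"difference relative to source: {diff:+.2f}%")
```

Taking the median rather than the mean keeps a single outlier run from shifting the reported difference, which matters less when the run-to-run variation is small and more for a noisy benchmark.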