[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
renato.golin at linaro.org
Thu Sep 19 10:27:46 PDT 2013
On 19 September 2013 17:25, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> Ghassan: You have made me so curious to try other benchmarks in our future
> work. Most academic publications on CPU performance though use SPEC. You
> can even find some recent publications that are still using SPEC CPU2000!
> When I was at AMD in 2009, performance optimization and benchmarking was
> all about SPEC CPU2006. Have things changed so much in the past 4 years?
Unfortunately, no. Most manufacturers still use SPEC (and others) to
design, test and certify their products.
This is not a problem per se, as SPEC is very good and reasonably generic,
but no single benchmark can cover the wide range of applications a CPU
is likely to run over its lifetime. So, my grudge is that there isn't much
effort put into understanding how to benchmark the different uses of a CPU,
not necessarily against SPEC. I think SPEC is a good match for your project.
> And the more important question is: what specific features do these
> non-SPEC benchmarks have that are likely to affect the scheduler's register
> pressure reduction behavior?
No idea. ;) Mind you, I don't know of any decent benchmark that will give
you the "general user" case, but there are a number of specific benchmarks
(browsers, codecs, databases, web servers all have benchmark features
Also, for your project, you're only interested in a very specific behaviour
of a very specific part of the compiler (spills), so any benchmark will
give you a way to test it, but every one will have some form of bias.
What I recommend is not to spend much time running a plethora of
benchmarks, only to find out that they all tell you the same story, but try
to find a benchmark that is completely different from SPEC (say,
Browsermark or the MySQL benchmark suite) and see if the spill correlation
is similar.
If it is, ignore it. If not, just mention that this correlation may not be
seen with other benchmarks. ;)
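
As a rough illustration of the comparison suggested above, here is a minimal
sketch that computes a Pearson correlation coefficient per benchmark suite.
It assumes "spill correlation" means the correlation between the change in
spill count and the change in execution time. That reading, the function
name and all the numbers are hypothetical and not taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-benchmark data: % change in spills, % change in run time.
spec_spills = [-10.0, -5.0, -20.0, 0.0, -15.0]
spec_time = [-1.0, -0.4, -2.2, 0.1, -1.4]
other_spills = [-12.0, -3.0, -18.0, -1.0, -9.0]
other_time = [0.3, -0.2, 0.1, 0.4, -0.5]

r_spec = pearson(spec_spills, spec_time)
r_other = pearson(other_spills, other_time)
print(f"SPEC-like suite:  r = {r_spec:.2f}")
print(f"Different suite:  r = {r_other:.2f}")
```

If the two coefficients are close, the second suite tells the same story.
If they differ a lot, that is the caveat worth mentioning.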
> Ghassan: Can you please give more specific features in these modern
> benchmarks that affect spill code reduction? Note that our study included
> over ten thousand functions with spills. Such a large sample is expected to
> cover many different kinds of behavior, and that's why I am calling it a
> "statistically significant" sample.
I was being a bit pedantic in pointing out that 10K data points are only
statistically relevant if they're independent, which they might not be if
each individual test was created / crafted with the same intent in mind
(similar function size, number of functions, number of temporaries, etc).
Most programmers don't pay that much attention to good code and end up
writing horrible code that stresses specific parts of the compiler. If you
have access to the Plum Hall suite, I encourage you to compile the chapter
7.22 test as an example.
Also, related to register pressure, different kinds of bad code will stress
different algorithms, so you also have to be careful in stating that one
algorithm is much better than others only based on one badly-written
> Ghassan: Sorry if I did not include a clear enough description of the
> numbers' meanings. Let me explain that more precisely:
> First of all, the "source" scheduler was indeed run for 9 iterations
> (which took about 2 days), and that was our baseline. All the numbers in
> the execution-time table are percentage differences relative to "source".
> Of course, there were random variations in the numbers, but we did the
> standard SPEC practice of taking the median. For most benchmarks, the
> random variation was not significant.
I see, my mistake.
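
For reference, the procedure described in the quoted text (take the median
over the runs, then report a percentage difference relative to the "source"
baseline) can be sketched as follows. This is only an illustration: the
function names and timings are made up, and the spread measure is one
possible way to flag a noisy benchmark, not necessarily what the study used.

```python
from statistics import median

def percent_diff_vs_baseline(baseline_runs, candidate_runs):
    """Percentage difference of the candidate median relative to the baseline median."""
    base = median(baseline_runs)
    cand = median(candidate_runs)
    return 100.0 * (cand - base) / base

def relative_spread(runs):
    """(max - min) / median: a crude indicator of run-to-run variation."""
    return (max(runs) - min(runs)) / median(runs)

# Hypothetical execution times in seconds, 9 runs per scheduler.
source_runs = [100.0, 101.0, 99.0, 100.5, 100.2, 99.8, 100.1, 100.3, 99.9]
other_runs = [97.0, 98.0, 96.5, 97.2, 97.1, 96.9, 97.3, 97.0, 97.4]

diff = percent_diff_vs_baseline(source_runs, other_runs)
print(f"difference vs. source: {diff:.2f}%")
print(f"spread of source runs: {100.0 * relative_spread(source_runs):.2f}%")
```

A negative difference means the candidate scheduler produced faster code
than "source". A spread that is large compared to the difference suggests
the comparison is not meaningful for that benchmark.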
> There was one particular benchmark though (libquantum), on which we thought
> that the random variation is too large to make a meaningful comparison, and
> therefore we decided to exclude that.
Quite amusing, having the libquantum behaving erratically. ;)