[LLVMdev] [RFC] Benchmarking subset of the test suite

Tobias Grosser tobias at grosser.es
Sun May 4 11:40:52 PDT 2014


On 04/05/2014 14:39, Hal Finkel wrote:
> At the LLVM Developers' Meeting in November, I promised to work on isolating a subset of the current test suite that is useful for benchmarking. Having looked at this in more detail, most of the applications and benchmarks in the test suite are useful for benchmarking, and so I think that a better way of phrasing it is that we should construct a list of programs in the test suite that are not useful for benchmarking.
>
> My proposed exclusion list is provided below. I constructed this exclusion list primarily based on the following experiment: I ran the test suite 10 times in three configurations: 1) on an IBM POWER7 (P7) with -O3 -mvsx, 2) on a P7 at -O0, and 3) on an Intel Xeon E5430 with -O3, all using make -j6. I then used the ministat utility (which performs a t-test) to compare the timings of the two P7 configurations against each other and against the Xeon configuration, requiring a detectable difference at 99.5% confidence. I looked for tests that showed no significant difference in all three comparisons. The running configuration here is purposefully noisy; the idea is to eliminate those tests that are significantly sensitive to startup time, file I/O time, memory bandwidth, etc., or are just too short, and by running many tests in parallel (non-deterministically), my hope is to eliminate those tests that cannot usefully serve as benchmarks in a "normal" environment.
>
> I'll admit to being somewhat surprised that so many of the Prolangs and Shootout "benchmarks" seemingly fail to serve as useful benchmarks; perhaps someone can look into improving their problem sizes, etc.
>
> Without further ado, I propose that a test-suite configuration designed for benchmarking exclude the following:

Hi Hal,

thanks for putting in the effort! I think the systematic approach you 
have taken is very sensible.
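
(For anyone who wants to replay the filtering step without ministat: 
its core is a two-sample t-test over the per-run timings. The following 
is a minimal, self-contained sketch using Welch's variant of the test; 
ministat's exact statistic may differ. The timings are purely 
hypothetical and the critical value is hard-coded for 10 samples per 
configuration.)

#include <cmath>
#include <iostream>
#include <vector>

// Mean and sample variance of a set of timing samples.
static void stats(const std::vector<double> &x, double &mean, double &var) {
    mean = 0.0;
    for (double v : x) mean += v;
    mean /= x.size();
    var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= x.size() - 1;
}

// Welch's t statistic for two independent samples; returns true if the
// difference exceeds the given critical value.
static bool significantlyDifferent(const std::vector<double> &a,
                                   const std::vector<double> &b,
                                   double tCrit) {
    double ma, va, mb, vb;
    stats(a, ma, va);
    stats(b, mb, vb);
    double se = std::sqrt(va / a.size() + vb / b.size());
    double t = (ma - mb) / se;
    return std::fabs(t) > tCrit;
}

int main() {
    // Hypothetical timings (seconds) of one test in two configurations.
    std::vector<double> o3 = {1.02, 0.98, 1.01, 1.00, 0.99,
                              1.03, 1.01, 0.97, 1.00, 1.02};
    std::vector<double> o0 = {1.01, 1.00, 1.02, 0.99, 1.01,
                              0.98, 1.03, 1.00, 1.01, 0.99};
    // ~3.69 is the two-sided critical value at 99.5% confidence for t
    // with 9 degrees of freedom (a conservative choice for n = 10
    // samples per configuration).
    if (!significantlyDifferent(o3, o0, 3.69))
        std::cout << "no significant difference; candidate for exclusion\n";
    return 0;
}

With real data one would of course feed in the per-run timings from the 
test-suite reports instead of the hard-coded vectors.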

I went through your list and looked at a couple of interesting cases. 
For the Shootout benchmarks I looked at the results and the history 
shown by my LNT -O3 builder (long history, always 10 samples per run: 
http://llvm.org/perf/db_default/v4/nts/25326).

Some observations from my side:

## Many benchmarks from your list have a reported runtime of zero 
seconds in my tester

## For some of the benchmarks you propose to exclude, manually looking
   at the historic samples allows a human to spot certain trends (see
   the sketch after the graphs below):

 > MultiSource/Benchmarks/Prolangs-C/football/football

http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.237=34.237.3&submit=Update

 > MultiSource/Benchmarks/Prolangs-C/simulator/simulator

http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.314=34.314.3&submit=Update
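
If one wanted to automate this kind of trend spotting, a plain moving 
average over the historic samples (the same idea as the 
moving_window_size=10 option in the graph URLs above) already makes 
slow drifts visible. A minimal sketch with hypothetical sample data:

#include <iostream>
#include <vector>

// Moving average with the given window over a series of historic
// samples, mirroring the moving_window_size option of the LNT graphs.
static std::vector<double> movingAverage(const std::vector<double> &samples,
                                         size_t window) {
    std::vector<double> out;
    double sum = 0.0;
    for (size_t i = 0; i < samples.size(); ++i) {
        sum += samples[i];
        if (i + 1 >= window) {
            out.push_back(sum / window);
            sum -= samples[i + 1 - window];
        }
    }
    return out;
}

int main() {
    // Hypothetical execution times (seconds) drifting slowly upwards.
    std::vector<double> history = {0.50, 0.51, 0.49, 0.50, 0.52, 0.51,
                                   0.53, 0.54, 0.53, 0.55, 0.56, 0.57};
    for (double v : movingAverage(history, 4))
        std::cout << v << '\n';
    return 0;
}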

## Some other benchmarks with zero-second execution times are not 
contained in your list, e.g.:

SingleSource/Benchmarks/Shootout/objinst
SingleSource/Benchmarks/Shootout-C++/objinst
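
A complementary, purely mechanical filter would flag every test whose 
reported execution time is exactly zero, i.e. below the timer 
resolution. A minimal sketch, assuming a hypothetical "name seconds" 
report format on stdin:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Reads "test-name seconds" pairs from stdin and flags tests whose
    // reported runtime is exactly zero (below the timer's resolution).
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream is(line);
        std::string name;
        double seconds;
        if ((is >> name >> seconds) && seconds == 0.0)
            std::cout << name << ": zero runtime; candidate for exclusion\n";
    }
    return 0;
}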

## Some benchmarks on your list are really not benchmarks at all:

Shootout hello:

#include <stdio.h>

int main() {
     puts("hello world\n");
     return(0);
}

Shootout sumcol:

#include <iostream>
#include <stdlib.h>
#define MAXLINELEN 128
using namespace std;

int main(int argc, char **argv) {
    char line[MAXLINELEN];
    int sum = 0;
    char buff[4096];
    cin.rdbuf()->pubsetbuf(buff, 4096); // enable buffering

    while (cin.getline(line, MAXLINELEN)) {
        sum += atoi(line);
    }
    cout << sum << '\n';
}

To sum up, I believe this list might benefit from some improvements, 
but it seems to be a really good start. If someone wants to do a more 
extensive analysis, we can always analyze the historic data available 
from my -O3 performance buildbot; it should give us a very good idea of 
how noisy certain benchmarks are.

Cheers,
Tobias


