[LLVMdev] [RFC] Benchmarking subset of the test suite
Hal Finkel
hfinkel at anl.gov
Sun May 4 14:01:33 PDT 2014
----- Original Message -----
> From: "Tobias Grosser" <tobias at grosser.es>
> To: "Hal Finkel" <hfinkel at anl.gov>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Sunday, May 4, 2014 1:40:52 PM
> Subject: Re: [LLVMdev] [RFC] Benchmarking subset of the test suite
>
> On 04/05/2014 14:39, Hal Finkel wrote:
> > At the LLVM Developers' Meeting in November, I promised to work on
> > isolating a subset of the current test suite that is useful for
> > benchmarking. Having looked at this in more detail, most of the
> > applications and benchmarks in the test suite are useful for
> > benchmarking, and so I think that a better way of phrasing it is
> > that we should construct a list of programs in the test suite that
> > are not useful for benchmarking.
> >
> > My proposed exclusion list is provided below. I constructed this
> > exclusion list primarily based on the following experiment: I ran
> > the test suite 10 times in three configurations: 1) On an IBM
> > POWER7 (P7) with -O3 -mvsx, 2) On a P7 at -O0 and 3) On an Intel
> > Xeon E5430 with -O3, all using make -j6. I then used the ministat
> > utility (which performs a T test) to compare the timings of the
> > two P7 configurations against each other and the Xeon
> > configuration, requiring a detectable difference at 99.5%
> > confidence. I looked for tests that showed no significant
> > difference in all three comparisons. The running configuration
> > here is purposefully noisy; the idea is to eliminate those tests
> > that are significantly sensitive to startup time, file I/O time,
> > memory bandwidth, etc., or just too short, and by running many
> > tests in parallel (non-deterministically), my hope is to eliminate
> > those tests can cannot usefully serve as benchmarks in a "normal"
> > environment.
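(As an aside, for anyone wanting to reproduce the comparison: ministat reads one sample per line per input file, so something like "ministat -c 99.5 p7_o3.txt p7_o0.txt" performs the test; those file names are just placeholders. The sketch below shows the core of such a comparison, a two-sample Welch's t test over made-up timing samples, not actual test-suite data:

#include <cmath>
#include <cstdio>
#include <vector>

// Mean of a sample.
static double mean(const std::vector<double> &x) {
  double s = 0;
  for (double v : x)
    s += v;
  return s / x.size();
}

// Unbiased sample variance.
static double var(const std::vector<double> &x, double m) {
  double s = 0;
  for (double v : x)
    s += (v - m) * (v - m);
  return s / (x.size() - 1);
}

int main() {
  // Ten wall-clock samples per configuration for one benchmark
  // (made-up numbers standing in for, e.g., P7 -O3 vs. P7 -O0).
  std::vector<double> o3 = {1.41, 1.44, 1.39, 1.42, 1.40,
                            1.43, 1.41, 1.40, 1.42, 1.44};
  std::vector<double> o0 = {1.92, 1.88, 1.95, 1.90, 1.93,
                            1.89, 1.91, 1.94, 1.90, 1.92};

  double m3 = mean(o3), m0 = mean(o0);
  // Welch's t statistic; if |t| exceeds the critical value for the
  // requested confidence level (99.5% in my runs), the two
  // configurations differ significantly, and the test can serve as a
  // benchmark for this purpose.
  double t = (m3 - m0) /
             std::sqrt(var(o3, m3) / o3.size() + var(o0, m0) / o0.size());
  std::printf("t = %f\n", t);
  return 0;
}

A test that shows no such difference in any comparison is the kind of test I propose excluding.)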
> >
> > I'll admit being somewhat surprised by so many of the Prolangs and
> > Shootout "benchmarks" seemingly not serving as useful benchmarks;
> > perhaps someone can look into improving the problem size, etc. of
> > these.
> >
> > Without further ado, I propose that a test-suite configuration
> > designed for benchmarking exclude the following:
>
> Hi Hal,
>
> thanks for putting in the effort! I think the systematic approach
> you have taken is very sensible.
>
> I went through your list and looked at a couple of interesting cases.
Thanks! -- I figured you'd have something to add to this endeavor ;)
> For the shootout benchmarks I looked at the results and the history
> my LNT -O3 builder shows (long history, always 10 samples per run,
> http://llvm.org/perf/db_default/v4/nts/25326)
>
> Some observations from my side:
>
> ## Many benchmarks from your list have a runtime of zero seconds
> reported in my tester
This is true for my data as well.
>
> ## For some of the benchmarks you propose, manually looking at the
> historic samples allows a human to spot certain trends:
>
> > MultiSource/Benchmarks/Prolangs-C/football/football
>
> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.237=34.237.3&submit=Update
>
> > MultiSource/Benchmarks/Prolangs-C/simulator/simulator
>
> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.314=34.314.3&submit=Update
>
Are these plots of compile time or execution time? Both of these say, "Type: compile_time". I did not consider compile time in my analysis, and I think that is a separate issue.
> ## Some other benchmarks with zero seconds execution time are not
> contained in your list. E.g.:
>
> SingleSource/Benchmarks/Shootout/objinst
> SingleSource/Benchmarks/Shootout-C++/objinst
Interestingly, on my x86 machines this also executes for zero time, but at -O0 it takes a significant amount of time (and on PPC, even at -O3, it runs for about 0.0008s). So I think it is still useful to keep these.
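To illustrate why (this is a sketch in the spirit of objinst, not its actual source): the benchmark flips a boolean through a virtual call in a tight loop, and at -O3 the compiler can devirtualize and fold the whole loop away, while at -O0 every iteration actually executes:

#include <cstdio>

// Toggle-style micro-benchmark sketch (not the actual objinst source).
struct Toggle {
  bool state;
  explicit Toggle(bool s) : state(s) {}
  virtual ~Toggle() {}
  // Flip the state through a virtual call.
  virtual bool activate() {
    state = !state;
    return state;
  }
};

int main() {
  bool value = true;
  // At -O0 this performs 10^8 virtual calls; at -O3 the compiler can
  // devirtualize and constant-fold the loop, so the measured runtime
  // collapses to nearly zero.
  for (int i = 0; i < 100000000; ++i) {
    Toggle t(value);
    value = t.activate();
  }
  std::printf("%s\n", value ? "true" : "false");
  return 0;
}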
>
> ## Some benchmarks on your list are not really benchmarks at all:
>
> Shootout hello:
>
> #include <stdio.h>
>
> int main() {
>   puts("hello world\n");
>   return(0);
> }
>
> Shootout sumcol:
>
> /* Includes and the MAXLINELEN value are not in the original post;
>    they are filled in here (value assumed) to make it self-contained. */
> #include <cstdlib>
> #include <iostream>
> using namespace std;
>
> #define MAXLINELEN 128
>
> int main(int argc, char **argv) {
>   char line[MAXLINELEN];
>   int sum = 0;
>   char buff[4096];
>   cin.rdbuf()->pubsetbuf(buff, 4096); // enable buffering
>
>   while (cin.getline(line, MAXLINELEN)) {
>     sum += atoi(line);
>   }
>   cout << sum << '\n';
>   return 0;
> }
Indeed; these are dominated by startup and I/O time rather than computation.
>
> To sum up, I believe this list might benefit from some improvements,
> but it seems to be a really good start. If someone wants to do a more
> extensive analysis, we can always analyze the historic data available
> in my -O3 performance buildbot. It should give us a very good idea of
> how noisy certain benchmarks are.
Sounds good to me.
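As a starting point for that, a simple per-benchmark noise metric over the historic samples is the coefficient of variation (standard deviation over mean). A sketch, assuming the per-run wall-clock samples can be exported from LNT as plain numbers (the values below are made up):

#include <cmath>
#include <cstdio>
#include <vector>

// Coefficient of variation (stddev / mean) of one benchmark's historic
// samples; larger values flag noisier benchmarks.
static double cov(const std::vector<double> &samples) {
  double m = 0;
  for (double v : samples)
    m += v;
  m /= samples.size();
  double s = 0;
  for (double v : samples)
    s += (v - m) * (v - m);
  return std::sqrt(s / (samples.size() - 1)) / m;
}

int main() {
  // Ten hypothetical execution-time samples for one benchmark.
  std::vector<double> samples = {2.31, 2.29, 2.35, 2.30, 2.52,
                                 2.28, 2.33, 2.31, 2.49, 2.30};
  std::printf("CoV = %.3f\n", cov(samples));
  return 0;
}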
-Hal
>
> Cheers,
> Tobias
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory