[LLVMdev] [RFC] Benchmarking subset of the test suite

Tobias Grosser tobias at grosser.es
Sun May 4 14:10:36 PDT 2014


On 04/05/2014 23:01, Hal Finkel wrote:
> ----- Original Message -----
>> From: "Tobias Grosser" <tobias at grosser.es>
>> To: "Hal Finkel" <hfinkel at anl.gov>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
>> Sent: Sunday, May 4, 2014 1:40:52 PM
>> Subject: Re: [LLVMdev] [RFC] Benchmarking subset of the test suite
>>
>> On 04/05/2014 14:39, Hal Finkel wrote:
>>> At the LLVM Developers' Meeting in November, I promised to work on
>>> isolating a subset of the current test suite that is useful for
>>> benchmarking. Having looked at this in more detail, most of the
>>> applications and benchmarks in the test suite are useful for
>>> benchmarking, and so I think that a better way of phrasing it is
>>> that we should construct a list of programs in the test suite that
>>> are not useful for benchmarking.
>>>
>>> My proposed exclusion list is provided below. I constructed this
>>> exclusion list primarily based on the following experiment: I ran
>>> the test suite 10 times in three configurations: 1) On an IBM
>>> POWER7 (P7) with -O3 -mvsx, 2) on a P7 at -O0, and 3) on an Intel
>>> Xeon E5430 with -O3, all using make -j6. I then used the ministat
>>> utility (which performs a T test) to compare the timings of the
>>> two P7 configurations against each other and the Xeon
>>> configuration, requiring a detectable difference at 99.5%
>>> confidence. I looked for tests that showed no significant
>>> difference in all three comparisons. The running configuration
>>> here is purposefully noisy; the idea is to eliminate those tests
>>> that are significantly sensitive to startup time, file I/O time,
>>> memory bandwidth, etc., or just too short, and by running many
>>> tests in parallel (non-deterministically), my hope is to eliminate
>>> those tests that cannot usefully serve as benchmarks in a "normal"
>>> environment.
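
For illustration, here is a minimal sketch of the filtering step described
above, assuming the ten wall-clock samples per configuration for a single
test are already available as plain Python lists. The helper names are
hypothetical, and scipy's Welch t test merely stands in for ministat's
Student t test:

    # Sketch only: mimic a ministat-style comparison at 99.5% confidence.
    from scipy import stats

    ALPHA = 0.005  # 1 - 0.995

    def differs(samples_a, samples_b, alpha=ALPHA):
        """True if the two sets of timing samples differ significantly."""
        _, p = stats.ttest_ind(samples_a, samples_b, equal_var=False)
        return p < alpha

    def exclusion_candidate(p7_o3, p7_o0, xeon_o3):
        """A test is a candidate for exclusion when none of the three
        pairwise comparisons shows a significant difference."""
        pairs = [(p7_o3, p7_o0), (p7_o3, xeon_o3), (p7_o0, xeon_o3)]
        return not any(differs(a, b) for a, b in pairs)

A test that shows no significant difference in any of the three pairwise
comparisons then becomes a candidate for the exclusion list.
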
>>>
>>> I'll admit being somewhat surprised by so many of the Prolangs and
>>> Shootout "benchmarks" seemingly not serving as useful benchmarks;
>>> perhaps someone can look into improving the problem size, etc. of
>>> these.
>>>
>>> Without further ado, I propose that a test-suite configuration
>>> designed for benchmarking exclude the following:
>>
>> Hi Hal,
>>
>> thanks for putting in the effort! I think the systematic approach
>> you have taken is very sensible.
>>
>> I went through your list and looked at a couple of interesting cases.
>
> Thanks! -- I figured you'd have something to add to this endeavor ;)
>
>> For the shootout benchmarks I looked at the results and the history
>> my
>> LNT -O3 builder shows (long history, always 10 samples per run,
>> http://llvm.org/perf/db_default/v4/nts/25326)
>>
>> Some observations from my side:
>>
>> ## Many benchmarks from your list have a runtime of zero seconds
>> reported in my tester
>
> This is true for my data as well.
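
As a rough illustration, such zero-runtime tests could be flagged up front
with something like the following, assuming the execution-time samples have
been extracted into a plain mapping from test name to samples (the helper
name and data layout are hypothetical, not LNT's actual report schema):

    # Sketch only: flag tests whose reported execution time never rises
    # above a small threshold, given {test_name: [seconds, ...]}.
    def zero_runtime_tests(timings, eps=0.01):
        return sorted(name for name, samples in timings.items()
                      if samples and max(samples) < eps)
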
>
>>
>> ## For some of the benchmarks you propose, manually looking at the
>>     historic samples allows a human to spot certain trends:
>>
>>   > MultiSource/Benchmarks/Prolangs-C/football/football
>>
>> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.237=34.237.3&submit=Update
>>
>>   > MultiSource/Benchmarks/Prolangs-C/simulator/simulator
>>
>> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.314=34.314.3&submit=Update
>>
>
> Are these plots of compile time or execution time? Both of these say, "Type: compile_time". I did not consider compile time in my analysis, and I think that is a separate issue.

Good catch, I got it wrong. They also have an execution time of zero 
seconds, so they can probably be easily removed as well.

Tobias
