[PATCH] Add benchmarking-only mode to the test suite

Tobias Grosser tobias at grosser.es
Sat May 17 14:34:27 PDT 2014


On 17/05/2014 14:08, Yi Kong wrote:
> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
>> ----- Original Message -----
>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>> <tobias at grosser.es>
>>> Sent: Thursday, May 15, 2014 5:41:04 PM
>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>
>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
>>>> ----- Original Message -----
>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>>>> <tobias at grosser.es>
>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>>>
>>>>> Hi Hal Finkel,
>>>>>
>>>>> What criteria do you use to decide which benchmarks are useful?
>>>>
>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset of
>>>> the test suite" in which I explain my methodology in detail.
>>>
>>> I think the approach you've taken is indeed sensible. However, I don't
>>> really agree with your make -j6 option. The Xeon chip you are testing
>>> on only has 4 cores, which means a lot of context switching happens. The
>>
>> It is a dual-socket machine.
>>
>>> noise produced by that would be far too great for a "normal"
>>> environment. Also, I believe that the testing machine should be as
>>> quiet as possible; otherwise we are actually measuring the noise!
>>
>> This is obviously ideal, but rarely possible in practice. More to the point, the buildbots are not quiet, but we still want to be able to extract execution-time changes from them without a large number of false positives. Some tests are just too sensitive to I/O time, or are too short, for this to be possible (because you really are just seeing the noise), and this exclusion list is meant to exclude such tests. Given a sufficient number of samples (10, for example), I've confirmed that it is possible to extract meaningful timing differences from the others at high confidence.
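As a rough illustration of that kind of check (not what LNT actually does; it assumes scipy is available and the sample values below are made up), one could compare two sets of run-time samples with a nonparametric test:

    # Rough sketch: decide whether "before" and "after" run times differ
    # significantly, given ~10 noisy samples per side. Uses a
    # Mann-Whitney U test; the alpha threshold is illustrative only.
    from scipy.stats import mannwhitneyu

    def significant_change(before, after, alpha=0.01):
        """Return True if the two sample sets likely differ."""
        stat, p = mannwhitneyu(before, after, alternative='two-sided')
        return p < alpha

    # Made-up samples: a real ~4% shift stands out even with some noise.
    before = [2.31, 2.35, 2.29, 2.33, 2.30, 2.34, 2.32, 2.31, 2.36, 2.30]
    after  = [2.41, 2.44, 2.39, 2.42, 2.40, 2.43, 2.45, 2.41, 2.44, 2.40]
    print(significant_change(before, after))  # True
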
>>
>>>
>>> I've been investigating the timeit tool in the test suite. It turns
>>> out to be really inaccurate, and sometimes it's the main source of
>>> the noise we are seeing. I've implemented timing using the Linux perf
>>> tool instead. So far it seems to produce much better results. I will
>>> publish the findings with the patch in a separate thread once I've
>>> gathered enough data points. With the more accurate timing tool, we
>>> might get a different picture.
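For reference, a rough sketch of the idea (not the actual patch; it assumes perf is installed and './gemm' is a made-up path): wrap one benchmark run in 'perf stat' and read the task-clock value instead of relying on timeit:

    # Rough sketch, not the actual patch: time one run of a benchmark with
    # 'perf stat' in CSV mode (-x,) and pull out the task-clock value (ms).
    # The CSV field layout varies between perf versions, so this just looks
    # for the line naming the event.
    import subprocess

    def perf_task_clock_ms(cmd):
        """Run cmd under perf stat and return its task-clock time in ms."""
        result = subprocess.run(
            ['perf', 'stat', '-x', ',', '-e', 'task-clock', '--'] + cmd,
            stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
        for line in result.stderr.splitlines():
            if 'task-clock' in line:
                return float(line.split(',')[0])
        raise RuntimeError('task-clock not found in perf stat output')

    # './gemm' is a made-up path standing in for a test-suite binary.
    print(perf_task_clock_ms(['./gemm']))
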
>>
>> That's great!
>>
>>>
>>>>
>>>>>
>>>>> I suggest you also have a look at the standard deviation or
>>>>> MAD.
>>>>
>>>> Of course this has already been considered and taken into account
>>>> ;)
>>>>
>>>>> Some of the tests have a really large variance that we may not
>>>>> want to include when benchmarking, e.g.
>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a patch
>>>>> which makes the tables sortable so that it is easier to
>>>>> investigate.
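As a rough illustration of such a filter (this is not LNT code, and the 5% cutoff below is made up), one could flag tests whose MAD is a large fraction of their median run time:

    # Rough sketch: flag a test whose run-time samples have a median
    # absolute deviation (MAD) that is large relative to the median.
    # The threshold value is illustrative only.
    from statistics import median

    def mad(samples):
        """Median absolute deviation of a list of run times."""
        m = median(samples)
        return median(abs(x - m) for x in samples)

    def too_noisy(samples, threshold=0.05):
        """True if the MAD exceeds `threshold` times the median run time."""
        return mad(samples) > threshold * median(samples)

    print(too_noisy([1.00, 1.01, 0.99, 1.00, 1.00, 1.01]))  # False: stable
    print(too_noisy([1.00, 1.40, 0.99, 1.42, 1.38, 1.01]))  # True: noisy
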
>>>>
>>>> If you feel that there is a test or tests that have too large of a
>>>> variance for useful benchmarking, please compose a list, explain
>>>> your criteria, and we'll merge them in some useful way.
>>>
>>> Mainly Polybench/linear-algebra, but I can't give you the list right
>>> now as the LLVM LNT site is down again.
>
> These 5 tests have a really large MAD on various testing machines, even
> with the perf tool. Please add them to the exclusion list.
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm

This is interesting. Those benchmarks should in fact give reliable
performance numbers (and they do when I execute them). I only looked
into this very briefly, and my observation was that if I pipe the
output to a file or /dev/null, the gemm run time is always at the lower
bound. Only when I run through 'timeit' do I see these spikes. I see
similar spikes if I just print the output to the console.
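
A rough sketch of how to reproduce the comparison ('./gemm' stands in
for the actual test-suite binary):

    # Rough sketch: compare wall-clock time of the same binary with stdout
    # discarded versus captured through a pipe.
    import subprocess, time

    def timed_run(cmd, stdout):
        """Return the wall-clock time of one run, stdout sent to `stdout`."""
        start = time.perf_counter()
        subprocess.run(cmd, stdout=stdout, check=True)
        return time.perf_counter() - start

    for _ in range(5):
        t_null = timed_run(['./gemm'], subprocess.DEVNULL)  # discarded
        t_pipe = timed_run(['./gemm'], subprocess.PIPE)     # captured
        print('devnull %.3fs   pipe %.3fs' % (t_null, t_pipe))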

It would be great if we could understand where those spikes come from.

Tobias


