[PATCH] Add benchmarking-only mode to the test suite
Yi Kong
kongy.dev at gmail.com
Sun May 18 04:08:01 PDT 2014
I think that's due to the vast amount of output it produces. Maybe
replacing the output with an accumulator would give a more stable
result?
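
Something along these lines, as a rough sketch of what I mean for a
Polybench-style kernel (the array, its size and the checksum name are
placeholders, not actual suite code):

#include <stdio.h>

#define N 128

static double A[N][N];

int main(void) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      A[i][j] = (double)(i * j) / N;

  /* Instead of printf'ing every element, which makes the run
     I/O-bound and noisy, fold the results into an accumulator
     and print a single checksum at the end. */
  double sum = 0.0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      sum += A[i][j];
  printf("checksum: %.6f\n", sum);
  return 0;
}

The checksum still catches miscompiles, but the run no longer spends
most of its time in write().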
On 17 May 2014 22:34, Tobias Grosser <tobias at grosser.es> wrote:
> On 17/05/2014 14:08, Yi Kong wrote:
>>
>> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>> ----- Original Message -----
>>>>
>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>>> <tobias at grosser.es>
>>>> Sent: Thursday, May 15, 2014 5:41:04 PM
>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>>
>>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>>
>>>>> ----- Original Message -----
>>>>>>
>>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>>>>> <tobias at grosser.es>
>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>>>>
>>>>>> Hi Hal Finkel,
>>>>>>
>>>>>> What criteria do you use to decide which benchmarks are useful?
>>>>>
>>>>>
>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset of
>>>>> the test suite", in which I explain my methodology in detail.
>>>>
>>>>
>>>> I think the approach you've taken is indeed sensible. However, I don't
>>>> really agree with your make -j6 option. The Xeon chip you are testing
>>>> on only has four cores, which means a lot of context switching happens. The
>>>
>>>
>>> It is a dual-socket machine.
>>>
>>>> noise produced by that would be far too great to be representative of a
>>>> "normal" environment. Also, I believe the testing machine should be as
>>>> quiet as possible; otherwise we are actually measuring the noise!
>>>
>>>
>>> This is obviously ideal, but rarely possible in practice. More to the
>>> point, the buildbots are not quiet, but we still want to be able to extract
>>> execution-time changes from them without a large number of false positives.
>>> Some tests are just too sensitive to I/O time, or are too short, for this to
>>> be possible (because you really are just seeing the noise), and this
>>> exclusion list is meant to exclude such tests. Given a sufficient number of
>>> samples (10, for example), I've confirmed that it is possible to extract
>>> meaningful timing differences from the others at high confidence.
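>>>
>>> As a rough sketch of the kind of comparison I mean (the numbers and the
>>> threshold are illustrative, not what the bots actually run):
>>>
>>> #include <math.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> #define SAMPLES 10
>>>
>>> static int cmp_double(const void *a, const void *b) {
>>>   double x = *(const double *)a, y = *(const double *)b;
>>>   return (x > y) - (x < y);
>>> }
>>>
>>> /* Median of n values; sorts the array in place. */
>>> static double median(double *v, int n) {
>>>   qsort(v, n, sizeof(double), cmp_double);
>>>   return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
>>> }
>>>
>>> /* Median absolute deviation: a spread estimate that a few
>>>    outlier samples cannot drag around. */
>>> static double mad(const double *v, int n, double med) {
>>>   double dev[SAMPLES];
>>>   for (int i = 0; i < n; i++)
>>>     dev[i] = fabs(v[i] - med);
>>>   return median(dev, n);
>>> }
>>>
>>> int main(void) {
>>>   /* Ten execution-time samples (seconds) before and after a
>>>      change; the one outlier per set stands in for bot noise. */
>>>   double before[SAMPLES] = {1.02, 1.01, 1.03, 1.02, 1.31,
>>>                             1.01, 1.02, 1.03, 1.01, 1.02};
>>>   double after[SAMPLES]  = {0.97, 0.96, 0.98, 0.97, 1.24,
>>>                             0.96, 0.97, 0.98, 0.96, 0.97};
>>>
>>>   double mb = median(before, SAMPLES), ma = median(after, SAMPLES);
>>>   double spread = mad(before, SAMPLES, mb) + mad(after, SAMPLES, ma);
>>>
>>>   /* Flag a change only when the medians move by clearly more
>>>      than the combined sample spread. */
>>>   if (fabs(mb - ma) > 2.0 * spread)
>>>     printf("significant: %.3fs -> %.3fs\n", mb, ma);
>>>   else
>>>     printf("within noise: %.3fs vs %.3fs\n", mb, ma);
>>>   return 0;
>>> }
>>>
>>> The medians and MADs shrug off the outliers, which is what lets noisy
>>> samples coexist with a real, detectable change.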
>>>
>>>>
>>>> I've been investigating the timeit tool in the test suite. It turns out
>>>> to be really inaccurate, and sometimes it's the main source of the noise
>>>> we are seeing. I've implemented time measurement using the Linux perf
>>>> tool. So far it seems to produce much better results. I will publish the
>>>> findings with the patch in a separate thread once I've gathered enough
>>>> data points. Maybe with the more accurate timing tool, we'll get a
>>>> different picture.
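>>>>
>>>> The underlying mechanism looks roughly like this (a minimal sketch of
>>>> reading the software task-clock counter via perf_event_open(2); the
>>>> actual patch is a separate change, this is only to illustrate the idea):
>>>>
>>>> #include <linux/perf_event.h>
>>>> #include <stdint.h>
>>>> #include <stdio.h>
>>>> #include <string.h>
>>>> #include <sys/ioctl.h>
>>>> #include <sys/syscall.h>
>>>> #include <unistd.h>
>>>>
>>>> int main(void) {
>>>>   struct perf_event_attr attr;
>>>>   memset(&attr, 0, sizeof(attr));
>>>>   attr.type = PERF_TYPE_SOFTWARE;
>>>>   attr.size = sizeof(attr);
>>>>   attr.config = PERF_COUNT_SW_TASK_CLOCK; /* ns spent on-CPU */
>>>>   attr.disabled = 1;                      /* start stopped */
>>>>
>>>>   /* Count for this process, on any CPU. */
>>>>   int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
>>>>   if (fd < 0) { perror("perf_event_open"); return 1; }
>>>>
>>>>   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
>>>>   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
>>>>
>>>>   /* A dummy loop stands in for the benchmark being timed. */
>>>>   volatile double x = 0.0;
>>>>   for (long i = 0; i < 50000000; i++)
>>>>     x += i * 0.5;
>>>>
>>>>   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
>>>>   uint64_t ns = 0;
>>>>   if (read(fd, &ns, sizeof(ns)) != sizeof(ns)) return 1;
>>>>   printf("task-clock: %.6f s\n", ns / 1e9);
>>>>   close(fd);
>>>>   return 0;
>>>> }
>>>>
>>>> Unlike wall-clock timing, the task-clock counter doesn't charge the
>>>> benchmark for time it spends scheduled out, which is exactly the
>>>> noise a loaded machine adds.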
>>>
>>>
>>> That's great!
>>>
>>>>
>>>>>
>>>>>>
>>>>>> I suggest you also have a look at the standard deviation or the
>>>>>> median absolute deviation (MAD).
>>>>>
>>>>>
>>>>> Of course this has already been considered and taken into account
>>>>> ;)
>>>>>
>>>>>> Some of the tests have really large variance, which we may not want
>>>>>> to include when benchmarking, e.g.
>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a patch that
>>>>>> makes the tables sortable so that this is easier to investigate.
>>>>>
>>>>>
>>>>> If you feel that there is a test or tests with too large a variance
>>>>> for useful benchmarking, please compose a list, explain your
>>>>> criteria, and we'll merge it in some useful way.
>>>>
>>>>
>>>> Mainly Polybench/linear-algebra, but I can't give you the list right
>>>> now, as the LLVM LNT site is down again.
>>
>>
>> These 5 tests have really large MAD on various testing machines, even
>> with the perf tool. Please add them to the exclusion list.
>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
>
>
> This is interesting. Those benchmarks should in fact give reliable
> performance numbers (and they do so when I execute them). I just very
> briefly looked into this and my observation was that, if I pipe the output
> to a file or /dev/null, the gemm performance is always at the lower bound.
> Only when I run 'timeit' do I see these spikes. I see similar spikes if I
> just print the output to the console.
>
> It would be great if we could understand where those spikes come from.
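>
> One quick way to start narrowing it down (a throwaway demo I'd use,
> nothing from the suite's harness): time nothing but the printing, then
> run it once on the console and once redirected to /dev/null:
>
> #include <stdio.h>
> #include <time.h>
>
> int main(void) {
>   struct timespec t0, t1;
>   clock_gettime(CLOCK_MONOTONIC, &t0);
>
>   /* Emit roughly as much output as a Polybench result dump:
>      a 1024x1024 matrix of doubles. */
>   for (int i = 0; i < 1024; i++) {
>     for (int j = 0; j < 1024; j++)
>       printf("%0.2lf ", (double)(i + j));
>     putchar('\n');
>   }
>
>   clock_gettime(CLOCK_MONOTONIC, &t1);
>   fprintf(stderr, "output took %.3f s\n",
>           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
>   return 0;
> }
>
> If the /dev/null runs stay flat while the console runs jump around, the
> spikes come from the output path rather than from the computation.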
>
> Tobias