[PATCH] Add benchmarking-only mode to the test suite

Yi Kong kongy.dev at gmail.com
Tue May 20 05:11:27 PDT 2014


Tobias, I can't reproduce your findings on my machine. Even when I
disable output (removing -DPOLYBENCH_DUMP_ARRAYS) and pipe to
/dev/null, I still get lots of spikes. I think we need to exclude
those tests until we find out how to stabilize their results.
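
For reference, the dump is guarded at compile time, roughly as in this
sketch (names and shape are illustrative; the real Polybench harness
wraps this in its own print_array and DCE-prevention macros):

  #include <stdio.h>

  #define N 4

  int main(void) {
    double C[N][N];
    for (int i = 0; i < N; i++)        /* stand-in for the kernel */
      for (int j = 0; j < N; j++)
        C[i][j] = i * N + j;

  #ifdef POLYBENCH_DUMP_ARRAYS
    /* Compiled in only with -DPOLYBENCH_DUMP_ARRAYS: one stderr
       write per element, which is the I/O being discussed here. */
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        fprintf(stderr, "%0.2f ", C[i][j]);
    fputc('\n', stderr);
  #endif
    /* Without the flag the result is unused, which is why the real
       harness needs its DCE-prevention macro. */
    return 0;
  }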

On 18 May 2014 12:08, Yi Kong <kongy.dev at gmail.com> wrote:
> I think that's due to the vast amount of output it produces. Maybe
> replacing the output with an accumulator will give a more stable
> result?
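>
> Roughly what I have in mind (a sketch only; the kernel here is just a
> stand-in):
>
>   #include <stdio.h>
>
>   #define N 4
>
>   int main(void) {
>     double C[N][N], acc = 0.0;
>     for (int i = 0; i < N; i++)      /* stand-in for the kernel */
>       for (int j = 0; j < N; j++)
>         C[i][j] = i * N + j;
>
>     for (int i = 0; i < N; i++)      /* fold the result into one value */
>       for (int j = 0; j < N; j++)
>         acc += C[i][j];
>
>     fprintf(stderr, "%0.6f\n", acc); /* one write instead of N*N */
>     return 0;
>   }
>
> The single write would also stop the compiler from dead-code-eliminating
> the kernel.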
>
> On 17 May 2014 22:34, Tobias Grosser <tobias at grosser.es> wrote:
>> On 17/05/2014 14:08, Yi Kong wrote:
>>>
>>> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>> ----- Original Message -----
>>>>>
>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>>>> <tobias at grosser.es>
>>>>> Sent: Thursday, May 15, 2014 5:41:04 PM
>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>>>
>>>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>>
>>>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>>>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>>>>>>> <tobias at grosser.es>
>>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
>>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>>>>>>
>>>>>>> Hi Hal Finkel,
>>>>>>>
>>>>>>> What criteria do you use to decide which benchmarks are useful?
>>>>>>
>>>>>>
>>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset of
>>>>>> the test suite", in which I explain my methodology in detail.
>>>>>
>>>>>
>>>>> I think the approach you've taken is indeed sensible. However, I don't
>>>>> really agree with your make -j6 option. The Xeon chip you are testing
>>>>> on only has 4 cores, which means a lot of context switching happens. The
>>>>
>>>>
>>>> It is a dual-socket machine.
>>>>
>>>>> noise produced by that would be far too great for a "normal"
>>>>> environment. Also, I believe that the testing machine should be as
>>>>> quiet as possible; otherwise we are actually measuring the noise!
>>>>
>>>>
>>>> This is obviously ideal, but rarely possible in practice. More to the
>>>> point, the buildbots are not quiet, but we still want to be able to extract
>>>> execution-time changes from them without a large number of false positives.
>>>> Some tests are just too sensitive to I/O time, or are too short, for this to
>>>> be possible (because you really are just seeing the noise), and this
>>>> exclusion list is meant to exclude such tests. Given a sufficient number of
>>>> samples (10, for example), I've confirmed that it is possible to extract
>>>> meaningful timing differences from the others at high confidence.
>>>>
>>>>>
>>>>> I've been investigating the timeit tool in the test suite. It turns
>>>>> out to be really inaccurate, and sometimes it's the main source of the
>>>>> noise we are seeing. I've implemented timing using the Linux perf tool,
>>>>> and so far it seems to produce much better results. I will publish the
>>>>> findings, with the patch, in a separate thread once I've gathered
>>>>> enough data points. With the more accurate timing tool, we may well
>>>>> get a different picture.
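>>>>>
>>>>> To give an idea of the direction (an illustrative sketch only, not
>>>>> the actual patch; a real harness may drive perf(1) itself rather
>>>>> than the raw syscall), the measurement boils down to:
>>>>>
>>>>>   #define _GNU_SOURCE
>>>>>   #include <linux/perf_event.h>
>>>>>   #include <sys/ioctl.h>
>>>>>   #include <sys/syscall.h>
>>>>>   #include <stdio.h>
>>>>>   #include <string.h>
>>>>>   #include <unistd.h>
>>>>>
>>>>>   int main(void) {
>>>>>     struct perf_event_attr pe;
>>>>>     memset(&pe, 0, sizeof(pe));
>>>>>     pe.type = PERF_TYPE_SOFTWARE;
>>>>>     pe.size = sizeof(pe);
>>>>>     pe.config = PERF_COUNT_SW_TASK_CLOCK; /* CPU ns, not wall time */
>>>>>     pe.disabled = 1;
>>>>>
>>>>>     /* perf_event_open(2) has no glibc wrapper. */
>>>>>     int fd = syscall(SYS_perf_event_open, &pe, 0 /* self */,
>>>>>                      -1 /* any CPU */, -1 /* no group */, 0);
>>>>>     if (fd < 0) { perror("perf_event_open"); return 1; }
>>>>>
>>>>>     ioctl(fd, PERF_EVENT_IOC_RESET, 0);
>>>>>     ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
>>>>>     /* ... run the code under test here ... */
>>>>>     ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
>>>>>
>>>>>     long long ns = 0;
>>>>>     if (read(fd, &ns, sizeof(ns)) != sizeof(ns)) return 1;
>>>>>     printf("task-clock: %lld ns\n", ns);
>>>>>     close(fd);
>>>>>     return 0;
>>>>>   }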
>>>>
>>>>
>>>> That's great!
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I suggest you also have a look at the standard deviation or the
>>>>>>> MAD.
>>>>>>
>>>>>>
>>>>>> Of course this has already been considered and taken into account
>>>>>> ;)
>>>>>>
>>>>>>> Some of the tests have really large variance that we may not want to
>>>>>>> include when benchmarking, e.g.
>>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a patch which
>>>>>>> makes the tables sortable, so that this is easier to investigate.
>>>>>>
>>>>>>
>>>>>> If you feel that there are tests that have too large a variance for
>>>>>> useful benchmarking, please compose a list, explain your criteria,
>>>>>> and we'll merge it in some useful way.
>>>>>
>>>>>
>>>>> Mainly Polybench/linear-algebra, but I can't give you the list right
>>>>> now, as the LLVM LNT site is down again.
>>>
>>>
>>> These tests have a really large MAD on various testing machines, even
>>> with perf tools (a sketch of the MAD computation follows the list).
>>> Please add them to the exclusion list.
>>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
>>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
>>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
>>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
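>>>
>>> (For reference, by MAD I mean the usual median absolute deviation.
>>> A rough sketch of the computation; the run times and the exclusion
>>> threshold below are made up:)
>>>
>>>   #include <math.h>
>>>   #include <stdio.h>
>>>   #include <stdlib.h>
>>>
>>>   static int cmp(const void *a, const void *b) {
>>>     double d = *(const double *)a - *(const double *)b;
>>>     return (d > 0) - (d < 0);
>>>   }
>>>
>>>   static double median(double *v, int n) { /* sorts v in place */
>>>     qsort(v, n, sizeof(*v), cmp);
>>>     return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
>>>   }
>>>
>>>   int main(void) {
>>>     /* Ten made-up run times (seconds), one of them a spike. */
>>>     double t[10] = {1.02, 1.01, 1.03, 1.02, 1.55,
>>>                     1.01, 1.02, 1.04, 1.02, 1.03};
>>>     int n = 10;
>>>     double dev[10], m = median(t, n);
>>>     for (int i = 0; i < n; i++)
>>>       dev[i] = fabs(t[i] - m);         /* |x_i - median| */
>>>     double mad = median(dev, n);       /* MAD = median of those */
>>>     printf("median = %0.3f s, MAD = %0.3f s\n", m, mad);
>>>     /* One possible exclusion rule: drop a test whose MAD exceeds,
>>>        say, 5% of its median run time. */
>>>     return 0;
>>>   }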
>>
>>
>> This is interesting. Those benchmarks should in fact give reliable
>> performance numbers (and they do when I execute them). I just very
>> briefly looked into this, and my observation was that, if I pipe the output
>> to a file or /dev/null, the gemm performance is always at the lower bound.
>> Only when I run 'timeit' do I see these spikes. I see similar spikes if I
>> just print the output to the console.
>>
>> It would be great if we could understand where those spikes come from.
>>
>> Tobias


