[PATCH] Add benchmarking-only mode to the test suite

Yi Kong kongy.dev at gmail.com
Tue May 20 06:54:32 PDT 2014


It's a really strange test case. The spikes disappear when I measure
cache/TLB misses using perf.
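
For reference, here is a minimal, self-contained sketch of counting the same
two events (cache misses and dTLB load misses) in-process via the
perf_event_open(2) syscall, which is the interface the perf tool uses
underneath. The 32 MiB array and the summation loop are a made-up stand-in
for a benchmark kernel; error handling is omitted and this is not part of
the patch under review:

  /* Hypothetical, minimal example; not part of the patch. */
  #include <linux/perf_event.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdint.h>
  #include <stdio.h>

  static int open_counter(uint32_t type, uint64_t config)
  {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = type;
      attr.config = config;
      attr.disabled = 1;
      attr.exclude_kernel = 1;   /* keeps it usable without privileges */
      /* pid 0 = this process, cpu -1 = any CPU, no group, no flags */
      return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
  }

  static double a[1 << 22];      /* 32 MiB: large enough to miss in cache/TLB */

  int main(void)
  {
      int cache_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
      int dtlb_fd  = open_counter(PERF_TYPE_HW_CACHE,
                                  PERF_COUNT_HW_CACHE_DTLB |
                                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));

      ioctl(cache_fd, PERF_EVENT_IOC_ENABLE, 0);
      ioctl(dtlb_fd,  PERF_EVENT_IOC_ENABLE, 0);

      double sum = 0.0;          /* toy workload standing in for a kernel */
      for (size_t i = 0; i < sizeof(a) / sizeof(a[0]); ++i)
          sum += a[i];

      ioctl(cache_fd, PERF_EVENT_IOC_DISABLE, 0);
      ioctl(dtlb_fd,  PERF_EVENT_IOC_DISABLE, 0);

      uint64_t cache_misses = 0, dtlb_misses = 0;
      read(cache_fd, &cache_misses, sizeof(cache_misses));
      read(dtlb_fd,  &dtlb_misses,  sizeof(dtlb_misses));
      printf("sum=%f cache-misses=%llu dTLB-load-misses=%llu\n", sum,
             (unsigned long long)cache_misses, (unsigned long long)dtlb_misses);
      return 0;
  }

Counting from inside the process like this avoids attributing the wrapper's
own startup cost to the benchmark, which may be one reason the spikes do not
show up when measuring with perf.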

On 20 May 2014 13:47, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Yi Kong" <kongy.dev at gmail.com>
>> To: "Tobias Grosser" <tobias at grosser.es>
>> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>> <llvm-commits at cs.uiuc.edu>
>> Sent: Tuesday, May 20, 2014 7:11:27 AM
>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
>>
>> Tobias, I can't reproduce your findings on my machine. Even if I
>> disable output (removing -DPOLYBENCH_DUMP_ARRAYS) and pipe to
>> /dev/null, I still get lots of spikes. I think we need to exclude
>> those tests until we find out how to stabilize those results.
>
> Okay, I'll also exclude them for now. How large is the working set? Could you be seeing TLB misses?
>
>  -Hal
>
>>
>> On 18 May 2014 12:08, Yi Kong <kongy.dev at gmail.com> wrote:
>> > I think that's due to the vast amount of output it produces. Maybe
>> > replacing the output with an accumulator will give a more stable
>> > result?
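
For illustration, a minimal sketch of that idea (a toy stand-in, not actual
Polybench code; POLYBENCH_DUMP_ARRAYS here only mimics the role the real
macro plays):

  /* Toy stand-in for a Polybench-style benchmark; illustration only. */
  #include <stdio.h>

  #define N 1024
  static double C[N][N];

  /* Stand-in for the real computational kernel. */
  static void kernel(void)
  {
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
              C[i][j] = (double)i * j;
  }

  int main(void)
  {
      kernel();
  #ifdef POLYBENCH_DUMP_ARRAYS
      /* Current behaviour: dump the whole result array (lots of I/O). */
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
              fprintf(stderr, "%0.2lf ", C[i][j]);
  #else
      /* Suggested behaviour: fold the result into a single checksum so the
       * kernel cannot be dead-code-eliminated, but output stays tiny.    */
      double acc = 0.0;
      for (int i = 0; i < N; ++i)
          for (int j = 0; j < N; ++j)
              acc += C[i][j];
      fprintf(stderr, "checksum = %0.6lf\n", acc);
  #endif
      return 0;
  }

The checksum keeps the computation observable, so the compiler cannot drop
the kernel, while shrinking the output to a single line.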
>> >
>> > On 17 May 2014 22:34, Tobias Grosser <tobias at grosser.es> wrote:
>> >> On 17/05/2014 14:08, Yi Kong wrote:
>> >>>
>> >>> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
>> >>>>
>> >>>> ----- Original Message -----
>> >>>>>
>> >>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>> >>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>> >>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>> >>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>> >>>>> <tobias at grosser.es>
>> >>>>> Sent: Thursday, May 15, 2014 5:41:04 PM
>> >>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
>> >>>>> suite
>> >>>>>
>> >>>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
>> >>>>>>
>> >>>>>> ----- Original Message -----
>> >>>>>>>
>> >>>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
>> >>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
>> >>>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
>> >>>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
>> >>>>>>> <tobias at grosser.es>
>> >>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
>> >>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
>> >>>>>>> suite
>> >>>>>>>
>> >>>>>>> Hi Hal Finkel,
>> >>>>>>>
>> >>>>>>> What criteria do you use to decide which benchmarks are useful?
>> >>>>>>
>> >>>>>>
>> >>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset
>> >>>>>> of the test suite", in which I explain my methodology in detail.
>> >>>>>
>> >>>>>
>> >>>>> I think the approach you've taken is indeed sensible. However, I
>> >>>>> don't really agree with your make -j6 option. The Xeon chip you
>> >>>>> are testing on only has 4 cores, which means a lot of context
>> >>>>> switching happens. The
>> >>>>
>> >>>>
>> >>>> It is a dual-socket machine.
>> >>>>
>> >>>>> noise produced by that would be far too great for a "normal"
>> >>>>> environment. Also, I believe that the testing machine should be
>> >>>>> as quiet as possible, otherwise we are actually measuring the
>> >>>>> noise!
>> >>>>
>> >>>>
>> >>>> This is obviously ideal, but rarely possible in practice. More to
>> >>>> the point, the buildbots are not quiet, but we still want to be
>> >>>> able to extract execution-time changes from them without a large
>> >>>> number of false positives. Some tests are just too sensitive to
>> >>>> I/O time, or are too short, for this to be possible (because you
>> >>>> really are just seeing the noise), and this exclusion list is
>> >>>> meant to exclude such tests. Given a sufficient number of samples
>> >>>> (10, for example), I've confirmed that it is possible to extract
>> >>>> meaningful timing differences from the others at high confidence.
>> >>>>
>> >>>>>
>> >>>>> I've been investigating the timeit tool in the test suite. It
>> >>>>> turns out to be really inaccurate, and sometimes it's the main
>> >>>>> source of the noise we are seeing. I've implemented time
>> >>>>> measurement using the Linux perf tool, and so far it seems to
>> >>>>> produce much better results. I will publish the findings with
>> >>>>> the patch in a separate thread once I've gathered enough data
>> >>>>> points. With the more accurate timing tool, we might get a
>> >>>>> different picture.
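
To make the idea concrete (this is not the patch being described, only a
hypothetical sketch of what a timeit-style wrapper built on perf could look
like): the child's CPU time can be read through perf_event_open(2), which is
roughly what `perf stat -e task-clock -- <cmd>` reports. All names below are
illustrative and error handling is omitted:

  /* Hypothetical timeit-style wrapper; illustration only.
   * Run as:  ./ptime <command> [args...]                  */
  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
          return 1;
      }

      int pipefd[2];
      pipe(pipefd);

      pid_t child = fork();
      if (child == 0) {
          char go;
          close(pipefd[1]);
          read(pipefd[0], &go, 1);   /* block until the counter is attached */
          close(pipefd[0]);
          execvp(argv[1], &argv[1]);
          _exit(127);
      }
      close(pipefd[0]);

      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = PERF_TYPE_SOFTWARE;
      attr.config = PERF_COUNT_SW_TASK_CLOCK; /* CPU time of the task, in ns */
      attr.inherit = 1;                       /* include the child's threads */
      attr.exclude_kernel = 1;                /* may be required by paranoid settings */
      int fd = (int)syscall(__NR_perf_event_open, &attr, child, -1, -1, 0);

      write(pipefd[1], "x", 1);               /* let the child exec */
      close(pipefd[1]);

      int status;
      waitpid(child, &status, 0);

      uint64_t task_clock_ns = 0;
      read(fd, &task_clock_ns, sizeof(task_clock_ns));
      printf("task-clock: %.6f s (exit status %d)\n",
             task_clock_ns / 1e9, WEXITSTATUS(status));
      return 0;
  }

Measuring the counter attached to the child, rather than wall-clock time
around a shell invocation, excludes much of the wrapper and scheduler noise.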
>> >>>>
>> >>>>
>> >>>> That's great!
>> >>>>
>> >>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> I suggest you also have a look at the standard deviation or
>> >>>>>>> the MAD (median absolute deviation).
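
For reference, the MAD of a set of run times can be computed as below; the
sample values are made up and this is only an illustration, not code from
the test suite:

  /* Illustration only: the MAD of a set of run times.  Build with -lm. */
  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>

  static int cmp_double(const void *a, const void *b)
  {
      double x = *(const double *)a, y = *(const double *)b;
      return (x > y) - (x < y);
  }

  /* Median of v[0..n-1]; sorts v in place. */
  static double median(double *v, size_t n)
  {
      qsort(v, n, sizeof(*v), cmp_double);
      return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
  }

  /* MAD = median of the absolute deviations from the median. */
  static double mad(double *samples, size_t n)
  {
      double m = median(samples, n);
      double *dev = malloc(n * sizeof(*dev));
      for (size_t i = 0; i < n; ++i)
          dev[i] = fabs(samples[i] - m);
      double result = median(dev, n);
      free(dev);
      return result;
  }

  int main(void)
  {
      /* Made-up run times (seconds) for one test across 10 samples; the
       * 1.41 outlier barely moves the MAD but inflates the stddev.      */
      double t[] = {1.02, 1.01, 1.03, 1.02, 1.41, 1.02, 1.01, 1.04, 1.02, 1.03};
      printf("MAD = %.4f s\n", mad(t, sizeof(t) / sizeof(t[0])));
      return 0;
  }

Unlike the standard deviation, the MAD stays small in the presence of a few
spiky samples, which makes it useful for spotting tests whose typical run is
stable but which occasionally blow up.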
>> >>>>>>
>> >>>>>>
>> >>>>>> Of course this has already been considered and taken into
>> >>>>>> account ;)
>> >>>>>>
>> >>>>>>> Some of the tests have such large variance that we may not
>> >>>>>>> want to include them when benchmarking, e.g.
>> >>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a
>> >>>>>>> patch which makes the tables sortable so that it is easier to
>> >>>>>>> investigate.
>> >>>>>>
>> >>>>>>
>> >>>>>> If you feel that there is a test or tests that have too large
>> >>>>>> of a variance for useful benchmarking, please compose a list,
>> >>>>>> explain your criteria, and we'll merge it in some useful way.
>> >>>>>
>> >>>>>
>> >>>>> Mainly Polybench/linear-algebra, but I can't give you the list
>> >>>>> right now as the LLVM LNT site is down again.
>> >>>
>> >>>
>> >>> These tests have really large MAD on various testing machines,
>> >>> even with the perf tool. Please add them to the exclusion list.
>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
>> >>
>> >>
>> >> This is interesting. Those benchmarks should in fact give reliable
>> >> performance numbers (and they do so when I execute them). I very
>> >> briefly looked into this, and my observation was that if I pipe the
>> >> output to a file or /dev/null, the gemm performance is always at
>> >> the lower bound. Only when I run 'timeit' do I see these spikes. I
>> >> see similar spikes if I just print the output to the console.
>> >>
>> >> It would be great if we could understand where those spikes come
>> >> from.
>> >>
>> >> Tobias
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory


