[PATCH] Add benchmarking-only mode to the test suite

Hal Finkel hfinkel at anl.gov
Tue May 20 05:47:22 PDT 2014


----- Original Message -----
> From: "Yi Kong" <kongy.dev at gmail.com>
> To: "Tobias Grosser" <tobias at grosser.es>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
> <llvm-commits at cs.uiuc.edu>
> Sent: Tuesday, May 20, 2014 7:11:27 AM
> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
> 
> Tobias, I can't reproduce your findings on my machine. Even if I
> disabled output (removing -DPOLYBENCH_DUMP_ARRAYS) and piped to
> /dev/null, I still get lots of spikes. I think we need to exclude
> those tests until we find out how to stabilize those results.

Okay, I'll also exclude them for now. How large is the working set? Could you be seeing TLB misses?
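
For a quick back-of-envelope check, here is a sketch of the arithmetic
(it assumes the standard Polybench dataset of 1024x1024 double
matrices, so the actual numbers may differ):

    /* Working-set estimate for a Polybench gemm-style kernel. */
    #include <stdio.h>

    int main(void) {
        long n = 1024;                            /* matrix dimension (assumed) */
        long bytes = 3 * n * n * sizeof(double);  /* A, B and C matrices */
        long pages = bytes / 4096;                /* 4 KiB pages */
        printf("working set: %ld MiB in %ld pages\n", bytes >> 20, pages);
        return 0;
    }

That comes out to roughly 24 MiB spread over ~6144 pages, far more than
a typical 64-entry L1 DTLB covers, so TLB misses are at least
plausible.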

 -Hal

> 
> On 18 May 2014 12:08, Yi Kong <kongy.dev at gmail.com> wrote:
> > I think that's due to the vast amount of output it produces. Maybe
> > replacing the output with an accumulator will give a more stable
> > result?
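
A minimal sketch of this accumulator idea (hypothetical code, not the
actual Polybench source): fold the result matrix into a single value,
so the computation cannot be dead-code-eliminated but almost no output
is produced:

    /* Instead of dumping every element of C, print one checksum. */
    static double checksum(const double *C, long n) {
        double acc = 0.0;
        for (long i = 0; i < n * n; i++)
            acc += C[i];
        return acc;
    }

    /* ... then, at the end of the benchmark: */
    /* printf("checksum: %f\n", checksum(C, n)); */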
> >
> > On 17 May 2014 22:34, Tobias Grosser <tobias at grosser.es> wrote:
> >> On 17/05/2014 14:08, Yi Kong wrote:
> >>>
> >>> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
> >>>>
> >>>> ----- Original Message -----
> >>>>>
> >>>>> From: "Yi Kong" <kongy.dev at gmail.com>
> >>>>> To: "Hal Finkel" <hfinkel at anl.gov>
> >>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
> >>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
> >>>>> <tobias at grosser.es>
> >>>>> Sent: Thursday, May 15, 2014 5:41:04 PM
> >>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
> >>>>> suite
> >>>>>
> >>>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>>
> >>>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
> >>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
> >>>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
> >>>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
> >>>>>>> <tobias at grosser.es>
> >>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
> >>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
> >>>>>>> suite
> >>>>>>>
> >>>>>>> Hi Hal Finkel,
> >>>>>>>
> >>>>>>> What criteria do you use to decide which benchmarks are useful?
> >>>>>>
> >>>>>>
> >>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset
> >>>>>> of the test suite", in which I explain my methodology in detail.
> >>>>>
> >>>>>
> >>>>> I think the approach you've taken is indeed sensible. However, I
> >>>>> don't really agree with your make -j6 option. The Xeon chip you
> >>>>> are testing on only has 4 cores, which means a lot of context
> >>>>> switching happens. The
> >>>>
> >>>>
> >>>> It is a dual-socket machine.
> >>>>
> >>>>> noise produced by that would be far too great for a "normal"
> >>>>> environment. Also, I believe that the testing machine should be
> >>>>> as quiet as possible, otherwise we are actually measuring the
> >>>>> noise!
> >>>>
> >>>>
> >>>> This is obviously ideal, but rarely possible in practice. More to
> >>>> the point, the buildbots are not quiet, but we still want to be
> >>>> able to extract execution-time changes from them without a large
> >>>> number of false positives. Some tests are just too sensitive to
> >>>> I/O time, or are too short, for this to be possible (because you
> >>>> really are just seeing the noise), and this exclusion list is
> >>>> meant to exclude such tests. Given a sufficient number of samples
> >>>> (10, for example), I've confirmed that it is possible to extract
> >>>> meaningful timing differences from the others at high confidence.
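
For concreteness, a sketch of that multi-sample approach (an assumed
stand-alone harness, not the test suite's own timeit; run_benchmark()
is a hypothetical kernel under test):

    #include <stdio.h>
    #include <time.h>

    #define SAMPLES 10

    extern void run_benchmark(void);  /* hypothetical kernel under test */

    int main(void) {
        for (int i = 0; i < SAMPLES; i++) {
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            run_benchmark();
            clock_gettime(CLOCK_MONOTONIC, &b);
            double t = (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
            printf("sample %d: %.6f s\n", i, t);  /* keep every sample */
        }
        return 0;
    }

With all ten samples recorded, a real regression shows up as a shift in
the median that is large relative to the run-to-run spread.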
> >>>>
> >>>>>
> >>>>> I've been investigating the timeit tool in the test suite. It
> >>>>> turns out to be really inaccurate, and sometimes it's the main
> >>>>> source of the noise we are seeing. I've implemented timing using
> >>>>> the Linux perf tool, and so far it seems to produce much better
> >>>>> results. I will publish the findings with the patch in a separate
> >>>>> thread once I've gathered enough data points. With the more
> >>>>> accurate timing tool, we might get a different picture.
> >>>>
> >>>>
> >>>> That's great!
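
For reference, a sketch of the counter interface such a tool can sit
on (this uses the raw perf_event_open syscall directly; the perf CLI
reads the same counters, and run_benchmark() is again a hypothetical
kernel under test):

    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    extern void run_benchmark(void);  /* hypothetical kernel under test */

    int main(void) {
        struct perf_event_attr pe;
        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_SOFTWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_SW_TASK_CLOCK;  /* CPU time, in ns */
        pe.disabled = 1;

        int fd = syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
        if (fd == -1) {
            perror("perf_event_open");
            return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run_benchmark();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long ns;
        read(fd, &ns, sizeof(ns));
        printf("task-clock: %lld ns\n", ns);
        return 0;
    }

Unlike wall-clock timing, a task-clock count is not inflated by time
the process spends descheduled, which is one reason a perf-based timer
can look much quieter on a busy machine.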
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I suggest you also have a look at the standard deviation or
> >>>>>>> the MAD.
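
A sketch of the robust statistic being suggested here (illustrative
only): the MAD is the median of the absolute deviations from the
median, so a single wild spike barely moves it, unlike the standard
deviation:

    #include <math.h>
    #include <stdlib.h>

    static int cmp(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Median of v[0..n-1]; sorts v in place. */
    static double median(double *v, int n) {
        qsort(v, n, sizeof(double), cmp);
        return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
    }

    /* Median absolute deviation of v[0..n-1]. */
    static double mad(double *v, int n) {
        double med = median(v, n);
        double *dev = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            dev[i] = fabs(v[i] - med);
        double m = median(dev, n);
        free(dev);
        return m;
    }

A test whose MAD is a large fraction of its median runtime is mostly
reporting noise, which is the pattern behind the exclusions requested
later in this thread.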
> >>>>>>
> >>>>>>
> >>>>>> Of course this has already been considered and taken into
> >>>>>> account ;)
> >>>>>>
> >>>>>>> Some of the tests have really large variance that we may not
> >>>>>>> want to include when benchmarking, e.g.
> >>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a patch
> >>>>>>> which makes the tables sortable so that it is easier to
> >>>>>>> investigate.
> >>>>>>
> >>>>>>
> >>>>>> If you feel that there is a test or tests that have too large of
> >>>>>> a variance for useful benchmarking, please compose a list,
> >>>>>> explain your criteria, and we'll merge it in some useful way.
> >>>>>
> >>>>>
> >>>>> Mainly Polybench/linear-algebra, but I can't give you the list
> >>>>> right now as the LLVM LNT site is down again.
> >>>
> >>>
> >>> These 5 tests have really large MAD on various testing machines,
> >>> even with perf tools. Please add them to the exclusion list.
> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
> >>
> >>
> >> This is interesting. Those benchmarks should in fact give reliable
> >> performance numbers (and they do so when I execute them). I just
> >> very briefly looked into this, and my observation was that, if I
> >> pipe the output to a file or /dev/null, the gemm performance is
> >> always at the lower bound. Only when I run 'timeit' do I see these
> >> spikes. I see similar spikes if I just print the output to the
> >> console.
> >>
> >> It would be great if we could understand where those spikes come
> >> from.
> >>
> >> Tobias
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


