[PATCH] Add benchmarking-only mode to the test suite

James Molloy james at jamesmolloy.co.uk
Tue May 20 07:12:58 PDT 2014


Hi,

Out of interest, given the major churn going on in the test-suite at the
moment, is now the right time to discuss how best to replace the utterly
archaic and incomprehensible makefile system?

Cheers,

James


On 20 May 2014 14:54, Yi Kong <kongy.dev at gmail.com> wrote:

> It's a really strange test case. The spikes disappear when I measure
> cache/TLB misses using perf.
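>
> (For reference, the kind of measurement meant here -- a sketch, assuming a
> Linux perf build that exposes the generic hardware events -- is:
>
>     perf stat -e cache-misses,dTLB-load-misses,dTLB-store-misses -r 10 ./gemm
>
> with ./gemm standing in for whichever benchmark binary is under test.)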
>
> On 20 May 2014 13:47, Hal Finkel <hfinkel at anl.gov> wrote:
> > ----- Original Message -----
> >> From: "Yi Kong" <kongy.dev at gmail.com>
> >> To: "Tobias Grosser" <tobias at grosser.es>
> >> Cc: "Hal Finkel" <hfinkel at anl.gov>, "Eric Christopher" <
> echristo at gmail.com>, "llvm-commits"
> >> <llvm-commits at cs.uiuc.edu>
> >> Sent: Tuesday, May 20, 2014 7:11:27 AM
> >> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite
> >>
> >> Tobias, I can't reproduce your findings on my machine. Even with
> >> output disabled (removing -DPOLYBENCH_DUMP_ARRAYS) and stdout piped
> >> to /dev/null, I still get lots of spikes. I think we need to exclude
> >> those tests until we find out how to stabilize those results.
> >
> > Okay, I'll also exclude them for now. How large is the working set?
> Could you be seeing TLB misses?
> >
> >  -Hal
> >
> >>
> >> On 18 May 2014 12:08, Yi Kong <kongy.dev at gmail.com> wrote:
> >> > I think that's due to the vast amount of output it produces. Maybe
> >> > replacing the output with an accumulator will give a more stable
> >> > result?
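> >> >
> >> > (Concretely, a minimal sketch of what I mean, assuming the usual
> >> > Polybench shape of an N x N double array A; the names here are only
> >> > illustrative:
> >> >
> >> >     #include <stdio.h>
> >> >     #define N 1024
> >> >
> >> >     static double A[N][N]; /* stands in for the kernel's output */
> >> >
> >> >     int main(void) {
> >> >       /* ... the kernel that fills A would run here ... */
> >> >       double sum = 0.0;
> >> >       for (int i = 0; i < N; i++)
> >> >         for (int j = 0; j < N; j++)
> >> >           sum += A[i][j];
> >> >       /* One line of output instead of N*N: the I/O cost drops out
> >> >          of the measurement, but the result is still consumed, so
> >> >          the compiler cannot dead-code-eliminate the kernel. */
> >> >       printf("checksum: %f\n", sum);
> >> >       return 0;
> >> >     }
> >> >
> >> > Unlike dropping the output entirely, the checksum would also still
> >> > catch gross miscompiles.)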
> >> >
> >> > On 17 May 2014 22:34, Tobias Grosser <tobias at grosser.es> wrote:
> >> >> On 17/05/2014 14:08, Yi Kong wrote:
> >> >>>
> >> >>> On 16 May 2014 15:25, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>>
> >> >>>> ----- Original Message -----
> >> >>>>>
> >> >>>>> From: "Yi Kong" <kongy.dev at gmail.com>
> >> >>>>> To: "Hal Finkel" <hfinkel at anl.gov>
> >> >>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
> >> >>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
> >> >>>>> <tobias at grosser.es>
> >> >>>>> Sent: Thursday, May 15, 2014 5:41:04 PM
> >> >>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
> >> >>>>> suite
> >> >>>>>
> >> >>>>> On 15 May 2014 13:59, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >>>>>>
> >> >>>>>> ----- Original Message -----
> >> >>>>>>>
> >> >>>>>>> From: "Yi Kong" <kongy.dev at gmail.com>
> >> >>>>>>> To: "Hal Finkel" <hfinkel at anl.gov>
> >> >>>>>>> Cc: "Eric Christopher" <echristo at gmail.com>, "llvm-commits"
> >> >>>>>>> <llvm-commits at cs.uiuc.edu>, "Tobias Grosser"
> >> >>>>>>> <tobias at grosser.es>
> >> >>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM
> >> >>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test
> >> >>>>>>> suite
> >> >>>>>>>
> >> >>>>>>> Hi Hal Finkel,
> >> >>>>>>>
> >> >>>>>>> What criteria do you use to decide which benchmarks are useful?
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset
> >> >>>>>> of
> >> >>>>>> the test suite" in which I explain my methadology in detail.
> >> >>>>>
> >> >>>>>
> >> >>>>> I think the approach you've taken is indeed sensible. However, I
> >> >>>>> don't really agree with your make -j6 option. The Xeon chip you
> >> >>>>> are testing on only has 4 cores, which means a lot of context
> >> >>>>> switching happens. The
> >> >>>>
> >> >>>>
> >> >>>> It is a dual-socket machine.
> >> >>>>
> >> >>>>> noise produced by that would be far too great for a "normal"
> >> >>>>> environment. Also, I believe that the testing machine should be
> >> >>>>> as quiet as possible; otherwise we are actually measuring the
> >> >>>>> noise!
> >> >>>>
> >> >>>>
> >> >>>> This is obviously ideal, but rarely possible in practice. More
> >> >>>> to the
> >> >>>> point, the buildbots are not quiet, but we still want to be able
> >> >>>> to extract
> >> >>>> execution-time changes from them without a large number of false
> >> >>>> positives.
> >> >>>> Some tests are just too sensitive to I/O time, or are too short,
> >> >>>> for this to
> >> >>>> be possible (because you really are just seeing the noise), and
> >> >>>> this
> >> >>>> exclusion list is meant to exclude such tests. Given a sufficient
> >> >>>> number of
> >> >>>> samples (10, for example), I've confirmed that it is possible to
> >> >>>> extract
> >> >>>> meaningful timing differences from the others at high
> >> >>>> confidence.
> >> >>>>
> >> >>>>>
> >> >>>>> I've been investigating the timeit tool in the test suite. It
> >> >>>>> turns out to be really inaccurate, and sometimes it's the main
> >> >>>>> source of the noise we are seeing. I've implemented timing using
> >> >>>>> the Linux perf tool instead. So far it seems to produce much
> >> >>>>> better results. I will publish the findings with the patch in a
> >> >>>>> separate thread once I've gathered enough data points. Maybe with
> >> >>>>> the more accurate timing tool, we might get a different picture.
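> >> >>>>>
> >> >>>>> (The kind of invocation I have in mind -- a sketch, assuming a
> >> >>>>> recent Linux perf -- is along the lines of:
> >> >>>>>
> >> >>>>>     perf stat -r 10 -- ./benchmark > /dev/null
> >> >>>>>
> >> >>>>> which runs the binary ten times and reports the mean elapsed
> >> >>>>> time together with its standard deviation.)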
> >> >>>>
> >> >>>>
> >> >>>> That's great!
> >> >>>>
> >> >>>>>
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>> I suggest you also have a look at the standard deviation or
> >> >>>>>>> the MAD.
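> >> >>>>>>>
> >> >>>>>>> (For reference, the MAD -- the median absolute deviation -- of
> >> >>>>>>> samples x_1..x_n is
> >> >>>>>>>
> >> >>>>>>>     MAD = median_i( | x_i - median_j(x_j) | ),
> >> >>>>>>>
> >> >>>>>>> i.e. the median distance from the median, which is much less
> >> >>>>>>> sensitive to a few outlier runs than the standard deviation.)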
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Of course this has already been considered and taken into
> >> >>>>>> account
> >> >>>>>> ;)
> >> >>>>>>
> >> >>>>>>> Some of the tests have a really large variance that we may not
> >> >>>>>>> want to include when benchmarking, e.g.
> >> >>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a patch
> >> >>>>>>> which makes the tables sortable so that it is easier to
> >> >>>>>>> investigate.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> If you feel that there is a test or tests that have too large
> >> >>>>>> of a variance for useful benchmarking, please compose a list,
> >> >>>>>> explain your criteria, and we'll merge it in some useful way.
> >> >>>>>
> >> >>>>>
> >> >>>>> Mainly Polybench/linear-algebra, but I can't give you the list
> >> >>>>> right now as the LLVM LNT site is down again.
> >> >>>
> >> >>>
> >> >>> These 5 tests have a really large MAD on various testing machines,
> >> >>> even with the perf tool. Please add them to the exclusion list.
> >> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm
> >> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm
> >> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm
> >> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
> >> >>
> >> >>
> >> >> This is interesting. Those benchmarks should in fact give reliable
> >> >> performance numbers (and they do so when I execute them). I just
> >> >> very
> >> >> briefly looked into this and my observation was that, if I pipe
> >> >> the output
> >> >> to a file or /dev/null, the gemm performance is always at the
> >> >> lower bound.
> >> >> Only when I run 'timeit' do I see these spikes. I see similar
> >> >> spikes if I just print the output to the console.
> >> >>
> >> >> It would be great if we could understand where those spikes come
> >> >> from.
> >> >>
> >> >> Tobias
> >>
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory