[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Fri Jun 28 02:28:16 PDT 2013

First, we need more samples per revision, but we really don’t have time to run --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply put, it reruns the benchmarks that are reported as regressed or improved. The idea is that if a benchmark is going to be flagged as a regression or improvement, we should get more data on that specific benchmark to make sure that is really the case. Adaptive sampling should reduce the rate of false-positive regression flags we see. We are able to do this because of LNT’s provisional commit system: after a run, we submit all the results but don’t commit them. The server reports the regressions, and then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. It has made a big difference on my local test machine.
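The rerun loop described above might look roughly like the following. This is only a sketch of the idea, not the actual patch: `adaptive_sample`, `run_once`, `flagged_by_server` and `extra_runs` are hypothetical names, and the server round trip (provisional submit, then a list of flagged benchmarks) is stood in for by a plain callback.

```python
def adaptive_sample(benchmarks, run_once, flagged_by_server, extra_runs=3):
    """Sample every benchmark once, then resample only the flagged ones.

    run_once(name) -> float        : hypothetical, runs one benchmark once
    flagged_by_server(samples)     : hypothetical stand-in for the provisional
                                     submit; returns names the server reports
                                     as regressed or improved
    """
    # First pass: a single sample per benchmark, as in a normal run.
    samples = {name: [run_once(name)] for name in benchmarks}

    # Provisional submission: results are reported but not committed yet.
    flagged = flagged_by_server(samples)

    # Second pass: spend the extra time only where a change was reported.
    for name in flagged:
        for _ in range(extra_runs):
            samples[name].append(run_once(name))
    return samples
```

The point of the design is that the cost scales with the number of flagged benchmarks rather than with the size of the whole suite, which is what makes it cheaper than sampling everything ten times.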

| As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that. My hope is this can characterize multi-modal results, be resilient to short spikes and detect long term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regression with.
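To make the quoted idea concrete, here is a minimal one-dimensional k-means sketch. It is not the LNT prototype, whose details are not given in this thread; the function names, the fixed `k`, and the evenly spaced initial centers are all assumptions made for illustration.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain Lloyd's algorithm on scalar runtimes; returns k cluster means."""
    data = sorted(values)
    # Assumed initialisation: pick centers spread evenly through the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def nearest_mode(value, centers):
    """Index of the discovered mean that a new sample is closest to."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))
```

A new sample can then be characterized by which mean it falls nearest to; what counts as a regression on top of that is exactly the open question raised in the quoted text.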

Basic question: I'm imagining the volume of data being dealt with isn't that large (as statistical datasets go), and you're discarding old values anyway (since we care whether we're regressing with respect to now rather than LLVM 1.1). So can't you just build a kernel density estimator of the "baseline" runtime and then estimate the probability that samples from a given codebase are going to be "slower" than the baseline? I suppose the drawback of not explicitly modelling the modes (with all its complications and tunings) is that you can't determine when a value is bigger than the lower cluster, even though it's smaller than the bigger cluster, and estimate whether it's evidence of a slowdown within the small-cluster regime. Still, that seems a bit complicated to do automatically.
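One way to read the kernel density suggestion, sketched under assumptions: fit a Gaussian KDE to the baseline runtimes and ask how much of the baseline's probability mass lies below a new sample. The function name, the Gaussian kernel, and the normal-reference bandwidth rule are choices made here for illustration, not something specified in the thread.

```python
from statistics import NormalDist, stdev


def baseline_mass_below(baseline, sample, bandwidth=None):
    """Estimate P(baseline runtime < sample) under a Gaussian KDE.

    A value near 1.0 means the new sample is slower than almost
    everything the baseline is expected to produce.
    """
    if bandwidth is None:
        # Assumed bandwidth: normal-reference rule of thumb.
        bandwidth = 1.06 * stdev(baseline) * len(baseline) ** -0.2
    bandwidth = max(bandwidth, 1e-12)  # guard against identical baseline values
    std_normal = NormalDist()
    # The CDF of a Gaussian KDE is the average of the per-point normal CDFs.
    return sum(std_normal.cdf((sample - x) / bandwidth)
               for x in baseline) / len(baseline)
```

Because the estimate works on the whole density, a bimodal baseline needs no special handling, but as noted above it cannot tell a slowdown inside the lower mode from an ordinary draw out of the upper one.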

(Incidentally, responding to the earlier email below, I think you don't really want to compare moving averages but rather use some statistical test to quantify whether the points within the "earlier window" are statistically significantly separated from those within the "later window". All moving averages do is smear out useful information, which can help if you've just got far too many data points, but otherwise it doesn't really help.)
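As one possible instance of "some statistical test", here is a one-sided permutation test on the difference of window means. The choice of test, the function name, and the parameters are assumptions; a rank-based test such as Mann-Whitney would fit the same description.

```python
import random


def permutation_p_value(earlier, later, rounds=10000, seed=0):
    """Approximate p-value for 'later is slower than earlier'.

    Shuffles the pooled samples and counts how often a random split
    shows a difference in means at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same answer
    n = len(earlier)
    m = len(later)
    observed = sum(later) / m - sum(earlier) / n
    pooled = list(earlier) + list(later)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = sum(pooled[n:]) / m - sum(pooled[:n]) / n
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (rounds + 1)
```

A small p-value says the later window is unlikely to be a random relabelling of the same data, which uses every sample directly instead of averaging them first.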

Cheers,

Dave

Probably obvious anyway, but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is “stable” produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 27, 2013, at 9:41 AM, Bob Wilson <bob.wilson at apple.com> wrote:

On Jun 27, 2013, at 9:27 AM, Renato Golin <renato.golin at linaro.org> wrote:

On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:

We are looking for a good way/value to show the reliability of individual results in the UI. Do you have any experience with what a good measure of the reliability of test results is?

Hi Tobi,

I had a look at this a while ago, but never got around to actually working on it. My idea was never to use point changes as an indication of progress/regressions unless there was a significant change (2/3 sigma). What we should do is compare the current moving average with the past moving averages (of K runs), namely both the last average and the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.

We should also compare the current mov-avg with M non-overlapping mov-avgs before, and calculate if we're monotonically increasing, decreasing or if there is a difference of 2/3 sigma between the current mov-avg (N) and the (N-M)th mov-avg. That would give us an idea on the trends of each test.
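The moving-average comparison proposed here could be sketched as follows. The proposal leaves open what "sigma" is measured over, so this sketch assumes it is the standard deviation of the older, non-overlapping window; the function names and the `sigmas` parameter are likewise hypothetical.

```python
from statistics import mean, stdev


def moving_average_change(history, k):
    """Biggest difference between the current K-run average and two earlier ones.

    history: runtimes for one test, oldest first; needs at least 2*k entries.
    """
    if k < 2 or len(history) < 2 * k:
        raise ValueError("need k >= 2 and at least 2*k samples")
    current = mean(history[-k:])
    last = mean(history[-k - 1:-1])        # previous average, overlaps current
    disjoint = mean(history[-2 * k:-k])    # (N-K)th average, shares no samples
    # Keep whichever difference is larger in magnitude, sign included.
    return max(current - last, current - disjoint, key=abs)


def is_significant(history, k, sigmas=2.0):
    """Flag a change only if it exceeds `sigmas` times the assumed noise level."""
    noise = stdev(history[-2 * k:-k])      # assumption: sigma of the older window
    return abs(moving_average_change(history, k)) > sigmas * noise
```

Comparing against the disjoint window is what stops samples shared between overlapping averages from masking a step change.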

Chris Matthews has recently been working on implementing something similar to that.  Chris, can you share some details?
