[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Thu Jun 27 11:14:31 PDT 2013

There are a few things we have looked at with LNT runs, so I will share the insights we have had so far. A lot of the problems we have are artificially created by our test protocols rather than by the compiler changes themselves. I have been doing a lot of large sample runs of single benchmarks to characterize them better. Some key points:

1) Some benchmarks are bi-modal or multi-modal; single means won’t describe these well
2) Some runs are pretty noisy and sometimes have very large single-sample spikes
3) Most benchmarks don’t regress most of the time
4) Compile time is a pretty stable metric; execution time is not always

and depending on what you are using LNT for:

5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples)
6) We also need to catch long slow regressions
7) Some of the “benchmarks” are really just correctness tests, and were not designed with repeatable measurement in mind.

As it stands now, we really can’t detect small regressions or slow regressions, and there are a lot of false positives.
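To make points 1 and 2 concrete, here is a small sketch in Python. The timings are made-up numbers for illustration, not real LNT data, and the helper name `summarize` is hypothetical.

```python
import statistics

# Hypothetical execution times (seconds) for one benchmark; not real LNT data.
bimodal = [1.00, 1.01, 0.99, 1.00, 1.30, 1.31, 1.29, 1.30]
spiky = [1.00, 1.01, 0.99, 1.00, 1.01, 0.99, 1.00, 3.00]

def summarize(samples):
    """Return (mean, median) of a list of samples."""
    return statistics.mean(samples), statistics.median(samples)

bimodal_mean, bimodal_median = summarize(bimodal)
spiky_mean, spiky_median = summarize(spiky)

# The bimodal mean lands between the two modes, where no sample ever falls.
print(f"bimodal: mean={bimodal_mean:.3f} median={bimodal_median:.3f}")
# A single spike drags the mean up by 25% while the median stays at 1.0.
print(f"spiky:   mean={spiky_mean:.3f} median={spiky_median:.3f}")
```

In the bimodal case neither the mean nor the median describes what the benchmark actually does, which is why a single summary number is not enough there.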

There are two things I am working on right now to help make regression detection more reliable: adaptive sampling and cluster-based regression flagging.

First, we need more samples per revision. But we really don’t have time to do --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Simply put, it reruns benchmarks which are reported as regressed or improved. The idea is that if a benchmark is going to be flagged as a regression or improvement, we get more data on that specific benchmark to make sure that is the case. Adaptive sampling should reduce the false positive regression flagging rate we see. We are able to do this because of LNT’s provisional commit system: after a run, we submit all the results but don’t commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.
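Roughly, the adaptive sampling loop looks like the sketch below. This is not the actual patch and does not use LNT's real API: `run_benchmark`, `flagged_by_server`, `adaptive_run`, the 5% threshold and the number of extra samples are all hypothetical stand-ins, and the benchmark runs are simulated with random numbers.

```python
import random
import statistics

def run_benchmark(name, rng):
    """Stand-in for one timed benchmark run (hypothetical; returns seconds)."""
    return rng.gauss(1.0, 0.05)

def flagged_by_server(results, baseline, threshold=0.05):
    """Stand-in for the server's provisional report: names whose median
    differs from the baseline by more than `threshold` (relative)."""
    return [name for name, samples in results.items()
            if abs(statistics.median(samples) - baseline[name]) / baseline[name] > threshold]

def adaptive_run(names, baseline, extra_samples=4, seed=0):
    rng = random.Random(seed)
    # First pass: one sample for every benchmark.
    results = {name: [run_benchmark(name, rng)] for name in names}
    # Provisional submit: ask which benchmarks look regressed or improved.
    suspects = flagged_by_server(results, baseline)
    # Rerun only the suspects to gather more data where it matters.
    for name in suspects:
        results[name].extend(run_benchmark(name, rng) for _ in range(extra_samples))
    return results, suspects

names = ["bench_a", "bench_b", "bench_c"]
baseline = {name: 1.0 for name in names}
results, suspects = adaptive_run(names, baseline)
for name in names:
    print(name, len(results[name]), "samples")
```

The point of the design is that the extra run time is spent only on benchmarks that are about to be flagged, instead of multiplying the cost of the whole suite.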

As far as regression flagging goes, I have been working on a k-means discovery/clustering-based approach that first comes up with a set of means in the dataset, then characterizes newer data based on those means. My hope is that this can characterize multi-modal results, be resilient to short spikes, and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with.
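As a rough illustration of the clustering idea (not the LNT prototype), the sketch below runs plain one-dimensional k-means over a benchmark's history and then checks recent samples against the discovered means. The flagging rule, the 5% tolerance, the choice of k=2 and the names `kmeans_1d` and `is_new_mode` are all assumptions made for the example; the real criteria are still open.

```python
def kmeans_1d(samples, k=2, iterations=20):
    """Plain Lloyd's algorithm on scalars; returns sorted cluster means."""
    ordered = sorted(samples)
    # Seed centers at evenly spaced positions in the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

def is_new_mode(history, recent, k=2, tolerance=0.05):
    """Assumed rule: flag only if every recent sample is more than
    `tolerance` (relative) away from every mean found in the history."""
    centers = kmeans_1d(history, k)
    def far(x):
        return all(abs(x - c) / c > tolerance for c in centers)
    return all(far(x) for x in recent)

# Hypothetical bi-modal history (seconds); not real LNT data.
history = [1.00, 1.01, 0.99, 1.00, 1.30, 1.31, 1.29, 1.30]
print(is_new_mode(history, [1.30, 1.00, 1.31]))  # samples sit on known modes
print(is_new_mode(history, [1.00, 3.00, 1.01]))  # one short spike
print(is_new_mode(history, [1.50, 1.52, 1.49]))  # sustained move to a new level
```

Requiring every recent sample to be away from all known means is what makes this toy rule ignore a single spike and ignore hopping between existing modes, at the cost of reacting more slowly to a real change.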

Probably obvious anyway, but since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is “stable” produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8-core machine trigger hundreds of regressions in LNT.
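One cheap way to check a machine before trusting its numbers is to run the same benchmark several times and look at the spread. The sketch below is only an illustration: the timings are invented, and the 1% limit and the helper names `relative_stddev` and `looks_stable` are assumptions, not an LNT feature.

```python
import statistics

def relative_stddev(samples):
    """Sample standard deviation as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

def looks_stable(samples, limit=0.01):
    """Assumed rule of thumb: call the machine stable if the spread is under `limit`."""
    return relative_stddev(samples) < limit

# Hypothetical repeated timings (seconds) of one benchmark.
quiet = [2.000, 2.004, 1.998, 2.002, 1.996]   # idle machine
loaded = [2.00, 2.15, 1.98, 2.31, 2.05]       # another process keeping a core busy

print(f"quiet:  {relative_stddev(quiet):.4f} stable={looks_stable(quiet)}")
print(f"loaded: {relative_stddev(loaded):.4f} stable={looks_stable(loaded)}")
```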

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 27, 2013, at 9:41 AM, Bob Wilson <bob.wilson at apple.com> wrote:

> 
> On Jun 27, 2013, at 9:27 AM, Renato Golin <renato.golin at linaro.org> wrote:
> 
>> On 27 June 2013 17:05, Tobias Grosser <tobias at grosser.es> wrote:
>> We are looking for a good way/value to show the reliability of individual results in the UI. Do you have some experience, what a good measure of the reliability of test results is?
>> 
>> Hi Tobi,
>> 
>> I had a look at this a while ago, but never got around to actually working on it. My idea was to never use point changes as an indication of progress/regressions unless there was a significant change (2/3 sigma). What we should do is compare the current moving average (of K runs) with both the previous moving average and the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.
>> 
>> We should also compare the current mov-avg with M non-overlapping mov-avgs before, and calculate if we're monotonically increasing, decreasing or if there is a difference of 2/3 sigma between the current mov-avg (N) and the (N-M)th mov-avg. That would give us an idea on the trends of each test.
> 
> Chris Matthews has recently been working on implementing something similar to that.  Chris, can you share some details?
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
