[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Renato Golin renato.golin at linaro.org
Thu Jun 27 12:04:45 PDT 2013


Hi Chris,

It's great that someone with the proper background is finally looking at this.
You're much better equipped than I am to deal with it, so I'll trust your
judgement; I've paid more attention to correctness than to benchmarks. Some
comments inline.


On 27 June 2013 19:14, Chris Matthews <chris.matthews at apple.com> wrote:

> 1) Some benchmarks are bi-modal or multi-modal, single means won’t
> describe these well
>

True. My idea was to have a moving "measurement", with a simple average as
the basic case, but with others applied as well. It's possible that k-means
can give you that, but I haven't understood yet what your vector space and
distance measures will be, so I can't guess.
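
To make the "moving measurement" idea concrete, here's a rough sketch (purely
illustrative, not LNT code; the window size and the choice of median are my
assumptions):

    # Sketch only: summarise the most recent samples of one benchmark with a
    # moving statistic. Median is more robust to single-sample spikes than
    # the mean; the window of 10 is an arbitrary choice for illustration.
    from statistics import mean, median, pstdev

    def moving_summary(samples, window=10):
        recent = samples[-window:]
        return {"median": median(recent),
                "mean": mean(recent),
                "stddev": pstdev(recent)}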


> 2) Some runs are pretty noisy and sometimes have very large single sample
> spikes
> 3) Most benchmarks don’t regress most of the time
>

Most ARM benchmarks regress all the time because both the signal and the
noise are in milliseconds, where machine and OS interference play a crucial
part. But they don't regress over time, and they keep their average AND
deviation forever. So, if you can filter the noise on *all* benchmarks, it'd
be great for ARM testing.
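
For the record, the filtering I have in mind is nothing fancier than only
flagging a sample that falls outside the historical mean plus a few standard
deviations (the 3-sigma threshold below is my assumption, not an LNT setting):

    # Sketch only: a new sample is suspicious if it sits more than `nsigma`
    # standard deviations above the historical mean.
    from statistics import mean, pstdev

    def is_suspicious(history, new_sample, nsigma=3.0):
        mu = mean(history)
        sigma = pstdev(history)
        return new_sample > mu + nsigma * sigma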


> 5) A regression is not really something to worry about unless it lasts for
> a while (some number of revisions or days or samples)
> 6) We also need to catch long slow regressions
>

Yup. Moving peak and trend.
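
By "trend" I mean something as simple as a least-squares slope over the last
N samples, only worrying when it stays positive for a while (window and
threshold below are made-up numbers, just to illustrate):

    # Sketch only: estimate a long-term trend as the least-squares slope over
    # the last `window` samples; a persistently positive slope hints at a
    # slow regression.
    def slope(samples, window=30):
        recent = samples[-window:]
        n = len(recent)
        x_mean = (n - 1) / 2.0
        y_mean = sum(recent) / n
        num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(recent))
        den = sum((x - x_mean) ** 2 for x in range(n))
        return num / den if den else 0.0

    def slow_regression(samples, threshold=0.001):
        return slope(samples) > threshold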


> 7) Some of the “benchmarks” are really just correctness tests, and were not
> designed with repeatable measurement in mind.
>

Yes. It would be great to move them to Applications, and *not* time their
execution. Benchmarks are specifically designed to measure execution time;
applications aren't.

If an application is important enough that we want to measure it, we should
actively turn it into a benchmark, making sure it's actually performing its
core functionality in a repeatable way and with enough confidence that noise
isn't playing a part in the numbers. Just throwing it in and timing its
execution will create a school of red herrings.
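
To illustrate what I mean by actively changing an application into a
benchmark (a generic sketch; `core_work` is a hypothetical stand-in for the
real kernel, not anything in the test-suite): fix the inputs, keep I/O and
setup out of the timed region, and time only the core work, repeated enough
times to be measurable:

    # Sketch only: the generic shape of an application turned into a benchmark.
    import time

    def core_work(data):
        # Hypothetical kernel: the part of the application we care about.
        return sum(x * x for x in data)

    def run_benchmark(iterations=20):
        data = list(range(100000))     # fixed, repeatable input
        core_work(data)                # warm-up run, not timed
        timings = []
        for _ in range(iterations):
            start = time.perf_counter()
            core_work(data)
            timings.append(time.perf_counter() - start)
        return timings                 # report all samples, not just one mean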


> After a run, we submit all the results, but don’t commit them. The server
> reports the regressions, then we rerun the regressing benchmarks more
> times.  This gives us more data in the places where we need it most.  This
> has made a big difference on my local test machine.
>

This is a great idea, and I think it could improve things at a much lower
cost. It won't replace decent benchmarking strategies at the software level,
but it will reduce the noise, hopefully enough to let other analyses succeed
at an early stage.
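
In pseudo-client terms, I picture the loop roughly like this (run_once,
submit_provisional and commit_results are hypothetical helpers, not the
actual LNT API):

    # Sketch only: submit provisionally, re-run whatever the server flags,
    # then commit the enlarged result set.
    def run_with_reruns(benchmarks, server, max_reruns=5):
        results = {b: [run_once(b)] for b in benchmarks}
        report = server.submit_provisional(results)   # submitted, not committed
        for bench in report.regressions:
            results[bench] += [run_once(bench) for _ in range(max_reruns)]
        server.commit_results(results)                # only now commit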


> As far as regression flagging goes, I have been working on a k-means
> discovery/clustering based approach to first come up with a set of means in
> the dataset, then characterize newer data based on that.  My hope is this
> can characterize multi-modal results, be resilient to short spikes and
> detect long term motion in the dataset.  I have this prototyped in LNT, but
> I am still trying to work out the best criteria to flag regression with.
>

I'd like to understand that better (mostly for personal education). But it
can be offline, if the rest of the list is not interested...
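
For anyone else following, my rough reading of the approach (this is my
guess, not Chris's actual prototype; k and the outlier threshold are
assumptions): cluster each benchmark's history into a few means, classify
every new sample by its nearest cluster, and treat samples far from all
clusters, or drifting cluster centres, as the signal:

    # Sketch only: a tiny 1-D k-means over one benchmark's samples.
    from statistics import mean, pstdev

    def kmeans_1d(samples, k=2, iterations=50):
        centres = sorted(samples)[::max(1, len(samples) // k)][:k]
        for _ in range(iterations):
            clusters = [[] for _ in centres]
            for s in samples:
                i = min(range(len(centres)), key=lambda j: abs(s - centres[j]))
                clusters[i].append(s)
            centres = [mean(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        return centres, clusters

    def classify(sample, centres, clusters, nsigma=3.0):
        i = min(range(len(centres)), key=lambda j: abs(sample - centres[j]))
        spread = pstdev(clusters[i]) if len(clusters[i]) > 1 else 0.0
        outlier = abs(sample - centres[i]) > nsigma * spread
        return i, outlier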


> You have to make sure power management is not mucking with clock rates, and
> that none of the magic backup/indexing/updating/networking/screensaver
> stuff on your machine is running.  In practice, I have seen a process using
> 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and
> having 2 cores loaded on an 8 core machine trigger hundreds of regressions
> in LNT.
>

I have seen this too. I think LNT has two modes: test and benchmark (I'm not
sure how to switch between them); one tries to use all available cores
(which makes the benchmarks unstable) and the other runs on a single core
all the way. I think we could assume that, for tests, we can use as much
juice as we have available, and for benchmarks, we could use fewer than the
total number of cores (the practical number will vary with the architecture).

It's better to re-run some benchmarks 10 times while using 8 CPUs than to
run everything on only one...
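
As a concrete (and purely illustrative) example of that trade-off, with
run_one and looks_noisy as hypothetical stand-ins: run the bulk of the suite
over a pool smaller than the machine, then re-run the noisy ones serially:

    # Sketch only: parallel first pass, serial re-runs for noisy benchmarks.
    from concurrent.futures import ProcessPoolExecutor

    def run_suite(benchmarks, cores=6, reruns=10):
        with ProcessPoolExecutor(max_workers=cores) as pool:
            results = dict(zip(benchmarks, pool.map(run_one, benchmarks)))
        for bench in [b for b in benchmarks if looks_noisy(results[bench])]:
            results[bench] = [run_one(bench) for _ in range(reruns)]
        return results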

cheers,
--renato

