[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Chris Matthews chris.matthews at apple.com
Sun Jun 30 18:02:28 PDT 2013


This is probably another area where a bit of dynamic behavior could help.  When we find a regression, kick off some runs to bisect back to where it first manifests.  This is what we would be doing manually anyway.  We could search back with just the set of regressing benchmarks, meaning the whole suite does not have to be re-run (unless it is a global regression).
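Roughly the kind of loop I have in mind (just a Python sketch; build_revision() and run_benchmarks() are made-up stand-ins for whatever the runner actually provides, not existing LNT APIs):

def bisect_regression(good_rev, bad_rev, regressing_benchmarks, revisions,
                      build_revision, run_benchmarks, threshold=1.05):
    """Binary-search the revision range, re-running only the benchmarks
    that regressed, until the first bad revision is found."""
    lo = revisions.index(good_rev)
    hi = revisions.index(bad_rev)
    baseline = run_benchmarks(build_revision(good_rev), regressing_benchmarks)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        results = run_benchmarks(build_revision(revisions[mid]),
                                 regressing_benchmarks)
        # Call the revision "bad" if any tracked benchmark slowed down by
        # more than the threshold relative to the known-good baseline.
        regressed = any(results[b] > baseline[b] * threshold
                        for b in regressing_benchmarks)
        if regressed:
            hi = mid
        else:
            lo = mid
    return revisions[hi]  # first revision where the regression shows up

Since only the regressing benchmarks are run at each step, the cost is on the order of log2(number of revisions) short runs rather than a full suite run per candidate.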

There are situations where we see commits which make things slower and then faster again, but so far those seem to come from experimental features being switched on and then off.

The problem with moving averages is that they really don’t behave well when the benchmark is naturally bimodal.  One thing LNT already does to “smooth” the results for you is to present the min of the samples at a particular revision, which (hopefully) approximates the actual runtime without noise.  That works well when there are a lot of samples per revision, but not across revisions, which is where we really need the smoothing.   One way to explore this is to turn
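To make the two kinds of smoothing concrete, here is a toy illustration (not LNT code): take the min of the samples at each revision, then smooth across revisions with a moving window.  A plain moving average gets dragged around by a benchmark that legitimately flips between two runtimes; a moving median is one alternative worth trying.

import statistics

def per_revision_min(samples_by_rev):
    # samples_by_rev: {revision: [runtime, runtime, ...]}
    return {rev: min(samples) for rev, samples in samples_by_rev.items()}

def moving_average(values, window=5):
    return [statistics.fmean(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]

def moving_median(values, window=5):
    # Less sensitive to a bimodal benchmark bouncing between two values.
    return [statistics.median(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]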

Ignoring small regressions is an interesting problem.  Do it too many times and slowness creeps in.  But you are correct: no one wants to fix a small regression.  There is a value computation that we are all doing when we watch the results, and it is not explicit in the software or documentation right now.  Mine is along the lines of: a small regression in an important benchmark with certain flags matters, and a bigger regression in a less important benchmark or flag configuration matters too, etc.
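If we ever wanted to make that value computation explicit, it could be as simple as weighting a regression by how much we care about the benchmark/flag combination.  A sketch (the weights, names, and threshold are invented for illustration; none of this exists in LNT today):

BENCHMARK_WEIGHT = {"SPEC/CINT2006": 1.0, "SingleSource/misc": 0.2}
FLAG_WEIGHT = {"-O3": 1.0, "-O0 -g": 0.3}

def regression_priority(benchmark, flags, pct_slowdown):
    weight = BENCHMARK_WEIGHT.get(benchmark, 0.5) * FLAG_WEIGHT.get(flags, 0.5)
    return pct_slowdown * weight

def worth_investigating(benchmark, flags, pct_slowdown, threshold=1.0):
    # With these weights, a 2% slip at -O3 on an important suite outranks
    # a 10% slip on a low-priority benchmark at -O0 -g.
    return regression_priority(benchmark, flags, pct_slowdown) >= threshold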

We also lack any way to coordinate or annotate regressions, though that is a whole separate problem.

Another idea I have been toying with is building a "change of interest" model, where we can explicitly tag particular revisions as impacting performance and then test them preferentially.  That could allow the effort to be focused on revisions where it might best have an effect.  I don’t know if that would play out well in reality though.
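Mechanically it could be a very small thing, e.g. a priority queue where tagged revisions jump ahead of untagged ones (names and revision numbers here are hypothetical, just to show the shape of it):

import heapq

class RevisionQueue:
    def __init__(self):
        self._heap = []

    def add(self, revision, interest=0):
        # Lower tuples pop first, so negate interest: tagged revisions
        # jump ahead of ordinary ones.
        heapq.heappush(self._heap, (-interest, revision))

    def next_to_test(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

queue = RevisionQueue()
queue.add("r185200")               # ordinary commit
queue.add("r185212", interest=10)  # tagged as likely performance-sensitive
assert queue.next_to_test() == "r185212"

The hard part is not the queue, it is deciding which revisions to tag, whether by hand, from commit metadata, or from past behaviour.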

Chris Matthews
chris.matthews@.com
(408) 783-6335

On Jun 30, 2013, at 11:30 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 30 June 2013 10:14, Anton Korobeynikov <anton at korobeynikov.info> wrote:
> 1. Increasing sample size to at least 5-10
> 
> That's not feasible on slower systems. A single data point takes 1 hour on the fastest ARM board I can get (Chromebook). Getting 10 samples at different commits will give you similar accuracy if behaviour doesn't change, and you can rely on 10-point blocks before and after each change to have the same result.
> 
> What won't happen is one commit making things truly faster and the very next making them slow again (or the other way around).  So all we need to determine, for each commit, is whether it was the one that made all subsequent runs slower/faster, and that we can get from several commits after the culprit, since the probability that another (unrelated) commit changes the behaviour is small.
> 
> This is why I proposed something like moving averages. Not because it's the best statistical model, but because it works around a concrete problem we have. I don't care which model/tool you use, as long as it doesn't mean I'll have to wait 10 hours for a result, or sift through hundreds of commits every time I see a regression in performance. What that will do, for sure, is make me ignore small regressions, since they won't be worth the massive work to find the real culprit.
> 
> If I had a team of 10 people just to look at regressions all day long, I'd ask them to make a proper statistical model and go do more interesting things...
> 
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
