[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

Sean Silva chisophugis at gmail.com
Fri May 15 19:16:25 PDT 2015


Is there a way to download the data off http://llvm.org/perf? I'd like to
help with this but I don't have a good dataset to analyze.

It definitely seems like the weakest part of the current and proposed
scheme is that it only looks at two runs. That is basically useless when
we're talking about only a handful of samples (<4???) per run. Since the
machine's noise can be modeled from run to run (also sample to sample, but
for simplicity just consider run to run) as a random process in the run
number, all the techniques from digital filtering come into play. From
looking at a couple of the graphs on LNT, the machine noise appears to be
almost exclusively at Nyquist (i.e., it alternates from sample to sample),
falling off to a smaller component around half Nyquist (I can analyze in
more detail if I can get my hands on the data). We probably want a lowpass
differentiator cutting off at about half Nyquist.
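To make that concrete, here is the sort of quick check I have in mind (a
rough Python/numpy sketch with made-up per-run numbers, not anything tied
to LNT's internals):

    import numpy as np

    # Hypothetical per-run times for one benchmark on one machine, ordered
    # by run number: alternating noise plus a small level shift halfway.
    times = np.array([10.1, 10.6, 10.0, 10.7, 10.1, 10.6, 10.2, 10.8,
                      10.6, 11.1, 10.5, 11.2, 10.6, 11.1, 10.7, 11.3])

    # 1. Where does the noise live?  Look at the magnitude spectrum of the
    #    mean-removed series; a spike in the last bin (0.5 cycles/run,
    #    i.e. Nyquist) means the noise alternates from run to run.
    detrended = times - times.mean()
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(detrended), d=1.0)  # cycles per run
    for f, mag in zip(freqs, spectrum):
        print("%.3f cycles/run: %.2f" % (f, mag))

    # 2. A crude lowpass differentiator: a 2-point moving average nulls the
    #    alternating (Nyquist) component, and the first difference of the
    #    smoothed series then highlights genuine level shifts.
    smoothed = np.convolve(times, np.ones(2) / 2.0, mode="valid")
    level_shift = np.diff(smoothed)
    print("per-run level-change estimate:", np.round(level_shift, 2))

A real implementation would use a properly designed filter, but even
something that simple separates the alternating noise from a step change.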

I would strongly recommend starting with a single benchmark on a single
machine and coming up with a detection routine just for it that is
basically 100% accurate, then generalizing as appropriate so that you are
getting reliable coverage of a larger portion of the benchmarks. The
machine's noise is probably easiest to characterize and most generalizable
across runs.


-- Sean Silva

On Fri, May 15, 2015 at 2:24 PM, Chris Matthews <chris.matthews at apple.com>
wrote:

> tl;dr: in low-data situations we don’t look at past information, and
> that increases the false positive regression rate.  We should look at the
> (possibly themselves incorrect) recent past runs to fix that.
>
> Motivation: LNT’s current regression detection system has a false
> positive rate that is too high to make it useful.  With test suites as
> large as the llvm “test-suite”, a single report will show hundreds of
> regressions.  The false positive rate is so high that the reports are
> ignored: it is impossible for a human to triage them, large performance
> problems are lost in the noise, and small but important regressions never
> even have a chance.  Later today I am going to commit a new unit test to
> LNT with 40 of my favorite regression patterns.  It has gems such as a
> flat but noisy line, a 5% regression in 5% noise, bimodal behavior, and a
> slow increase; we fail to classify most of these correctly right now.
> They are not trick questions: all are obvious regressions or
> non-regressions that are plainly visible.  I want us to correctly
> classify them all!
>
> Some context: LNT’s regression detection algorithm, as I understand it:
>
> detect(current run’s samples, last run’s samples) -> improve, regress, or
> unchanged.
>
>     # when recovering from errors, performance should not be counted
>     Current or last run failed -> unchanged
>
>     delta = min(current samples) - min(prev samples)
>
>     # too small to measure
>     delta < (confidence * machine noise threshold (0.0005s by default)) ->
> unchanged
>
>     # too small to care
>     delta % < 1% -> unchanged
>
>     # too small to care
>     delta < 0.01s -> unchanged
>
>     if len(current samples) >= 4 && len(prev samples) >= 4
>         Mann-Whitney U test -> possibly unchanged
>
>     # multisample, confidence interval check
>     if len(current samples) > 1
>         check whether delta is within the samples' confidence interval ->
> if so, unchanged; else improve or regress
>
>     # single sample, range check
>     if len(current samples) == 1
>         all % deltas above 1% -> improve or regress
>
>
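> Roughly, as a Python sketch (this is not the actual LNT code; thresholds
> and names just follow the description above, and the confidence-interval
> check is a crude stand-in):
>
>     from scipy.stats import mannwhitneyu
>
>     NOISE_THRESHOLD = 0.0005   # machine noise threshold, seconds
>     CONFIDENCE = 1.0           # confidence multiplier
>
>     def detect(current, prev, current_failed=False, prev_failed=False):
>         # when recovering from errors, performance should not be counted
>         if current_failed or prev_failed:
>             return "unchanged"
>         delta = min(current) - min(prev)
>         # too small to measure
>         if abs(delta) < CONFIDENCE * NOISE_THRESHOLD:
>             return "unchanged"
>         # too small to care
>         if abs(delta) / min(prev) < 0.01 or abs(delta) < 0.01:
>             return "unchanged"
>         # with enough samples on both sides, ask whether the two sample
>         # sets are even distinguishable
>         if len(current) >= 4 and len(prev) >= 4:
>             _, p = mannwhitneyu(current, prev)
>             if p > 0.05:
>                 return "unchanged"
>         # multisample: is the previous best inside the current spread?
>         if len(current) > 1:
>             if min(current) <= min(prev) <= max(current):
>                 return "unchanged"
>             return "improve" if delta < 0 else "regress"
>         # single sample: anything that survived the checks above is flagged
>         return "improve" if delta < 0 else "regress"
>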
> The “too small to care” rules are newer inventions.
>
> Effectiveness data: to see how well these rules work, I ran a 14-machine,
> 7-day report:
>
> - 16773 run comparisons
> - 13852 marked unchanged because of small % delta
> - 2603 unchanged because of small delta
> - 0 unchanged because of Mann Whitney U test
> - 0 unchanged because of confidence interval
> - 318 improved or regressed because single sample change over 1%
>
> Real regressions: probably 1 or 2, not that I will click 318 links to
> check for sure… hence the motivation.
>
> Observations: Most of the work is done by dropping small deltas.
> Confidence intervals and Mann-Whitney U tests are the tests we want to be
> triggering; however, they only work with many samples.  Even with reruns,
> most tests end up being a single sample.  LNT bots that are triggered
> after another build (unless using the multisample feature) just have one
> sample at each rev.  Multisample is not a good option because most runs
> already take a long time.
>
> Even with a small amount of predictable noise, the len(current samples)
> == 1 path will flag a lot of runs, especially if len(prev) > 1.  Reruns
> actually make this worse by making it likely that we flag the next run
> after the run we rerun.  For instance, a flat line with 5% random noise
> flags all the time.
>
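> To put a number on that: a quick back-of-the-envelope simulation (not LNT
> code; the 10-second baseline and uniform noise model are made up) of a
> flat benchmark with 5% noise, run through the single-sample >1% rule:
>
>     import random
>
>     random.seed(0)
>     base = 10.0  # seconds
>     # one sample per run, flat performance, +/-5% uniform noise
>     runs = [base * (1 + random.uniform(-0.05, 0.05)) for _ in range(1000)]
>     flagged = sum(
>         1 for prev, cur in zip(runs, runs[1:])
>         if abs(cur - prev) / prev > 0.01 and abs(cur - prev) > 0.01
>     )
>     print("flagged %d of %d run-to-run comparisons" % (flagged, len(runs) - 1))
>
> On a series like that, the current rules flag the large majority of
> comparisons even though nothing changed.
>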
> Besides the Mann-Whitney U test, we are not using prev_samples in any
> sane way.
>
> Ideas:
>
> - Try to get more samples in as many places as possible.  Maybe
> --multisample=4 should be the default?  Make bots run more often (I have
> already done this on green dragon).
>
> - Use recent past run information to enhance single-sample regression
> detection.  I think we should add a lookback window and model the recent
> past.  I tried a technique suggested by Mikhail Zolotukhin of computing
> the delta as the smallest difference between the current samples and all
> the previous samples; it was far more effective (see the sketch after
> this list).  Alternatively, we could try a confidence interval generated
> from the previous runs, though that may not work on bimodal tests.
>
> - Currently prev_samples is almost always just one other run, probably
> with only one sample itself.  Let’s give this more samples to work with:
> start passing more previous run data to all uses of the algorithm.  In
> most places we intentionally limit the computation to current=run and
> previous=run-1; let’s do something like previous=run-[1-10].  The risk in
> this approach is that regression noise in the lookback window could
> trigger a false negative (we miss detecting a regression).  I think this
> is acceptable since we already miss lots of them because the reports are
> not actionable.
>
> - Given the choice between false positives and false negatives, let’s
> err towards false negatives.  We need to have a manageable number of
> detected regressions or else we can’t act on them.
>
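> A rough sketch of the lookback idea (not a patch; the names and the
> 10-run window are placeholders):
>
>     LOOKBACK = 10
>
>     def lookback_delta(current_samples, previous_runs):
>         # previous_runs: sample lists from recent runs, newest first;
>         # only the last LOOKBACK runs are considered.
>         history = [s for run in previous_runs[:LOOKBACK] for s in run]
>         if not history:
>             return None
>         current_best = min(current_samples)
>         # smallest (in magnitude) difference between the current best and
>         # any recent sample, so a bimodal history does not get flagged on
>         # every flip between its two modes
>         return min((current_best - h for h in history), key=abs)
>
> A regression would then only be reported if even the closest recent
> sample is more than the usual thresholds away.
>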
> Any objections to me implementing these ideas?