[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

Chris Matthews chris.matthews at apple.com
Fri May 15 20:46:44 PDT 2015


The easiest way to get the data off http://llvm.org/perf is to use the JSON APIs.  For many of the pages, if you pass &json=True, LNT will give you a JSON reply.  For example, go to a run page, click the check boxes next to a bunch of runs, and click graph.  When all the run lines pop up in the graph page, add json=True and it will download data for those tests on those machines.

For example, on the O3 tester, all the benchmarks in Multisource/Applications:

http://llvm.org/perf/db_default/v4/nts/graph?plot.1327=21.1327.0&plot.1053=21.1053.0&plot.1232=21.1232.0&plot.1483=21.1483.0&plot.1014=21.1014.0&plot.1138=21.1138.0&plot.1180=21.1180.0&plot.1288=21.1288.0&plot.1129=21.1129.0&plot.1425=21.1425.0&plot.1456=21.1456.0&plot.1038=21.1038.0&plot.1452=21.1452.0&plot.1166=21.1166.0&plot.1243=21.1243.0&plot.1116=21.1116.0&plot.1326=21.1326.0&plot.1279=21.1279.0&plot.1007=21.1007.0&plot.1394=21.1394.0&plot.1017=21.1017.0&plot.1443=21.1443.0&plot.1445=21.1445.0&plot.1197=21.1197.0&plot.1332=21.1332.0&json=True

I have scripts for converting the JSON data into Python pandas format, though the format is so simple you can really parse it with anything. I could contribute them if anyone would find them helpful. I know there are also scripts floating around for scraping all machines and runs from the LNT instance by first looking up the run and machine list, then fetching all the tests for each.
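
Here is a minimal sketch of pulling one of those JSON replies into pandas (the reply layout and the column names below are assumptions on my part, not LNT's documented schema; adjust once you have looked at a real reply):

    import requests
    import pandas as pd

    # Any graph URL with &json=True appended works here.
    GRAPH_URL = ("http://llvm.org/perf/db_default/v4/nts/graph"
                 "?plot.1327=21.1327.0&json=True")

    reply = requests.get(GRAPH_URL).json()

    # Assumed layout: a mapping of test name -> list of (run order, value) points.
    frames = []
    for test_name, points in reply.items():
        df = pd.DataFrame(points, columns=["order", "value"])
        df["test"] = test_name
        frames.append(df)

    data = pd.concat(frames, ignore_index=True)
    print(data.head())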


> On May 15, 2015, at 7:16 PM, Sean Silva <chisophugis at gmail.com> wrote:
> 
> Is there a way to download the data off http://llvm.org/perf? I'd like to help with this but I don't have a good dataset to analyze.
> 
> It definitely seems like the weakest part of the current and proposed scheme is that it only looks at two runs. That is basically useless when we're talking about only a handful of samples (<4???) per run. Since the machine's noise can be modeled from run to run (also sample to sample, but for simplicity just consider run to run) as a random process in the run number, all the techniques from digital filtering come into play. From looking at a couple of the graphs on LNT, the machine noise appears to be almost exclusively at Nyquist (i.e. it alternates from sample to sample) falling down to a bit at half Nyquist (I can analyze in more detail if I can get my hands on the data). We probably want a lowpass differentiator at about half Nyquist.
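> 
> A very crude sketch of that idea (a toy example, not a properly designed filter): a 2-tap moving average has zero gain at Nyquist, so it cancels the sample-to-sample alternation, and differencing the smoothed series gives a rough lowpass differentiator whose large values mark candidate step changes:
> 
>     import numpy as np
> 
>     # Per-run values for one benchmark, in run order (made-up numbers).
>     values = np.array([10.0, 10.5, 10.1, 10.6, 10.0, 11.6, 11.1, 11.7])
> 
>     # 2-tap moving average: zero response at Nyquist, removes the alternation.
>     smoothed = np.convolve(values, [0.5, 0.5], mode="valid")
> 
>     # First difference of the smoothed series ~ a crude lowpass differentiator.
>     steps = np.diff(smoothed)
>     print(steps)   # the large entries mark the step around run 5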
> 
> I would strongly recommend starting with a single benchmark on a single machine and coming up with a detection routine just for it that is basically 100% accurate, then generalizing as appropriate so that you are getting reliable coverage of a larger portion of the benchmarks. The machine's noise is probably easiest to characterize and most generalizable across runs.
> 
> 
> -- Sean Silva
> 
> On Fri, May 15, 2015 at 2:24 PM, Chris Matthews <chris.matthews at apple.com> wrote:
> tl;dr: in low-data situations we don’t look at past information, and that increases the false positive regression rate.  We should look at the (possibly incorrect) recent past runs to fix that.
> 
> Motivation: LNT’s current regression detection system has a false positive rate that is too high to make it useful.  With test suites as large as the LLVM test-suite, a single report will show hundreds of regressions.  The false positive rate is so high that the reports are ignored because it is impossible for a human to triage them; large performance problems are lost in the noise, and small but important regressions never even have a chance.  Later today I am going to commit a new unit test to LNT with 40 of my favorite regression patterns.  It has gems such as a flat but noisy line, a 5% regression in 5% noise, bimodal behavior, and a slow increase; we fail to classify most of these correctly right now. They are not trick questions: all are obvious regressions or non-regressions that are plainly visible. I want us to correctly classify them all!
> 
> Some context: LNT’s regression detection algorithm, as I understand it:
> 
> detect(current run’s samples, last run’s samples) -> improve, regress, or unchanged.
> 
>     # when recovering from errors performance should not be counted
>     Current or last run failed -> unchanged
> 
>     delta = min(current samples) - min(prev samples)
> 
>     # too small to measure
>     delta <  (confidence*machine noise threshold (0.0005s by default)) -> unchanged
> 
>     # too small to care
>     delta % < 1% -> unchanged
> 
>     # too small to care
>     delta < 0.01s -> unchanged
> 
>     if len(current samples) >= 4 && len(prev samples) >= 4
>          Mann-Whitney U test -> possibly unchanged
> 
>     # multisample, confidence interval check
>     if len(current samples) > 1
>            check whether delta is within the samples’ confidence interval -> if so, unchanged; else improve or regress.
> 
>     # single sample, range check
>     if len(current samples) == 1
>         any % delta above 1% -> improve or regress
> 
> 
> The “too small to care” rules are newer inventions.
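> 
> For reference, a runnable sketch of the above decision procedure (the thresholds are the ones listed; the confidence-interval step is simplified away, and this is a sketch, not LNT's actual code):
> 
>     from scipy.stats import mannwhitneyu
> 
>     NOISE_THRESHOLD = 0.0005   # seconds, machine noise threshold
>     MIN_PCT_DELTA = 0.01       # 1%: too small to care
>     MIN_ABS_DELTA = 0.01       # seconds: too small to care
> 
>     def detect(cur, prev, confidence=2.0, cur_failed=False, prev_failed=False):
>         """Classify two lists of samples; the confidence default is a guess."""
>         if cur_failed or prev_failed:
>             return "unchanged"            # recovering from errors: don't count
>         delta = min(cur) - min(prev)
>         if abs(delta) < confidence * NOISE_THRESHOLD:
>             return "unchanged"            # too small to measure
>         if abs(delta) / min(prev) < MIN_PCT_DELTA:
>             return "unchanged"            # too small to care (%)
>         if abs(delta) < MIN_ABS_DELTA:
>             return "unchanged"            # too small to care (absolute)
>         if len(cur) >= 4 and len(prev) >= 4:
>             _, p = mannwhitneyu(cur, prev)
>             if p > 0.05:
>                 return "unchanged"        # samples not distinguishable
>         return "regress" if delta > 0 else "improve"
> 
>     print(detect([10.5], [10.0]))          # single sample, 5% slower -> regress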
> 
> Effectiveness data: to see how well these rules work, I ran a 14-machine, 7-day report:
> 
> - 16773 run comparisons
> - 13852 marked unchanged because of small % delta
> - 2603 unchanged because of small delta
> - 0 unchanged because of Mann Whitney U test
> - 0 unchanged because of confidence interval
> - 318 improved or regressed because of a single-sample change over 1%
> 
> Real regressions: probably 1 or 2, not that I will click 318 links to check for sure… hence the motivation.
> 
> Observations: Most of the work is done by dropping small deltas.  Confidence intervals and Mann-Whitney U tests are the tests we want to be triggering; however, they only work with many samples. Even with reruns, most tests end up being a single sample.  LNT bots that are triggered after another build (unless using the multisample feature) just have one sample at each rev.  Multisample is not a good option because most runs already take a long time.
> 
> Even with a small amount of predictable noise, the len(current samples) == 1 path will flag a lot of runs, especially if len(prev) > 1.  Reruns actually make this worse by making it likely that we flag the next run after the run we rerun.  For instance, a flat line with 5% random noise flags all the time.
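> 
> To illustrate how often that fires, a toy simulation (not LNT code): a flat series with 5% Gaussian noise, checked pairwise against the 1% rule, flags the overwhelming majority of comparisons.
> 
>     import random
> 
>     random.seed(0)
>     flat = [1.0 + random.gauss(0, 0.05) for _ in range(100)]  # flat line, ~5% noise
> 
>     # Single-sample rule: flag any run whose % delta vs. the previous run
>     # exceeds 1%.
>     flags = sum(abs(b - a) / a > 0.01 for a, b in zip(flat, flat[1:]))
>     print(flags, "of", len(flat) - 1, "comparisons flagged")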
> 
> Besides the Mann-Whitney U test, we are not using prev_samples in any sane way.
> 
> Ideas:
> 
> - Try to get more samples in as many places as possible.  Maybe --multisample=4 should be the default?  Make bots run more often (I have already done this on Green Dragon).
> 
> - Use recent past run information to enhance single-sample regression detection.  I think we should add a lookback window and model the recent past.  I tried a technique suggested by Mikhail Zolotukhin of computing delta as the smallest difference between the current value and all the previous samples; it was far more effective (a sketch of this follows the list below).  Alternatively, we could try a confidence interval generated from previous runs, though that may not work on bimodal tests.
> 
> - Currently prev_samples is almost always just one other run, probably with only one sample itself.  Let’s give this more samples to work with: start passing more previous run data to all uses of the algorithm. In most places we intentionally limit the computation to current=run and previous=run-1; let’s do something like previous=run-[1-10]. The risk in this approach is that regression noise in the lookback window could trigger a false negative (we miss detecting a regression).  I think this is acceptable since we already miss lots of them because the reports are not actionable.
> 
> - Given the choice between false positives and false negatives, let’s err toward false negatives.  We need to have a manageable number of regressions detected or else we can’t act on them.
> 
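> A sketch of the lookback-delta idea mentioned above (one plausible reading of it; the window size, the min-of-samples reduction, and the helper name are assumptions):
> 
>     def lookback_delta(cur_samples, prev_runs, window=10):
>         """Delta against the closest value seen in the last `window` runs.
> 
>         cur_samples: samples from the current run.
>         prev_runs:   previous runs, newest first; each is a list of samples.
>         """
>         cur = min(cur_samples)
>         history = [min(r) for r in prev_runs[:window] if r]
>         if not history:
>             return None
>         # Smallest difference between the current value and any recent run:
>         # a noisy or bimodal series usually has some recent run close to the
>         # current value, so it is not flagged; a genuine step change is far
>         # from everything in the window.
>         return min((cur - h for h in history), key=abs)
> 
>     # Bimodal history, current value near one of the modes -> tiny delta.
>     print(lookback_delta([10.2], [[10.1], [12.0], [10.0], [12.1]]))
> 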
> Any objections to me implementing these ideas?
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
