[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

Chris Matthews chris.matthews at apple.com
Fri May 15 14:24:46 PDT 2015


tl;dr: in low-data situations we do not look at past information, and that increases the false-positive regression rate. We should look at the recent past runs, even though they may themselves be wrong, to fix that.

Motivation: LNT's current regression detection system has a false positive rate that is too high to make it useful. With a test suite as large as the LLVM "test-suite", a single report will show hundreds of regressions. The false positive rate is so high that the reports are ignored: it is impossible for a human to triage them, large performance problems are lost in the noise, and small but important regressions never even get a chance. Later today I am going to commit a new unit test to LNT with 40 of my favorite regression patterns. It has gems such as a flat but noisy line, a 5% regression in 5% noise, bimodal results, and a slow increase; we fail to classify most of these correctly right now. They are not trick questions: all are obvious regressions or non-regressions that are plainly visible. I want us to correctly classify them all!

Some context: LNT's regression detection algorithm, as I understand it:

detect(current run's samples, last run's samples) -> improve, regress, or unchanged.

    # when recovering from errors, performance should not be counted
    current or last run failed -> unchanged

    delta = min(current samples) - min(prev samples)

    # too small to measure
    delta < (confidence * machine noise threshold (0.0005s by default)) -> unchanged

    # too small to care
    delta % < 1% -> unchanged

    # too small to care
    delta < 0.01s -> unchanged

    if len(current samples) >= 4 and len(prev samples) >= 4
        Mann-Whitney U test -> possibly unchanged

    # multisample, confidence interval check
    if len(current samples) > 1
        check whether delta is within the samples' confidence interval -> if so, unchanged; else improve or regress

    # single sample, range check
    if len(current samples) == 1
        all % deltas above 1% -> improve or regress
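
For concreteness, here is a minimal runnable Python sketch of that cascade. It is not LNT's actual code; the thresholds, the scipy-based Mann-Whitney step, and the min/max spread check are stand-ins for the checks described above:

    # Sketch only: mirrors the cascade above, not LNT's real implementation.
    from scipy.stats import mannwhitneyu

    UNCHANGED, IMPROVE, REGRESS = 'unchanged', 'improve', 'regress'

    def detect(current, prev, current_failed=False, prev_failed=False,
               confidence=2.576, noise_threshold=0.0005):
        # When recovering from errors, performance should not be counted.
        if current_failed or prev_failed:
            return UNCHANGED

        delta = min(current) - min(prev)
        pct_delta = delta / min(prev)

        # Too small to measure, or too small to care.
        if abs(delta) < confidence * noise_threshold:
            return UNCHANGED
        if abs(pct_delta) < 0.01 or abs(delta) < 0.01:
            return UNCHANGED

        # With enough samples on both sides, ask whether the two sample
        # sets plausibly come from the same distribution.
        if len(current) >= 4 and len(prev) >= 4:
            _, p = mannwhitneyu(current, prev, alternative='two-sided')
            if p > 0.05:
                return UNCHANGED

        # Multisample: stand-in for the confidence interval check; is
        # min(prev) inside the spread of the current samples?
        if len(current) > 1 and min(current) <= min(prev) <= max(current):
            return UNCHANGED

        # Single sample, or anything that survived the checks above.
        return IMPROVE if delta < 0 else REGRESS

For example, detect([1.05], [1.00, 0.99]) returns regress even though a single 5% sample could easily be noise, which is exactly the failure mode discussed below.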


The "too small to care" rules are newer inventions.

Effectiveness data: to see how well these rules work, I ran a 14-machine, 7-day report:

- 16773 run comparisons
- 13852 marked unchanged because of a small % delta
- 2603 marked unchanged because of a small absolute delta
- 0 marked unchanged because of the Mann-Whitney U test
- 0 marked unchanged because of the confidence interval
- 318 marked improved or regressed because of a single-sample change over 1%

Real regressions: probably 1 or 2; not that I will click 318 links to check for sure… hence the motivation.

Observations: most of the work is done by dropping small deltas. The confidence interval and Mann-Whitney U tests are the tests we want to be triggering, but they only work with many samples. Even with reruns, most tests end up being a single sample. LNT bots that are triggered after another build (unless they use the multisample feature) have just one sample at each revision. Multisample is not a good option because most runs already take a long time.

Even with a small amount of predictable noise, the len(current samples) == 1 path will flag a lot of runs, especially if len(prev samples) > 1: min(prev samples) is biased low, so a single new sample from the same distribution usually looks like a regression. Reruns actually make this worse by making it likely that we flag the run after the one we reran. For instance, a flat line with 5% random noise flags all the time.
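
To illustrate with a hypothetical simulation (not LNT output): take a perfectly flat 1.0s test with 5% uniform noise, give the previous run three samples and the current run one, and count how often the min-vs-min delta clears the 1% and 0.01s gates:

    # Hypothetical illustration: a flat 1.0s test with 5% uniform noise.
    import random

    random.seed(0)
    trials, flagged = 10000, 0
    for _ in range(trials):
        prev = [random.uniform(0.95, 1.05) for _ in range(3)]  # rerun: 3 prev samples
        cur = random.uniform(0.95, 1.05)                       # current run: 1 sample
        delta = cur - min(prev)
        if abs(delta) / min(prev) >= 0.01 and abs(delta) >= 0.01:
            flagged += 1  # the single-sample rule would report improve/regress
    print('flagged %.0f%% of runs on a test that never changed'
          % (100.0 * flagged / trials))

Most of the flags come out as regressions rather than improvements, because min(prev samples) sits near the bottom of the noise band.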

Besides the Mann-Whitney U test, we are not using prev_samples in any sane way.

Ideas: 

- Try to get more samples in as many places as possible. Maybe --multisample=4 should be the default? Make bots run more often (I have already done this on Green Dragon).

- Use recent past run information to enhance single-sample regression detection. I think we should add a lookback window and model the recent past. I tried a technique suggested by Mikhail Zolotukhin of computing delta as the smallest difference between the current value and all the previous samples; it was far more effective (see the sketch after this list). Alternatively, we could try a confidence interval generated from the previous samples, though that may not work on bimodal tests.

- Currently prev_samples is almost always just one other run, probably with only one sample itself. Let's give this more samples to work with: start passing more previous run data to all uses of the algorithm. In most places we intentionally limit the computation to current=run and previous=run-1; let's do something like previous=run-[1-10]. The risk in this approach is that regression noise in the lookback window could trigger a false negative (we miss detecting a regression). I think this is acceptable, since we already miss lots of them because the reports are not actionable.

- Given the choice between false positives and false negatives, let's err towards false negatives. We need a manageable number of detected regressions, or else we can't act on them.
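
Here is a rough sketch of the lookback-window delta from the second bullet above (hypothetical code, not a patch): pool the samples from the last few runs and take delta as the difference between the current value and the closest of them, so anything that matches a recently seen value falls through the existing small-delta rules as unchanged:

    # Hypothetical sketch of a lookback-window delta; not LNT code.
    def lookback_delta(current_value, prev_runs, window=10):
        """prev_runs: list of runs, newest first; each run is a list of samples."""
        window_samples = [s for run in prev_runs[:window] for s in run]
        if not window_samples:
            return None
        # Smallest difference between the current value and any recent sample:
        # if the current value matches anything we have seen lately, the delta
        # is ~0 and the small-delta rules above call it unchanged.
        return min((current_value - s for s in window_samples), key=abs)

    # Example: a bimodal test that alternates between ~1.0s and ~1.2s.
    prev = [[1.21], [0.99], [1.20], [1.01], [1.19]]
    print(lookback_delta(1.015, prev))  # ~0.005 -> well under the thresholds, unchanged
    print(lookback_delta(1.35, prev))   # ~0.14  -> a real regression still shows up

A bimodal test that keeps alternating between its two modes stays unchanged under this scheme, while a jump outside anything seen in the window still produces a delta large enough to flag.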

Any objections to me implementing these ideas?


