[LLVMdev] Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

Sean Silva chisophugis at gmail.com
Fri May 15 22:46:24 PDT 2015


Thanks for the info!

I actually scraped by (no pun intended) for a first look by scraping the
source of
http://llvm.org/perf/db_default/v4/nts/graph?hide_lineplot=yes&show_points=yes&moving_window_size=10&plot.0=21.1419.0&submit=Update#
which I happened to choose. I'll definitely take advantage of the JSON API
in the future!

Do you know how the individual data points correspond to the actual
samples? With "show all sample points", multiple points are shown at each
revision tested, suggesting that the data is stored somewhere. Also, "show
all sample points" does not show the multiplicity at all, so it is hard to
tell what each point represents.

If you're interested, the benchmark I looked at appears to behave similarly
to some of the ones you mentioned downthread, so you might find it
interesting:
https://drive.google.com/file/d/0B8v10qJ6EXRxdWFIN1dnYWVQWGc/view?usp=sharing
A lowpass differentiator with a cutoff of about Pi/10 seems to do a good
job for that machine's noise characteristics. The big bimodal jumps of
course need their own special cleanup before applying it though.
Also, it appears that the noise spectrum changes in ways that are not
correlated with the major performance changes... do you know if that
machine has had its configuration changed?
It would be useful to track some other data points besides just the time,
like total cycles, total instructions retired, total icache misses, branch
prediction ratio, page faults, and maybe a few other things, just so that we
have a few more degrees of freedom along which to look for verification.
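
To be concrete, here is a minimal sketch of the kind of lowpass
differentiator I mean, done as a moving average followed by a first
difference; the window length is just a placeholder for tuning to that
machine's noise, not a recommendation:

    import numpy as np

    def lowpass_diff(times, window=10):
        # Smooth with a moving average (crude lowpass, cutoff roughly
        # pi/window radians per sample), then take the first difference
        # so sustained level shifts stand out while alternating
        # sample-to-sample noise is attenuated.
        kernel = np.ones(window) / window
        smoothed = np.convolve(times, kernel, mode="valid")
        return np.diff(smoothed)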

-- Sean Silva


On Fri, May 15, 2015 at 8:46 PM, Chris Matthews <chris.matthews at apple.com>
wrote:

> The easiest way to get the data off llvm.org/perf is to use the JSON APIs.
> For many of the pages, if you pass &json=True, LNT will give you a JSON
> reply.  For example, go to a run page, click the check boxes next to a
> bunch of runs, and click graph.  When all the run lines pop up in the graph
> page, append &json=True and it will download the data for those tests on
> those machines.
>
> For example, on the O3 tester, all the benchmarks in
> Multisource/Applications:
>
>
> http://llvm.org/perf/db_default/v4/nts/graph?plot.1327=21.1327.0&plot.1053=21.1053.0&plot.1232=21.1232.0&plot.1483=21.1483.0&plot.1014=21.1014.0&plot.1138=21.1138.0&plot.1180=21.1180.0&plot.1288=21.1288.0&plot.1129=21.1129.0&plot.1425=21.1425.0&plot.1456=21.1456.0&plot.1038=21.1038.0&plot.1452=21.1452.0&plot.1166=21.1166.0&plot.1243=21.1243.0&plot.1116=21.1116.0&plot.1326=21.1326.0&plot.1279=21.1279.0&plot.1007=21.1007.0&plot.1394=21.1394.0&plot.1017=21.1017.0&plot.1443=21.1443.0&plot.1445=21.1445.0&plot.1197=21.1197.0&plot.1332=21.1332.0&json=True
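>
> A minimal sketch of pulling one of those replies down (using the plot id
> from the graph page Sean linked; this just inspects the top level of the
> reply rather than assuming anything about its layout):
>
>     import json
>     import urllib.request
>
>     url = ("http://llvm.org/perf/db_default/v4/nts/graph"
>            "?plot.0=21.1419.0&json=True")
>     with urllib.request.urlopen(url) as resp:
>         data = json.loads(resp.read().decode("utf-8"))
>     # Inspect the top level before assuming anything about the schema.
>     print(type(data).__name__)
>     print(sorted(data) if isinstance(data, dict) else len(data))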
>
> I have scripts for converting the json data into Python Pandas format,
> though the format is so simple you can really parse it with anything. I
> could contribute them if anyone would find them helpful. I know there are
> also scripts floating around for scraping all machines and runs from the
> LNT instance by first looking up the run and machine list, then fetching
> all the tests for each.
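>
> Roughly the shape of thing I mean, with made-up (revision, seconds) pairs
> standing in for values pulled out of the JSON reply:
>
>     import pandas as pd
>
>     points = [(235001, 1.02), (235004, 1.01), (235007, 1.07)]
>     df = pd.DataFrame(points, columns=["revision", "seconds"])
>     # One row per revision, keeping the best (minimum) sample, which is
>     # what the current detection rule compares.
>     print(df.groupby("revision")["seconds"].min())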
>
>
> On May 15, 2015, at 7:16 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
> Is there a way to download the data off http://llvm.org/perf? I'd like to
> help with this but I don't have a good dataset to analyze.
>
> It definitely seems like the weakest part of the current and proposed
> scheme is that it only looks at two runs. That is basically useless when
> we're talking about only a handful of samples (<4???) per run. Since the
> machine's noise can be modeled from run to run (also sample to sample, but
> for simplicity just consider run to run) as a random process in the run
> number, all the techniques from digital filtering come into play. From
> looking at a couple of the graphs on LNT, the machine noise appears to be
> concentrated almost exclusively at Nyquist (i.e., it alternates from sample
> to sample), falling off to a small component at around half Nyquist (I can
> analyze in more detail if I can get my hands on the data). We probably want
> a lowpass differentiator with a cutoff at about half Nyquist.
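>
> If anyone wants to eyeball their own bot, a crude spectrum estimate like
> this is enough to see whether the energy really sits at Nyquist (the times
> below are made up; substitute real per-run samples):
>
>     import numpy as np
>
>     # Made-up per-run times for one benchmark; swap in real samples.
>     times = np.array([1.00, 1.06, 1.01, 1.07, 1.00, 1.06, 1.02, 1.08])
>     detrended = times - times.mean()
>     spectrum = np.abs(np.fft.rfft(detrended))
>     freqs = np.fft.rfftfreq(len(times))  # 0.5 cycles/run == Nyquist
>     for f, mag in zip(freqs, spectrum):
>         print("%.3f cycles/run: magnitude %.3f" % (f, mag))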
>
> I would strongly recommend starting with a single benchmark on a single
> machine and coming up with a detection routine just for it that is basically
> 100% accurate, then generalizing as appropriate so that you are getting
> reliable coverage of a larger portion of the benchmarks. The machine's
> noise is probably easiest to characterize and most generalizable across
> runs.
>
>
> -- Sean Silva
>
> On Fri, May 15, 2015 at 2:24 PM, Chris Matthews <chris.matthews at apple.com>
> wrote:
>
>> tl;dr in low data situations we don’t look at past information, and that
>> increases the false positive regression rate.  We should look at the
>> possibly incorrect recent past runs to fix that.
>>
>> Motivation: LNT’s current regression detection system has a false positive
>> rate that is too high to make it useful.  With test suites as large as the
>> llvm “test-suite”, a single report will show hundreds of regressions.  The
>> false positive rate is so high that the reports are ignored: it is
>> impossible for a human to triage them, large performance problems are lost
>> in the noise, and small but important regressions never even have a chance.
>> Later today I am going to commit a new unit test to LNT with 40 of my
>> favorite regression patterns.  It has gems such as a flat but noisy line, a
>> 5% regression in 5% noise, bimodal behavior, and a slow increase; we fail
>> to classify most of these correctly right now. They are not trick
>> questions: all are obvious regressions or non-regressions that are plainly
>> visible. I want us to correctly classify them all!
>>
>> Some context: LNT’s regression detection algorithm as I understand it:
>>
>> detect(current run’s samples, last run’s samples) -> improve, regress, or
>> unchanged.
>>
>>     # when recovering from errors, performance should not be counted
>>     current or last run failed -> unchanged
>>
>>     delta = min(current samples) - min(prev samples)
>>
>>     # too small to measure
>>     delta < (confidence * machine noise threshold (0.0005s by default))
>> -> unchanged
>>
>>     # too small to care
>>     delta % < 1% -> unchanged
>>
>>     # too small to care
>>     delta < 0.01s -> unchanged
>>
>>     if len(current samples) >= 4 && len(prev samples) >= 4
>>         Mann-Whitney U test -> possibly unchanged
>>
>>     # multisample, confidence interval check
>>     if len(current samples) > 1
>>         check whether delta is within the samples’ confidence interval ->
>> if so, unchanged; else improve or regress
>>
>>     # single sample, range check
>>     if len(current samples) == 1
>>         all % deltas above 1% -> improve or regress
>>
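>> As a sanity check, here is the same logic as a runnable paraphrase (not
>> the actual LNT code; the confidence multiplier and p-value cutoff are
>> guesses, and the confidence-interval branch is left out):
>>
>>     from scipy.stats import mannwhitneyu
>>
>>     NOISE_THRESHOLD = 0.0005   # seconds, per-machine default
>>     CONFIDENCE = 2.576         # guessed multiplier, not LNT's value
>>     MIN_PERCENT = 0.01         # "too small to care", relative
>>     MIN_DELTA = 0.01           # "too small to care", seconds
>>
>>     def detect(current, prev):
>>         if not current or not prev:
>>             return "unchanged"   # recovering from a failed run
>>         delta = min(current) - min(prev)
>>         if abs(delta) < CONFIDENCE * NOISE_THRESHOLD:
>>             return "unchanged"   # too small to measure
>>         if abs(delta) / min(prev) < MIN_PERCENT or abs(delta) < MIN_DELTA:
>>             return "unchanged"   # too small to care
>>         if len(current) >= 4 and len(prev) >= 4:
>>             if mannwhitneyu(current, prev)[1] > 0.05:
>>                 return "unchanged"
>>         return "regressed" if delta > 0 else "improved"
>>
>>     # detect([1.07], [1.00]) -> 'regressed'
>>     # detect([1.001], [1.00]) -> 'unchanged'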
>>
>> The “too small to care” rules are newer inventions.
>>
>> Effectiveness data: to see how well these rules work, I ran a 14-machine,
>> 7-day report:
>>
>> - 16773 run comparisons
>> - 13852 marked unchanged because of small % delta
>> - 2603 unchanged because of small delta
>> - 0 unchanged because of Mann Whitney U test
>> - 0 unchanged because of confidence interval
>> - 318 improved or regressed because of a single-sample change over 1%
>>
>> Real regressions: probably 1 or 2, not that I will click 318 links to
>> check for sure… hence the motivation.
>>
>> Observations: Most of the work is done by dropping small deltas.
>> Confidence intervals and Mann-Whitney U tests are the tests we want to be
>> triggering; however, they only work with many samples. Even with reruns,
>> most tests end up being a single sample.  LNT bots that are triggered after
>> another build (unless they use the multisample feature) have just one
>> sample at each rev.  Multisample is not a good option because most runs
>> already take a long time.
>>
>> Even with a small amount of predictable noise, the len(current samples) ==
>> 1 path will flag a lot of samples, especially if len(prev) > 1.  Reruns
>> actually make this worse by making it likely that we flag the next run
>> after the run we rerun.  For instance, a flat line with 5% random noise
>> flags all the time.
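>>
>> A quick simulation shows how often a flat line with 5% noise trips the
>> single-sample rule (simulated values, not measurements from a bot):
>>
>>     import random
>>
>>     random.seed(0)
>>     # Flat benchmark, one sample per run, +/-5% uniform noise.
>>     runs = [1.0 + random.uniform(-0.05, 0.05) for _ in range(1000)]
>>     flagged = sum(abs(b - a) / a > 0.01 for a, b in zip(runs, runs[1:]))
>>     print("%d of %d consecutive-run comparisons flagged" %
>>           (flagged, len(runs) - 1))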
>>
>> Besides the Mann-Whitney U test, we are not using prev_samples in any
>> sane way.
>>
>> Ideas:
>>
>> - Try and get more samples in as many places as possible.  Maybe
>> --multisample=4 should be the default?  Make bots run more often (I have
>> already done this on green dragon).
>>
>> - Use recent past run information to enhance single-sample regression
>> detection.  I think we should add a lookback window and model the recent
>> past.  I tried a technique suggested by Mikhail Zolotukhin of computing the
>> delta as the smallest difference between the current sample and all the
>> previous samples (see the sketch after this list).  It was far more
>> effective.  Alternatively, we could try a confidence interval generated
>> from previous runs, though that may not work on bimodal tests.
>>
>> - Currently prev_samples is almost always just one other run, probably
>> with only one sample itself.  Let’s give this more samples to work with:
>> start passing more previous run data to all uses of the algorithm. In most
>> places we intentionally limit the computation to current=run and
>> previous=run-1; let’s do something like previous=run-[1-10]. The risk in
>> this approach is that regression noise in the look-back window could
>> trigger a false negative (we miss detecting a regression).  I think this is
>> acceptable since we already miss lots of them because the reports are not
>> actionable.
>>
>> - Given the choice between false positives and false negatives, let’s err
>> toward false negatives.  We need to have a manageable number of regressions
>> detected or else we can’t act on them.
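>>
>> The smallest-difference delta mentioned in the look-back idea above is
>> roughly this (a sketch of the idea, not a patch):
>>
>>     def closest_delta(current_value, lookback_samples):
>>         # Compare against the nearest sample in the look-back window:
>>         # if any recent run saw a similar time, the delta stays small
>>         # and the comparison reads as unchanged.
>>         return min((current_value - s for s in lookback_samples), key=abs)
>>
>>     # closest_delta(1.07, [1.00, 1.06, 1.01]) -> ~0.01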
>>
>> Any objections to me implementing these ideas?
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>
>
>