<div dir="ltr">Hi Chris,<div><br></div><div>It's great that someone with a proper background is finally looking at this. You're much better equipped than I am to deal with it, so I'll trust your judgement; I have paid more attention to correctness than to benchmarks. Some comments inline.</div>

<div class="gmail_extra"><br><br><div class="gmail_quote">On 27 June 2013 19:14, Chris Matthews <span dir="ltr">&lt;<a href="mailto:chris.matthews@apple.com" target="_blank">chris.matthews@apple.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div>1) Some benchmarks are bi-modal or multi-modal, single means won’t describe these well</div>

</div></div></blockquote><div><br></div><div>True. My idea was to have a moving "measurement", where the basic one is the average, with others applied as well. It's possible that k-means can give you that, but I don't yet understand what your vector space and distance measure would be, so I can't guess.</div>
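<div><br></div><div>To make the question concrete, here is a minimal sketch of the simplest setup I can imagine: each sample is a scalar timing (a one-dimensional vector space) and the distance is the absolute difference. This is only an assumption for illustration, not necessarily what the LNT prototype does; the function name <code>kmeans_1d</code> and its parameters are hypothetical.</div>

```python
import statistics


def kmeans_1d(samples, k=2, iterations=20):
    """Cluster scalar timings into k modes using |a - b| as the distance."""
    ordered = sorted(samples)
    # Seed the centres with evenly spaced order statistics (an assumption,
    # chosen so that the result is deterministic).
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in samples:
            # Assign each sample to its nearest centre.
            nearest = min(range(k), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Move each centre to the mean of its cluster; keep empty ones in place.
        updated = [statistics.fmean(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)
```

<div>For a bi-modal benchmark, the two returned centres would stand in for the single mean, and a new sample would be compared against the centre it falls closest to.</div>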

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div>2) Some runs are pretty noisy and sometimes have very large single sample spikes</div>

<div>3) Most benchmarks don’t regress most of the time</div></div></div></blockquote><div><br></div><div>Most ARM benchmarks regress all the time, because both the signal and the noise are in milliseconds, where machine and OS interference play a crucial part. But they don't regress over time, and they keep their average AND deviation forever. So, if you can filter the noise on *all* benchmarks, it'd be great for ARM testing.</div>
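<div><br></div><div>One possible way to use a stable average and deviation as a noise filter is sketched below. It is only an illustration, assuming a plain list of past timings; the name <code>is_outside_noise</code> and the default of three standard deviations are my own choices, not anything LNT provides.</div>

```python
import statistics


def is_outside_noise(history, sample, sigmas=3.0):
    """Return True when sample lies more than `sigmas` standard
    deviations away from the mean of the historical timings."""
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)  # needs at least two samples
    return abs(sample - mean) > sigmas * spread
```

<div>With a history of 100, 104, 96, 102 and 98 ms, a new sample of 105 ms stays inside the band, while 115 ms falls outside it.</div>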

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>5) A regression is not really something to worry about unless it lasts for a while (some number of revisions or days or samples)<br>

</div><div>6) We also need to catch long slow regressions</div></div></blockquote><div><br></div><div>Yup. Moving peak and trend.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word"><div>7) Some of the “benchmarks” are really just correctness tests, and were not designed with repeatable measurement in mind.</div></div></blockquote><div><br></div><div>Yes. It would be great to move them to Application, and *not* time their execution. Benchmarks are specifically designed to test execution time; applications aren't.</div>

<div><br></div><div>If we think an application is important enough that we want to measure it, we should actively turn it into a benchmark, making sure it's actually performing the core functionality in a repeatable way and with enough confidence that noise isn't playing a part in the numbers. Just throwing it in and timing its execution will create a school of red herrings.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>After a run, we submit all the results, but don’t commit them. The server reports the regressions, then we rerun the regressing benchmarks more times.  This gives us more data in the places where we need it most.  This has made a big difference on my local test machine.<br>

</div></div></blockquote><div><br></div><div>This is a great idea, and I think it could improve things at a much lower cost. It won't replace decent benchmarking strategies on the software level, but it will reduce the noise, hopefully enough to allow other analysis to be successful at an early stage.</div>
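<div><br></div><div>As I understand the idea, it boils down to something like the sketch below: run once, and only spend more runs where the first result looks like a regression. This is not LNT's API; <code>run_benchmark</code>, <code>baseline</code>, <code>threshold</code> and <code>extra_runs</code> are hypothetical names, and taking the median of the reruns is my own assumption.</div>

```python
import statistics


def measure_adaptively(run_benchmark, baseline, threshold=0.05, extra_runs=5):
    """Run once; if the result looks like a regression, rerun the
    benchmark and report the median of all samples instead."""
    samples = [run_benchmark()]
    # Only benchmarks that look slower than the baseline get extra runs.
    if samples[0] > baseline * (1 + threshold):
        samples += [run_benchmark() for _ in range(extra_runs)]
    return statistics.median(samples), len(samples)
```

<div>A single spike on the first run then costs five extra runs for that one benchmark, rather than extra runs for the whole suite.</div>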

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div></div><div>As far as regression flagging goes, I have been working on a k-means discovery/clustering based approach to first come up with a set of means in the dataset, then characterize newer data based on that.  My hope is this can characterize multi-modal results, be resilient to short spikes and detect long term motion in the dataset.  I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regression with. <br>

</div></div></blockquote><div><br></div><div>I'd like to understand that better (mostly for personal education). But it can be offline, if the rest of the list is not interested...</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word"><div></div><div><div>You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running.  In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8 core machine trigger hundreds of regressions in LNT.<br>

</div></div></div></blockquote><div></div></div><br></div><div class="gmail_extra">I have seen this too. I think LNT has two modes: test and benchmark (not sure how to switch), but one tries to use all possible cores (unstable benchmarks) and the other runs using a single core all the way. I think we could assume that, for tests, we can use as much juice as we have available, and for benchmarks, we could use less than the total number of cores (the practical number can vary depending on the arch).</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">It's better to re-run some benchmarks 10 times while using 8 CPUs than to use only one...</div><div class="gmail_extra"><br></div><div class="gmail_extra">cheers,</div>

<div class="gmail_extra">--renato</div></div>