<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:Helvetica;
panose-1:2 11 6 4 2 2 2 2 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.apple-converted-space
{mso-style-name:apple-converted-space;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-GB link=blue vlink=purple><div class=WordSection1><div><div><div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>First, we need more samples per revision. But we really don’t have time to do --multisample=10, since that takes far too long. The patch I am working on now, and will submit soon, implements client-side adaptive sampling based on server history. Put simply, it reruns the benchmarks that the server reports as regressed or improved. The idea is that if a benchmark is going to be flagged as a regression or improvement, we get more data on that specific benchmark to make sure that is the case. Adaptive sampling should reduce the false positive regression flagging rate we see. We are able to do this thanks to LNT’s provisional commit system. After a run, we submit all the results, but don’t commit them. The server reports the regressions, then we rerun the regressing benchmarks more times. This gives us more data in the places where we need it most. This has made a big difference on my local test machine.<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif";color:#1F497D'>| </span><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>As far as regression flagging goes, I have been working on a k-means discovery/clustering-based approach to first come up with a set of means in the dataset, then characterize newer data based on that. 
My hope is this can characterize multi-modal results,<span style='color:#1F497D'><o:p></o:p></span></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>| </span><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>be resilient to short spikes and detect long-term motion in the dataset. I have this prototyped in LNT, but I am still trying to work out the best criteria to flag regressions with. </span><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Basic question: I'm imagining the volume of data being dealt with isn't that large (as statistical datasets go) and you're discarding old values anyway (since we care if we're regressing wrt now rather than LLVM 1.1), so can't you just build a kernel density estimator of the "baseline" runtime and then estimate the probability that samples from a given codebase are going to be "slower" than the baseline? I suppose the drawback of not explicitly modelling the modes (with all its complications and tunings) is that you can't attempt to determine when a value is bigger than the lower cluster, even though it's smaller than the bigger cluster, and estimate whether it's evidence of a slowdown within the small-cluster regime. 
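</span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>To make the plain kernel density idea concrete, here is a minimal Python sketch, not anything that exists in LNT. It assumes Gaussian kernels with a rule-of-thumb bandwidth, and the names rule_of_thumb_bandwidth, normal_cdf and prob_slower are made up for illustration. With Gaussian kernels the probability that a candidate run is slower than a baseline run has a closed form, so no sampling is needed:<o:p></o:p></span></p>
<pre>
```python
import math
import statistics


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth for a Gaussian kernel (an assumed default)."""
    return 1.06 * statistics.stdev(samples) * len(samples) ** (-1 / 5)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_slower(baseline, candidate):
    """P(candidate run takes longer than baseline run) under two Gaussian KDEs.

    Each KDE is a mixture of Gaussians centred on the observed times, so the
    difference of two draws is a mixture of Gaussians centred on the pairwise
    differences, with the two bandwidths added in quadrature.
    """
    hb = rule_of_thumb_bandwidth(baseline)
    hc = rule_of_thumb_bandwidth(candidate)
    # Tiny floor avoids dividing by zero when every sample is identical.
    scale = max(math.sqrt(hb * hb + hc * hc), 1e-12)
    total = sum(normal_cdf((c - b) / scale) for b in baseline for c in candidate)
    return total / (len(baseline) * len(candidate))


# Hypothetical timings in seconds, not real LNT data.
baseline_times = [1.00, 1.02, 0.99, 1.01, 1.00]
candidate_times = [1.03, 1.05, 1.02, 1.04, 1.06]
print(prob_slower(baseline_times, candidate_times))
```
</pre>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>A value near 0.5 would mean no visible difference and a value near 1 would mean the candidate is almost always slower. This sketch does not model the modes explicitly, so it cannot do the within-cluster comparison described above.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>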
Still, that seems a bit complicated to do automatically.<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>(Incidentally, responding to the earlier email below, I think you don't really want to compare moving averages but rather use some statistical test to quantify whether the points within the "earlier window" are statistically significantly higher than those within the "later window"; all moving averages do is smear out useful information, which can be useful if you've just got far too many data points, but otherwise doesn't really help.)<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Cheers,<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'>Dave<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>Probably obvious anyway, but: since the LNT data is only as good as the setup it is run on, the other thing that has helped us is coming up with a set of best practices for running the benchmarks on a machine. A machine which is “stable” produces much better results, but achieving this is more complex than not playing Starcraft while LNT is running. You have to make sure power management is not mucking with clock rates, and that none of the magic backup/indexing/updating/networking/screensaver stuff on your machine is running. 
In practice, I have seen a process using 50% of the CPU on 1 core of 8 move the stddev of a good benchmark +5%, and having 2 cores loaded on an 8 core machine trigger hundreds of regressions in LNT.<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>Chris Matthews<br><a href="mailto:chris.matthews@.com">chris.matthews@.com</a><br>(408) 783-6335<o:p></o:p></span></p></div></div></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p><div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>On Jun 27, 2013, at 9:41 AM, Bob Wilson <<a href="mailto:bob.wilson@apple.com">bob.wilson@apple.com</a>> wrote:<o:p></o:p></span></p></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><br><br><o:p></o:p></span></p><div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><br>On Jun 27, 2013, at 9:27 AM, Renato Golin <<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>> wrote:<o:p></o:p></span></p></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><br><br><o:p></o:p></span></p><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>On 27 June 2013 17:05, Tobias Grosser<span class=apple-converted-space> </span><<a href="mailto:tobias@grosser.es" target="_blank">tobias@grosser.es</a>><span class=apple-converted-space> </span>wrote:<o:p></o:p></span></p><div><div><blockquote style='border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm'><div><div><p 
class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif";color:#222222'>We are looking for a good</span><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'> <span style='color:#222222'>way/value to show the reliability of individual results in the UI. Do you have any experience with what makes a good measure of the reliability of test results?</span><o:p></o:p></span></p></div></div></blockquote></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>Hi Tobi,<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>I had a look at this a while ago, but never got around to actually working on it. My idea was to never use point-changes as an indication of progress/regressions, unless there was a significant change (2/3 sigma). What we should do is compare the current moving average (of K runs) with past moving averages, both the last average and the (N-K)th moving average (to make sure previous values included in the current moving average are not toning it down/up), and keep the biggest difference as the final result.<o:p></o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>We should also compare the current mov-avg with M non-overlapping mov-avgs before, and calculate whether we're monotonically increasing or decreasing, or whether there is a difference of 2/3 sigma between the current mov-avg (N) and the (N-M)th mov-avg. 
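</span></p><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>As a rough Python sketch of that comparison (nothing like this is implemented, and the names window_means and classify_trend are hypothetical): it takes the last M + 1 non-overlapping windows of K runs and checks for a monotonic trend and for a shift of more than n_sigma between the oldest and the current window. Taking sigma from the oldest window alone is an assumption made here for illustration:<o:p></o:p></span></p>
<pre>
```python
import statistics


def window_means(samples, k, m):
    """Means of the last m + 1 non-overlapping windows of k runs, oldest first."""
    if 1 > m or 2 > k:
        raise ValueError("need m of at least 1 and k of at least 2")
    needed = k * (m + 1)
    if needed > len(samples):
        raise ValueError("not enough runs for the requested windows")
    recent = samples[len(samples) - needed:]
    return [statistics.mean(recent[i:i + k]) for i in range(0, needed, k)]


def classify_trend(samples, k, m, n_sigma=2.0):
    """Compare the current window of k runs with the m windows before it."""
    means = window_means(samples, k, m)
    steps = [later - earlier for earlier, later in zip(means, means[1:])]
    # Assumption: sigma is the sample standard deviation of the oldest window.
    oldest = samples[len(samples) - k * (m + 1):len(samples) - k * m]
    sigma = statistics.stdev(oldest)
    shift = means[-1] - means[0]
    return {
        "increasing": all(step > 0 for step in steps),
        "decreasing": all(0 > step for step in steps),
        "significant": abs(shift) > n_sigma * sigma,
        "shift": shift,
    }


# Hypothetical run times, oldest first; higher means slower.
history = [10.0, 10.1, 9.9, 11.0, 11.1, 10.9, 12.0, 12.1, 11.9]
print(classify_trend(history, k=3, m=2))
```
</pre>
<p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>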
That would give us an idea on the trends of each test.<o:p></o:p></span></p></div></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'><o:p> </o:p></span></p></div><div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>Chris Matthews has recently been working on implementing something similar to that. Chris, can you share some details?<o:p></o:p></span></p></div><p class=MsoNormal><span style='font-size:10.5pt;font-family:"Helvetica","sans-serif"'>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a><span class=apple-converted-space> </span> <a href="http://llvm.cs.uiuc.edu/">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><o:p></o:p></span></p></div></div></div></div></div><p class=MsoNormal><o:p> </o:p></p></div></body></html>