[PATCH] [LNT] Use Mann-Whitney U test to identify changes

Yi Kong Yi.Kong at arm.com
Fri May 2 16:48:28 PDT 2014


Dear Chris,

I've updated and merged the two patches. Please test them, as I don't have access to the work machine right now.

I'm not sure how to share a constant between the Python and HTML code; give me a hint if you know an easy way.
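One idea I had (only a sketch, assuming the LNT UI is rendered through Flask/Jinja2; the constant name and hook below are my own, not existing LNT identifiers) is a Jinja2 context processor, so the value lives in exactly one place on the Python side:

    # Hypothetical sketch: MANN_WHITNEY_CONFIDENCE_DEFAULT and inject_constants
    # are made-up names, not existing LNT code.
    MANN_WHITNEY_CONFIDENCE_DEFAULT = 0.9

    from flask import Flask

    app = Flask(__name__)

    @app.context_processor
    def inject_constants():
        # Every template rendered by this app can now use
        # {{ mw_confidence_default }} instead of repeating the literal 0.9.
        return {"mw_confidence_default": MANN_WHITNEY_CONFIDENCE_DEFAULT}

If something along those lines is acceptable, the default would only need to be changed in one place.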

It was my intention to keep the same behaviour as variance (always shown regardless of the choice). I could change that if you want.

Cheers,
Yi Kong
________________________________________
From: Chris Matthews [chris.matthews at apple.com]
Sent: 03 May 2014 00:17
To: Yi Kong
Cc: Tobias Grosser; llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] [LNT] Use Mann-Whitney U test to identify changes

As for the p-value patch:

In v4_runs:

<td><input type="checkbox" name="show_p_value" value="yes" {{ "checked" if options.hide_report_by_default else ""}}></td>

is a copy-and-paste error (it tests the wrong option).
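Presumably it should test the new option instead; just as a guess at the intended condition, something like:

<td><input type="checkbox" name="show_p_value" value="yes" {{ "checked" if options.show_p_value else ""}}></td>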


In v4_run.html:

  {% if options.show_p_value %}
    <td>{{cr.get_p_value()}}</td>
  {% endif %}

Could you limit that to 4 digits, like the other columns?
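For example, something like this (assuming the standard Jinja2 format filter is available in these templates) would do:

    <td>{{ "%.4f"|format(cr.get_p_value()) }}</td>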

Was it your intention to always show p-values in the run-over-run report, even if 'show p-values' is not ticked?

On May 2, 2014, at 5:59 AM, Yi Kong <Yi.Kong at arm.com> wrote:

> On 02/05/14 11:02, Tobias Grosser wrote:
>> On 02/05/2014 11:16, Yi Kong wrote:
>>> Hi Tobias,
>>>
>>> On 02/05/14 08:38, Tobias Grosser wrote:
>>>> On 01/05/2014 23:27, Yi Kong wrote:
>>>>> This patch adds Mann-Whitney U tests to identify changes, as
>>>>> suggested by Tobias and Anton. User is able to configure the desired
>>>>> confidence level.
>>>>
>>>> Hi Yi Kong,
>>>>
>>>> thanks for this nice patch. I looked into it briefly by setting up an
>>>> LNT server and adding a couple of the -O3 nightly test results to it. It
>>>> seems that, at least with the default 0.5 confidence level, this does not
>>>> reduce the noise at all. Just switching the aggregation function from
>>>> minimum to mean helps here a lot more (any idea why?).
>>>
>>> The median is far less affected by variance than the minimum; the minimum
>>> may even be an outlier.
>>
>> I previously suggested changing this, but Chris had concerns. If you
>> also believe it would be good to show the median by default, maybe it is
>> worth resubmitting a patch?
>
> I'm not sure what concern Chris has.
>
>>>> Did you play with the confidence level and get an idea of which level
>>>> would be useful? My very brief experiments showed that a value of 0.999 or
>>>> even 0.9999 is something that gets us below the noise level. I verified
>>>> this by looking at subsequent runs where the commits themselves really
>>>> were just documentation commits; those commits should not show any noise.
>>>> Even with those high confidence requirements, certain performance
>>>> regressions such as r205965 can still be spotted. For me, this is already
>>>> useful, as we can ask for extremely low-noise answers, which will help to
>>>> at least catch the very clear performance regressions (compared to today,
>>>> where even those are hidden in the reporting noise).
>>>
>>> I've been experimenting with the same dataset as yours. It seems 0.9
>>> eliminates some of the noise, but not enough. 0.999 produces very nice
>>> results, but:
>>>  >>> scipy.stats.mannwhitneyu([1,1,1,1,1],[2,2,2,2,1])
>>> (2.5, 0.0099822266526080217)
>>> That p-value corresponds to a confidence of only 0.99, so even this clear
>>> change would be rejected at 0.999. Anything greater than 0.9 will cause too
>>> many false negatives.
>>
>> We are running 10 runs, no?
>>
>>  >>> scipy.stats.mannwhitneyu(
>>      [1,1,1,1,1,1,1,1,1,1],[2,2,2,2,2,2,2,2,2,1])
>>
>>      (5.0, 4.8302595041055999e-05)
>>
>>>> I would like to play with this a little bit more. Do you think it is
>>>> possible to print the p value in the Run-Over-Run Changes Details?
>>>
>>> That shouldn't be too difficult. I could implement it if you want.
>>
>> Cool.
>
> Patches are attached. I've also adjusted the default p to a more sensible
> value (0.9).
>
>>> Alternatively, you can just print them to the console, which only takes a
>>> one-line change.
>>
>> I think having them in the web interface is more convenient, especially if
>> we want to look only at the changes that are reported as changing
>> performance.
>>
>>>> Also, it may make sense to investigate this on another machine. I use 5
>>>> separate but identical machines. It may be interesting to see whether runs
>>>> on the same machine are more reliable and could get away with a lower
>>>> confidence level. Did you do any experiments? Maybe with a higher run
>>>> count, e.g. 20?
>>>
>>> For now I don't have another machine to test on; my desktop is far too
>>> noisy. I've been trying to set up an ARM board.
>>
>> I have one machine identical to the automatic builders, but not
>> connected to any builder. If you send me your SSH public key, I can give
>> you access. We could use this to experiment with a higher number of runs,
>> binding benchmarks to fixed CPUs, using 'perf stat' to count performance
>> events, ..
>>
>> I am very interested in making these results as robust as possible. If
>> necessary, I would rather choose a smaller set of benchmarks and run them
>> more often than keep dealing with this ongoing noise.
>>
>
> Great. My pubkey is
> ecdsa-sha2-nistp256
> AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBHHkjnXnGZBfTgZnFfxwobOb0so8kEMm+aKqjOEL2ejKrhwuDIb4HV57g1XHF+o+MOIN1W+2dQU2jV/I6rD/xRA=
>
> I think hand-picking a smaller set of benchmarks is the best idea: most of
> the tests we are running are not useful as compiler benchmarks, and a
> smaller set would make running several iterations feasible on ARM boards.
> As for 'perf stat', I'm concerned about the portability of LNT, as perf is
> exclusive to Linux.
>
> On 02/05/14 11:10, Tobias Grosser wrote:
> > This kind of footer at best just does not make a lot of sense on a public
> > mailing list, but it may also cause actual problems when submitting
> > patches. I don't really understand how it works at ARM, but James Molloy,
> > Kristof Beyls, and Bradley Smith apparently managed to drop it for several
> > of their mails. If it is trivial, maybe you could consider doing so as
> > well.
>
> I've contacted IT to get it done. Thanks for notifying me.
>
> Regards,
> Yi Kong
>
> -- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.
>
> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2557590
> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No:  2548782
> <0001-Use-Mann-Whitney-U-test-to-identify-changes.patch> <0002-Show-p-value-in-report.patch>



-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Use-Mann-Whitney-U-test-to-identify-changes.patch
Type: text/x-patch
Size: 10388 bytes
Desc: 0001-Use-Mann-Whitney-U-test-to-identify-changes.patch
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140503/ec8b417a/attachment.bin>

