[www] r176209 - Add LNT statistics project

David Blaikie dblaikie at gmail.com
Thu Feb 28 10:53:48 PST 2013


On Thu, Feb 28, 2013 at 9:53 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 28 February 2013 17:05, David Blaikie <dblaikie at gmail.com> wrote:
>>
>> To be clear, the intention is not to rewrite LNT but the test-suite
>> beneath it. It's a complex hodge-podge of shell, Make, C, awk, etc.,
>> which is difficult to maintain or add new features to. LNT was built
>> with the intention that the test-suite execution could be rewritten
>> beneath it.
>
>
> Oh, that. Well, yes, it's a bit hacky, but I haven't delved deep enough to
> know much.
>
>
>> Hardly my forte, though I don't immediately see why changing the
>> number of cycles would make regression analysis invalid.
>
>
> Because I can only know whether a benchmark is regressing if it's static
> and its run-time changes, or if there is a clear output per time unit.

I'm really confused by what you're saying/getting at.

> But output-per-time-unit benchmarks (like Linpack or Dhrystone) are measures
> of raw throughput, not compiler performance. They're good for comparing two
> different architectures, but not so good for spotting regressions between
> revisions. All of them require some sort of fine tuning and heuristics to
> determine the start and stop steps.

Are you suggesting that these are acceptable tests but that my
proposal to add "execute multiple times until desired confidence is
achieved" on top of them would be problematic? Or are you suggesting
that these are examples of bad ideas that are similar/equivalent to my
idea?

I don't think they're equivalent, and I think those throughput-style
benchmarks are a bad idea as a test suite anyway.

Benchmarks should be deterministic in their behavior (including execution
time/space), not driven by clock/wall time (measuring the execution time of
an annealing algorithm that spends N seconds annealing is useless, for
example, because it's always going to take N seconds).

My proposal is not to time N seconds of runs in aggregate (where N is a
constant), because that would just always produce "N seconds". My thinking
was to run the test multiple times until we could get a sufficiently
accurate confidence interval. The result of the test would then be, instead
of "this test took 3 seconds to run", something like "this test took 3
seconds +/- 0.25 of a second" (of course we could keep all the actual run
times in the report & summarize them in different ways for different
clients).
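
To make that more concrete, here's a rough Python sketch of the kind of
loop I have in mind (not actual LNT or test-suite code; the 1.96 z-value,
the 2% relative-width target and the run limits are just illustrative
placeholders):

    import statistics
    import subprocess
    import time

    def timed_run(cmd):
        # One timed execution of the benchmark command, in wall-clock seconds.
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        return time.perf_counter() - start

    def measure(cmd, rel_width=0.02, min_runs=3, max_runs=30):
        # Rerun until the ~95% confidence interval on the mean is tight enough.
        samples = []
        while True:
            samples.append(timed_run(cmd))
            if len(samples) < min_runs:
                continue
            mean = statistics.mean(samples)
            # Normal-approximation margin: 1.96 * standard error of the mean.
            margin = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
            if margin <= rel_width * mean or len(samples) >= max_runs:
                return mean, margin, samples

A client could then render measure(["./bench"]) as "3s +/- 0.25s", or dig
into the raw samples if it wants a different summary.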

It could even go a step further and use the previous runs to decide how
much confidence was required for the current run (e.g. if yesterday's run
was 3+/-0.25 and after fewer runs today the test reached 5+/-1, we could
stop running even though the confidence interval was wider, because we're
already way outside the bounds based on yesterday's performance). This is
probably overkill, but it's just an example.
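
Purely as a hypothetical sketch of that shortcut, building on the measure()
sketch above (the baseline values are assumed to come from yesterday's
report; nothing like this exists in LNT today):

    def clearly_slower(mean, margin, baseline_mean, baseline_margin):
        # Today's lower bound is already above yesterday's upper bound
        # (e.g. 5 +/- 1 vs. 3 +/- 0.25), so more runs won't change the
        # "this regressed" answer and we can stop early.
        return mean - margin > baseline_mean + baseline_margin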

> For instance, Linpack keeps trying bigger matrices until one run takes more
> than 10s, which means that from run to run you can have, say 4 or 5 cycles.
> It also automatically selects the initial run's size based on some
> heuristics, so if something changes in the platform (space available,
> memory, etc), the heuristics could change the initial run, and you wouldn't
> be able to compare any run after that with the runs before the change.
>
>
>
>> If the OS is differently loaded each time you run the tests you're
>> going to have a hard time doing regression analysis anyway, aren't
>> you?
>
>
> Yes, but what I was saying was still related to the initial run / number of
> runs. If your last usual run is, say, a 2048-byte matrix that usually takes
> 10.05s, and one day it takes 9.95s, you'll end up with another run. Depending
> on the heuristics (Livermore Loops had a particularly troubling one), you
> could change the results completely from run to run.

I'm still really confused by what you're talking about. My intention is not
to time N seconds' worth of runs - that would just always produce a number
around N seconds. (Running N seconds' worth of runs but reporting the
seconds-per-run value with a confidence interval is different: then you'll
get varying amounts of confidence, depending on how many runs you can fit
into N seconds, but you'll still get the actual seconds-per-run number going
up/down based on variations in the compiler's performance.)
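
Again just as a sketch (reusing timed_run() from the earlier snippet;
budget_seconds is an illustrative parameter), that second variant might
look like:

    def measure_for_budget(cmd, budget_seconds=10.0):
        # Spend roughly the given wall-clock budget on runs, then report the
        # per-run mean with a margin; the budget controls how much confidence
        # you get, while the reported number still tracks compiler performance.
        samples = []
        deadline = time.monotonic() + budget_seconds
        while time.monotonic() < deadline or len(samples) < 2:
            samples.append(timed_run(cmd))
        mean = statistics.mean(samples)
        margin = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        return mean, margin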

- David


