[www] r176209 - Add LNT statistics project

David Blaikie dblaikie at gmail.com
Thu Feb 28 11:28:40 PST 2013


On Thu, Feb 28, 2013 at 11:08 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 28 February 2013 18:53, David Blaikie <dblaikie at gmail.com> wrote:
>>
>> I'm really confused by what you're saying/getting at.
>
>
> We both are... ;)
>
>
>> Are you suggesting that these are acceptable tests but that my
>> proposal to add "execute multiple times until desired confidence is
>> achieved" on top of them would be problematic? Or are you suggesting
>> that these are examples of bad ideas that are similar/equivalent to my
>> idea?
>
>
> Linpack and Dhrystone are bad examples of compiler tests (IMHO), since
> they're developed as platform tests. (see below)
>
>
>> I don't think they're equivalent & I think they're a bad idea as a test
>> suite.
>
>
> I agree the tests we currently run could be much better executed.
>
>
>> Benchmark execution should be deterministic in its behavior (including
>> execution time/space), not based on clock/wall time (measuring the
>> execution time of an annealing algorithm that spends N seconds annealing
>> is useless, for example... because it's always going to take N seconds).
>
>
> Yes, you got me wrong, that would be silly. ;)
>
>
>> 3 seconds +/- 0.25 of a second" (or something like that - of course we
>> could keep all the actual run times in the report & summarize it in
>> different ways for different clients).
>
>
> I'd prefer to only output raw numbers and leave presentation / crunching
> logic to the infrastructure.
>
>
>> It could even go a step further and use the previous runs to decide
>> how much confidence was required for the current run (eg: if
>> yesterday's run was 3+/-0.25 and after a fewer runs today the test
>> reached 5+/-1 we could stop running even though the confidence
>> interval was wider - because we're already way outside the bounds
>> based on yesterday's performance). This is probably overkill but just
>> an example.
>
>
> Not all tests have a clear notion of confidence. Changing benchmarks to add
> the idea of confidence requires target-specific behaviour (see Livermore
> Loops), and that's not a good way to spot compiler regressions (IMHO), but
> such a test is still a valid platform benchmark.
>
> My main point is the difference between platform benchmarks
> (CPU+MEM+OS+Software+etc) and benchmarks of compiler optimization and code
> quality.
>
> The first category allows heuristics and dynamic testing (confidence check,
> etc), the second doesn't.
>
> Things like micro-benchmarks or vectorization tests can suffer a lot from a
> change in one instruction, and if you add heuristics, you can eliminate the
> problem before even running the test (by detecting the slowness and trying
> something different, for example), so we'd never spot the regression that
> will show up in user code (where no such heuristics are applied).

I'm still confused as to which things you're talking about. My
suggestion is that we can get higher confidence in the performance of
tests by running them multiple times, and that we can reduce the cost
of that increased confidence by choosing the number of runs
dynamically. That matters given how long these runs take and how
important it is to get finer granularity with the hardware resources
we have.
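
To make "choosing the number of runs dynamically" concrete, here's a
rough Python sketch (not LNT code - run_benchmark and the thresholds
are made up for illustration): keep re-running a test until the 95%
confidence interval on its mean time is tight enough, or until we hit
a cap on the number of runs:

  import statistics
  from math import sqrt

  def timed_runs(run_benchmark, rel_width=0.05, min_runs=3, max_runs=20):
      # Two-sided 95% t critical values by degrees of freedom; fall back
      # to the normal approximation (1.96) for larger sample sizes.
      t95 = {2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57, 6: 2.45, 7: 2.36,
             8: 2.31, 9: 2.26, 10: 2.23}
      samples = [run_benchmark() for _ in range(min_runs)]
      while True:
          mean = statistics.mean(samples)
          sem = statistics.stdev(samples) / sqrt(len(samples))
          half_width = t95.get(len(samples) - 1, 1.96) * sem
          # Stop once the interval is tight enough relative to the mean,
          # or once we've spent as many runs as we're willing to pay for.
          if half_width <= rel_width * mean or len(samples) >= max_runs:
              return mean, half_width
          samples.append(run_benchmark())

The report would then carry something like "3.0s +/- 0.25s" plus the
raw samples, so the infrastructure can still do whatever presentation
or crunching it wants on the raw numbers.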

> Not to mention that simply increasing the number of cycles by a fixed amount
> is far simpler to do on a batch of hundreds of completely different
> benchmarks. :D

Right - and any test like Livermore that uses heuristics & timing to
drive its own behavior is a problem even for the simpler "run a fixed
multiple" idea. Our buildbot configuration currently does run 3 times -
but it runs the entire suite 3 times one after the other, rather than
running each test 3 times in a row - so you still won't get terribly
stable numbers because each run is cold (though, yes, that might be
more realistic - it's just that noisy results aren't much use for
spotting regressions).
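
Just to illustrate the ordering difference (again only a sketch, with a
made-up suite list and run() helper):

  # What the bots do today: the whole suite 3 times over, so every run
  # of a given test starts cold.
  for _ in range(3):
      for test in suite:
          run(test)

  # Per-test repetition: runs 2 and 3 of each test are back-to-back,
  # so they're warm and the numbers are more stable.
  for test in suite:
      for _ in range(3):
          run(test)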


