[www] r176209 - Add LNT statistics project

Renato Golin renato.golin at linaro.org
Thu Feb 28 11:08:19 PST 2013


On 28 February 2013 18:53, David Blaikie <dblaikie at gmail.com> wrote:

> I'm really confused by what you're saying/getting at.
>

We both are... ;)


Are you suggesting that these are acceptable tests but that my
> proposal to add "execute multiple times until desired confidence is
> achieved" on top of them would be problematic? Or are you suggesting
> that these are examples of bad ideas that are similar/equivalent to my
> idea?
>

Linpack and Dhrystone are bad examples of compiler tests (IMHO), since they
were developed as platform tests (see below).


I don't think they're equivalent & I think they're a bad idea as a test
> suite.
>

I agree the tests we currently run could be much better executed.


Execution for benchmarks should be deterministic in their behavior
> (including execution time/space) not based on clock/wall time
> (measuring execution time of an annealing algorithm that spends N
> seconds annealing is useless, for example... because it's always going
> to take N seconds).
>

Yes, you got me wrong; that would be silly. ;)


3 seconds +/- 0.25 of a second" (or something like that - of course we
> could keep all the actual run times in the report & summarize it in
> different ways for different clients).
>

I'd prefer to output only raw numbers and leave the presentation / crunching
logic to the infrastructure.
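
To make that split concrete, here is a rough sketch (hypothetical names, not
actual LNT code) of what I have in mind: the harness records raw samples only,
and all of the summarising lives on the infrastructure side.

import json
import math
import statistics
import subprocess
import time

def run_benchmark(binary, runs=5):
    # Hypothetical harness: time each execution and keep every raw sample;
    # no summarising happens here at all.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([binary], check=True)
        samples.append(time.perf_counter() - start)
    return {"name": binary, "samples": samples}

def summarise(report):
    # Infrastructure-side crunching: mean and a rough 95% interval
    # (normal approximation), computed from the raw samples only.
    s = report["samples"]
    mean = statistics.mean(s)
    half = 1.96 * statistics.stdev(s) / math.sqrt(len(s)) if len(s) > 1 else 0.0
    return "%s: %.3fs +/- %.3fs" % (report["name"], mean, half)

if __name__ == "__main__":
    report = run_benchmark("./some-benchmark")  # hypothetical binary name
    print(json.dumps(report))                   # raw numbers go to the server
    print(summarise(report))                    # presentation happens elsewhere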


It could even go a step further and use the previous runs to decide
> how much confidence was required for the current run (eg: if
> yesterday's run was 3+/-0.25 and after a fewer runs today the test
> reached 5+/-1 we could stop running even though the confidence
> interval was wider - because we're already way outside the bounds
> based on yesterday's performance). This is probably overkill but just
> an example.
>

Not all tests have a clear notion of confidence. Changing benchmarks to add
one requires target-specific behaviour (see Livermore Loops), and that's not a
good way to spot compiler regressions (IMHO), though it does still make for a
valid platform benchmark.

My main point is the difference between platform benchmarks
(CPU + memory + OS + software + etc.) and pure compiler optimization /
code quality benchmarks.

The first category allows heuristics and dynamic testing (confidence checks,
etc.); the second doesn't.
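
As a rough, hypothetical sketch of the kind of dynamic testing the first
category can afford (none of this is LNT or test-suite code): a platform
benchmark may legitimately keep re-running until its interval is tight enough,
or stop early once it is clearly outside yesterday's bounds, much as you
describe.

import math
import statistics

def adaptive_run(measure, max_runs=20, target_half_width=0.25, previous=None):
    # Hypothetical confidence-driven repetition for a *platform* benchmark;
    # 'previous' is yesterday's (mean, half_width), if we have one.
    samples = []
    for _ in range(max_runs):
        samples.append(measure())  # one timed execution
        if len(samples) < 3:
            continue
        mean = statistics.mean(samples)
        half = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
        if half <= target_half_width:
            break  # the interval is tight enough, stop early
        if previous is not None:
            prev_mean, prev_half = previous
            if abs(mean - prev_mean) > 2 * (half + prev_half):
                break  # already far outside yesterday's bounds, stop early
    return samples

A compiler regression test should not do any of this, for the reasons below.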

Micro-benchmarks and vectorization tests can suffer a lot from a change in a
single instruction. If you add heuristics, you can mask the problem before the
test even runs (by detecting the slowness and trying something different, for
example), and then we'd never spot the regression that will show up in user
code, where no such heuristics are applied.

Not to mention that simply increasing the number of cycles by a fixed
amount is far simpler to do on a batch of hundreds of completely different
benchmarks. :D
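
Whether that means bumping every benchmark's internal iteration count or just
running each one a few more times, the point is that it needs no per-benchmark
logic at all; a hypothetical sketch:

def scale_all(benchmarks, factor=10):
    # Hypothetical: apply the same fixed scaling to every benchmark's
    # iteration count, with no heuristics and no per-test tuning.
    return {name: iters * factor for name, iters in benchmarks.items()}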

cheers,
--renato

