[PATCH] D38417: [test-suite] Adding HACCKernels app

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Sep 30 13:29:02 PDT 2017


hfinkel added a comment.

In https://reviews.llvm.org/D38417#885236, @kristof.beyls wrote:

> In https://reviews.llvm.org/D38417#885226, @hfinkel wrote:
>
> > In https://reviews.llvm.org/D38417#885195, @kristof.beyls wrote:
> >
> > > Hi Brian,
> > >
> > > Thanks for working on this!
> > >
> > > On the execution time of about 13 seconds on a fast processor: would it be possible to adapt the input to reduce the running time to about 1 second (or less), and not lose the characteristic behaviour of this benchmark?
> > >  I'd expect that adapting the input for an order-of-magnitude shorter running time will not make the measured execution time noisier; and it would help in making sure the test-suite keeps running as quickly as possible.
> >
> >
> > Kristof, just FYI, reducing the runtime is easy to do; however, this will get significantly faster once we start actually vectorizing the hot loops (at least 4x if you have an architecture with <4 x float>). I looked at this a few weeks ago, and as I recall, we currently can't if-convert reductions (which prevents vectorization). We plan to work on improving this in the near future. Does this affect your opinion at all?
>
>
> I mainly keep focussing on the run-time of benchmarks newly added to the test-suite because I do think it is already a problem that the test-suite takes too long to run.


I'm not sure that it runs too long in the abstract, but we certainly waste CPU time by having programs that run for longer than necessary.

> For the benchmarks in the test-suite, this problem is far worse than for the programs that are only run to check correctness,

Agreed.

> as we typically:
>  a) have to run the test-suite in benchmark-mode multiple times to figure out whether a performance change was significant or just noise.
>  b) have to run programs sequentially in benchmark-mode even on multi-core systems, to reduce noise, whereas for correctness testing we can use all the cores on a system.

At the risk of going too far afield, this has not been universally my experience. When checking for performance, on build servers with ~50 hardware threads, I often run the test suite with a level of parallelism matching the number of hardware threads. I'd run the test suite ~15 times and then use ministat (https://github.com/codahale/ministat) to compare the ~15 timings from each test to a previous run. I've found these numbers to be better than quiet-server serial runs for two reasons: first, even a quiet server is noisy and we need to run the tests multiple times anyway (unless they really run for a long time), and second, the cores are in a more production-like state (where, for example, multiple hardware threads are being used and there's contention for the cache). I/O-related times are obviously more variable this way, but I've generally found that tests that run for a second (and as low as 0.2s on some systems) are fine for this kind of configuration. Also, so long as you have more than 30 hardware threads (or something like that, depending on the architecture), it's actually faster this way than a single serial run. Moreover, ministat gives error bars :-)

In case you're curious, there's also a Python version of ministat (https://github.com/lebinh/ministat).
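
For anyone curious about what that comparison looks like in practice, here's a minimal sketch in the spirit of what ministat reports, assuming two plain-text files with one timing per line. The file layout, the 95% threshold, and the use of SciPy's Welch t-test (ministat itself does a Student's t-test) are all just illustrative choices on my part, not anything the test-suite or ministat prescribes:

# Rough sketch, not the test-suite's own tooling: compare two sets of
# benchmark timings (one number per line in each file), similar in spirit
# to what ministat reports. Uses SciPy's Welch t-test; ministat uses a
# Student's t-test, so the numbers will be close but not identical.
import statistics
import sys

from scipy import stats


def load_times(path):
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def compare(baseline_path, candidate_path, alpha=0.05):
    base = load_times(baseline_path)
    cand = load_times(candidate_path)
    for name, data in (("baseline", base), ("candidate", cand)):
        print(f"{name}: n={len(data)}  mean={statistics.mean(data):.4f}s  "
              f"stdev={statistics.stdev(data):.4f}s")
    # Two-sided test: is the difference in means larger than the noise?
    result = stats.ttest_ind(base, cand, equal_var=False)
    diff = statistics.mean(cand) - statistics.mean(base)
    if result.pvalue < alpha:
        print(f"difference of {diff:+.4f}s is significant at {1 - alpha:.0%}")
    else:
        print(f"no significant difference at {1 - alpha:.0%} "
              f"(p = {result.pvalue:.2f})")


if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])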

> This is annoying when evaluating patches, and also makes the response time of performance-tracking bots annoyingly long.
>  We have a few hundred benchmarks in the test-suite currently, but probably need a lot more to get more coverage (which is why I think it's awesome that these DOE benchmarks are being added!).

We definitely need more coverage for performance. We also need *a lot* more coverage for correctness (the fact that I catch far more miscompiles from self-hosting than from the test suite is a problem).

> Therefore, I think it's important to not lose focus on trying to keep benchmarks short-running as they are being added.
> 
> There's probably a lot of bikeshedding that could be done on what an acceptable run-time is for a newly-added benchmark and what is too long.
>  My experience on a few X86 and Arm platforms is that if you use linux perf to measure execution time, as soon as the program runs for 0.01 seconds, just running the program for longer doesn't reduce noise further.
>  Therefore, my limited experiments suggest to me that an ideal execution time for the benchmark programs in the test-suite would be just over 0.01 seconds - for the platforms I've tested on.
>  As I said, there's probably lots of arguing that could be done about the execution time we should aim for when adding a new benchmark. So far, I've followed a personal rule-of-thumb that up to 1 second is acceptable, but when it's more, there should be a reason why a longer execution time is needed.

This is also close to my experience; aiming for about a second, maybe two, makes sense.
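
(As an aside, the noise-versus-run-time threshold is easy to re-check on a particular machine with a small script along the lines of the sketch below. Everything in it is made up for illustration: the workload is a pure-Python loop rather than a compiled benchmark measured under linux perf, and the sizes and run counts are arbitrary, so only the trend in the relative spread is meaningful, not the absolute numbers.)

# Sketch: see how run-to-run relative noise changes with workload size on a
# given machine. The workload is an arbitrary stand-in, not a real benchmark.
import statistics
import time


def workload(n):
    total = 0.0
    for i in range(n):
        total += (i % 7) * 0.5
    return total


def measure(n, runs=15):
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        workload(n)
        times.append(time.perf_counter() - start)
    mean = statistics.mean(times)
    return mean, statistics.stdev(times) / mean


for n in (10**4, 10**5, 10**6):
    mean, rel = measure(n)
    print(f"n={n:>8}: mean={mean:.4f}s  relative stdev={rel:.1%}")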

> Which is why I reacted above.
>  As I don't think my personal 1-second rule-of-thumb is any more or less defensible than rules that set the threshold a bit higher or lower, I don't feel too strongly against this benchmark going in as is.
>  I just felt I had to ask whether there was a good reason to make this benchmark run for this long.
>  Ultimately, vectorizing the hot loops in this benchmark won't change my reasoning above.
> 
> In summary, I hope my reasoning above makes sense, and I won't oppose if you think there's a good reason to not shorten the running time of this benchmark as is.

Okay. I propose that we shorten the current running time to around 1.5 seconds. That should leave sufficient running time once we start vectorizing the loops: even at the at-least-4x speedup mentioned above, we'd still be around 0.4 seconds, above the ~0.2s floor from my earlier experience.

> Thanks!
> 
> Kristof




https://reviews.llvm.org/D38417




