[PATCH] D46735: [Test-Suite] Added Box Blur And Sobel Edge Detection

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 11 22:33:23 PDT 2018


hfinkel added a comment.

In https://reviews.llvm.org/D46735#1096808, @Meinersbur wrote:

> In https://reviews.llvm.org/D46735#1096782, @hfinkel wrote:
>
> > I don't see why it wouldn't work on longer-running kernels.
>
>
> It might run the kernel just once. That is, we only get results from a cold cache.


I don't believe that it will run just once. There's a minimum number of iterations (in part, as I understand it, because it needs to get an estimate of the variance).
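
For reference, here is a minimal sketch of the kind of harness I mean; the box-blur kernel below is an illustrative stand-in, not the code from this patch. The library keeps re-running the timed loop, increasing the iteration count until its timing estimate stabilizes, so a single cold-cache execution is not what gets reported:

  #include <benchmark/benchmark.h>
  #include <vector>

  // Illustrative stand-in for the kernel under test.
  static void boxBlur3x3(const std::vector<float> &In,
                         std::vector<float> &Out, int W, int H) {
    for (int y = 1; y < H - 1; ++y)
      for (int x = 1; x < W - 1; ++x) {
        float Sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
          for (int dx = -1; dx <= 1; ++dx)
            Sum += In[(y + dy) * W + (x + dx)];
        Out[y * W + x] = Sum / 9.0f;
      }
  }

  static void BM_BoxBlur(benchmark::State &State) {
    const int W = 1024, H = 1024;
    std::vector<float> In(W * H, 1.0f), Out(W * H, 0.0f);
    for (auto _ : State) {
      // The library decides how many times this body runs.
      boxBlur3x3(In, Out, W, H);
      benchmark::DoNotOptimize(Out.data());
      benchmark::ClobberMemory();
    }
  }
  BENCHMARK(BM_BoxBlur);
  BENCHMARK_MAIN();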

> 
> 
>> Nevertheless, modern machines have bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.
> 
> High-complexity algorithms such as naive matrix determinant may require more time for problems larger than the last-level cache.

Granted.
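
As a back-of-the-envelope check (assuming a 32 MB last-level cache and double-precision data): an n x n working set exceeds the cache once 8 * n^2 > 32 MB, i.e., n > 2048, and an O(n^3) kernel like gemm at n = 2048 performs about 2 * n^3 ≈ 1.7e10 floating-point operations, which is already several seconds at a few GFLOP/s on a single thread.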

> 
> 
>>> Some optimizations (e.g. cache-locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator-offloading, which add invocation overheads.
>> 
>> There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).
>> 
>> Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts delivering speedups like that, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then until then, the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'll need to change the tests. One of the reasons that I like the Google Benchmark library is that it dynamically adjusts the number of iterations, essentially making this change for us as needed.
> 
> If someone enables auto-parallelization, they probably should leave (at least some) cores available.

This doesn't just come up in that context. There are plenty of codes that use OpenMP or some other threading-enabled library.

> For continuous integration, correctness is much more important, so such bots would run using a safe problem size (safe in the sense that a missed optimization still executes in reasonable time). For dedicated benchmarking, we should select a larger problem size.
>  That is, the default configuration can be "safe", while larger problem sizes (including simply running more iterations, as Google Benchmark does) are available for other situations.

I'm not talking about CI for correctness, although we should obviously do that too, but about doing regular performance monitoring.
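
If we do end up wanting both, a compile-time knob is the usual pattern in the test suite; this is only a sketch, and the macro and constant names here are illustrative rather than anything this patch defines:

  // Safe default for correctness-focused CI; opt-in larger size for
  // dedicated benchmarking. Names are illustrative.
  #ifdef LARGE_PROBLEM_SIZE
  #define IMAGE_DIM 4096
  #else
  #define IMAGE_DIM 512
  #endif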


Repository:
  rT test-suite

https://reviews.llvm.org/D46735




