[PATCH] D46735: [Test-Suite] Added Box Blur And Sobel Edge Detection

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 11 18:53:45 PDT 2018


hfinkel added a comment.

In https://reviews.llvm.org/D46735#1096703, @Meinersbur wrote:

> In https://reviews.llvm.org/D46735#1096128, @jdoerfert wrote:
>
> > First, let me set the record straight:
> >
> > - Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
> > - There are various reasons why other benchmarks are "not optimizable" by Polly, but only a fraction of them are caused by manual "pre-optimizations" (aside from the choice of input language, obviously).
> > - Adding simple "Polly-optimizable" benchmarks is all well and good (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate whether the existing benchmarks could be optimized and why they currently are not.
>
>
> I agree that studying existing sources, understanding why they are not optimized, and improving the optimizer so that the reason is no longer an obstacle is the primary goal.
>  The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call them "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to ensure that this never happens, or each pointer has to be compared pairwise for aliasing. The former is not the case in C++ (without extensions such as the `restrict` keyword), and the latter involves a super-constant overhead. Unfortunately, very many benchmarks use jagged arrays.
>  Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
>  Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithm and ignore language-level details (e.g. whether jagged, row-major, or column-major arrays are used).
>
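
For readers skimming the thread, here is a minimal sketch of the layout distinction being drawn above; the function names and the 3-point stencil are hypothetical and not taken from the patch under review:

```cpp
// "Jagged" layout: each row is reached through a separately loaded pointer.
// The compiler cannot prove that the rows of img and out do not overlap, so a
// polyhedral optimizer must either bail out or emit pairwise runtime alias
// checks, whose number grows with the number of row pointers.
void blur_jagged(float **img, float **out, int h, int w) {
  for (int i = 1; i < h - 1; ++i)
    for (int j = 1; j < w - 1; ++j)
      out[i][j] = (img[i - 1][j] + img[i][j] + img[i + 1][j]) / 3.0f;
}

// Contiguous layout: a single allocation per image, indexed as i * w + j.
// All accesses are affine expressions over one base pointer each, so only the
// two base pointers img and out need an aliasing check (or a `restrict`-style
// annotation), which is the form a polyhedral optimizer can model.
void blur_contiguous(const float *img, float *out, int h, int w) {
  for (int i = 1; i < h - 1; ++i)
    for (int j = 1; j < w - 1; ++j)
      out[i * w + j] =
          (img[(i - 1) * w + j] + img[i * w + j] + img[(i + 1) * w + j]) / 3.0f;
}
```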
> > Regarding these (and other new) benchmarks:
> > 
> > - Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.
>
> The Halide benchmarks are special in many regards; for instance, they work only on x86.
>
> > - Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.
>
> At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?
>
> > - I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?
>
> +1
>
> In https://reviews.llvm.org/D46735#1096497, @hfinkel wrote:
>
> > In https://reviews.llvm.org/D46735#1096464, @proton wrote:
> >
> > > In https://reviews.llvm.org/D46735#1095482, @MatzeB wrote:
> > >
> > > > Are you writing these from scratch? If so, I'd like to make some suggestions:
> > > >
> > > > - Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly; running longer doesn't increase precision in our experience.
> > >
> > >
> > > But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us when we want to see the performance improvement after applying an optimization.
> >
> >
> > The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application-level running-time differences of around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part, so we can run it more often). Regardless, it's often better to collect statistics over multiple runs than over a single longer run.
> >
> > Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.
>
>
> IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure how well Google's benchmark library handles larger working sets and the longer-running kernels they imply.


I don't see why it wouldn't work on longer-running kernels. In any case, modern machines have memory bandwidths in the GB/s range, so a 1s running time is certainly long enough to move around a working set larger than your cache size.
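
To put rough numbers on that (the figures here are assumptions for illustration): at a sustained bandwidth of 10 GB/s, a one-second run can stream on the order of 10 GB of data, a few hundred times larger than a last-level cache of a few tens of MB.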

> Some optimizations (e.g. cache locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x. With parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization, especially with parallelization and accelerator offloading, which add invocation overheads.

There are two difficult issues here. First, running with multiple threads puts you in a different regime for several reasons, and often one that really needs to be tested separately (because of different bandwidth constraints, different effects of prefetching, effects from using multiple hardware threads, and so on). We don't currently have an infrastructure for testing threaded code (although we probably should).

Second, I don't think that we can have a set of problem sizes that can stay the same across 40x performance improvements. If the compiler starts delivering those, we'll need to change the test somehow. If we make the test long enough that, once 40x faster, it will have a reasonable running time, then until then the test suite will be unreasonably slow for continuous integration. I think that we need to pick problems that work reasonably now, and when the compiler improves, we'd need to change the test. One of the reasons that I like the Google benchmark library is that it dynamically adjusts the number of iterations, thus essentially changing this for us as needed.
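
For reference, a kernel in MicroBenchmarks is registered with the Google benchmark library roughly like this; the kernel and the sizes below are made up for this sketch, while `benchmark::State`, `benchmark::DoNotOptimize`, `BENCHMARK`, and `BENCHMARK_MAIN` are the library's actual interface:

```cpp
#include <vector>
#include "benchmark/benchmark.h"

// Hypothetical stand-in for a box-blur row pass.
static void BM_BlurRow(benchmark::State &state) {
  const size_t n = static_cast<size_t>(state.range(0));
  std::vector<float> in(n, 1.0f), out(n, 0.0f);
  // The library repeats this loop, picking the iteration count itself until
  // the measurement is stable, so the per-benchmark wall-clock budget stays
  // bounded even as the kernel gets faster.
  for (auto _ : state) {
    for (size_t i = 1; i + 1 < n; ++i)
      out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    benchmark::DoNotOptimize(out.data());
    benchmark::ClobberMemory();
  }
}
// Problem sizes here are arbitrary placeholders.
BENCHMARK(BM_BlurRow)->Arg(1 << 12)->Arg(1 << 20);

BENCHMARK_MAIN();
```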

> It's great to have a discussion on what such benchmarks should look like.
> 
> Instead of a one-size-fits-all problem size, should we have multiple problem sizes? There is already `SMALL_DATASET`, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once, great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hrs). Polybench has `MINI_DATASET` to `EXTRALARGE_DATASET`, which are defined by working-set size instead of by purpose or runtime.

We already have a SMALL_PROBLEM_SIZE setting. I don't think there's anything preventing us from adding other ones, although it's not clear to me how often they'd be used.
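
For concreteness, benchmarks typically consume that setting with a preprocessor guard along these lines; the dimension name and values are placeholders, and the LARGE_PROBLEM_SIZE tier is hypothetical, not something the harness defines today as far as I know:

```cpp
// Pick the problem size at compile time; SMALL_PROBLEM_SIZE is defined by the
// test-suite build when the small configuration is requested.
#if defined(SMALL_PROBLEM_SIZE)
#define IMAGE_DIM 512    /* quick, correctness-oriented run */
#elif defined(LARGE_PROBLEM_SIZE)   /* hypothetical larger tier */
#define IMAGE_DIM 8192   /* working set well past a typical LLC */
#else
#define IMAGE_DIM 2048   /* default benchmarking run */
#endif
```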

> Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?

I don't recall if it has a way to validate correctness.
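
For what it's worth, the non-micro programs in the test suite get checked by diffing their output against a stored reference output (with tolerant floating-point comparison), so a kernel can make itself verifiable simply by printing a digest of its result; a minimal sketch, with a folding scheme of my own choosing:

```cpp
#include <cstdio>

// Emit a digest of the output image on stdout so the harness can compare the
// run against a saved reference output; the plain summation here is only
// illustrative and needs a tolerance-aware comparison to absorb FP noise.
static void print_checksum(const float *out, int h, int w) {
  double sum = 0.0;
  for (int i = 0; i < h * w; ++i)
    sum += out[i];
  std::printf("checksum: %.6f\n", sum);
}
```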


Repository:
  rT test-suite

https://reviews.llvm.org/D46735




