[PATCH] D46735: [Test-Suite] Added Box Blur And Sobel Edge Detection

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 11 16:50:08 PDT 2018


Meinersbur added a comment.

In https://reviews.llvm.org/D46735#1096128, @jdoerfert wrote:

> First, let me keep the record straight:
>
> - Only SingleSource/Benchmarks/Polybench profits from the (mostly tiling) transformations applied by Polly.
> - There are various reasons why other benchmarks are "not optimizable" by Polly but only a fraction is caused by manual "pre-optimizations" (except the input language choice obviously).
> - Adding simple "Polly-optimizable" benchmarks is all good and well (as it makes for nicer evaluation sections in future papers...), but I would argue it is much more interesting to investigate if the existing benchmarks could be optimized and why they currently are not.


I agree that the primary goal is to study existing sources, understand why they are not optimized, and improve the optimizer so that those reasons are no longer obstacles.
The problem is that this is not feasible for all sources. For instance, the array-of-pointers-to-arrays style (which @proton unfortunately also used here; I call these "jagged arrays", although they are not necessarily jagged) cannot be optimized because the pointers may overlap. Either the frontend language has to guarantee that this never happens, or each pair of pointers has to be checked for aliasing. The former is not the case in C++ (without extensions such as the `restrict` keyword), and the latter incurs a super-constant overhead. Unfortunately, a great many benchmarks use jagged arrays (a sketch contrasting the two layouts follows after the third point below).
Second, even if it is possible to remove an optimization obstacle, I would like to know whether it is worth it.
Third, researchers in the field of polyhedral optimization work on improving the optimizer algorithms and ignore language-level details (e.g. whether jagged, row-major, or column-major arrays are used).
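
To illustrate the difference, here is a minimal sketch (my own code, not from the patch; the function names are made up):

```
#include <cstdlib>

// "Jagged" layout: every row is a separately allocated block. Without
// pairwise alias checks the compiler cannot prove that rows do not overlap,
// which blocks reordering transformations such as tiling.
int **make_jagged(int height, int width) {
  int **img = (int **)std::malloc(height * sizeof(int *));
  for (int i = 0; i < height; i++)
    img[i] = (int *)std::malloc(width * sizeof(int));
  return img;
}

// Contiguous layout: one allocation, indexed as img[i * width + j].
// Distinct (i, j) pairs provably access distinct elements, which is the
// property a polyhedral optimizer needs.
int *make_contiguous(int height, int width) {
  return (int *)std::malloc((size_t)height * width * sizeof(int));
}
```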

> Regarding these (and other new) benchmarks:
> 
> - Please describe why/how the codes differ from existing ones we have (e.g., Halide/blur). Polybench already contains various kernels including many almost identical ones.

The Halide benchmarks are special in many regards; for instance, they work only on x86.

> - Please describe why/how the magic constants (aka sizes) are chosen. "#define windows 10" is not necessarily helpful.

At some point a problem size has to be arbitrarily defined. What kind of explanation do you expect?

> - I fail to see how Polly is going to optimize this code (in a way that is general enough for real codes). So my question is: Did you choose a linked data structure on purpose or do you actually want to have a multi-dimensional array?

+1

In https://reviews.llvm.org/D46735#1096497, @hfinkel wrote:

> In https://reviews.llvm.org/D46735#1096464, @proton wrote:
>
> > In https://reviews.llvm.org/D46735#1095482, @MatzeB wrote:
> >
> > > Are you writing these from scratch? If so, I'd like to make some suggestions:
> > >
> > > - Please aim for a runtime of 0.5-1 second on typical hardware. Shorter benchmarks tend to be hard to time correctly, and running longer doesn't increase precision in our experience.
> >
> >
> > But the fraction of noise will be larger for shorter runtimes. A longer runtime will help us when we want to see the performance improvement after applying optimizations.
>
>
> The question is: At what point is a performance change interesting? If we posit that a performance change is interesting at the ~1% level, and we can distinguish application-time running-time differences at around 0.01s, then running for 1-2s is sufficient for tracking. As the test suite gets larger, we have an overarching goal of keeping the overall execution time in check (in part, so we can run it more often). It's often better to collect statistics over multiple runs, compared to a single longer run, regardless.
>
> Also, if there are particular kernels you're trying to benchmark it's better to time them separately. We have a nice infrastructure to do that now, making use of the Google benchmark library, in the MicroBenchmarks subdirectory.


IMHO we also want to see effects on working-set sizes larger than the last-level cache. A micro-benchmark is great for small working sets, but I am not sure whether Google's benchmark library works well with larger ones.

Some optimizations (e.g. cache locality, parallelization) can cut the execution time by orders of magnitude. With gemm, I have seen single-thread speed-ups of 34x; with parallelization, it will be even more. If the execution time without optimization is one second, it will be too short with optimization: a 34x speed-up turns one second into roughly 0.03s, which is close to the timing resolution mentioned above. This gets worse with parallelization and accelerator offloading, which add invocation overheads.

It's great to have a discussion on what such benchmarks should look like.

Instead of one size fits all, should we have multiple problem sizes? There is already `SMALL_DATASET`, which is smaller than the default, but what about larger ones? SPEC has "test" (should execute everything at least once; great for checking correctness), "train" (for PGO training), and "ref" (the scored benchmark input; in CPU 2017 it runs up to 2 hours). Polybench has `MINI_DATASET` to `EXTRALARGE_DATASET`, which are defined by working-set size instead of purpose or runtime.
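
A hypothetical sketch of how such size classes could be selected, in the spirit of Polybench's dataset macros (the concrete sizes here are placeholders, not a proposal):

```
#if defined(MINI_DATASET)
#define HEIGHT 64
#define WIDTH 64
#elif defined(SMALL_DATASET)
#define HEIGHT 512
#define WIDTH 512
#elif defined(EXTRALARGE_DATASET)
#define HEIGHT 8192
#define WIDTH 8192
#else /* default, "ref"-like size */
#define HEIGHT 2048
#define WIDTH 2048
#endif
```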

Should we embed the kernels in a framework such as Google's, provided that it handles long runtimes and verifies correctness of the result?
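
For reference, a minimal sketch of what embedding a kernel into the Google benchmark library (as used under MicroBenchmarks) could look like; blurKernel() and the size arguments are my placeholders, not the proposed benchmark:

```
#include <benchmark/benchmark.h>
#include <vector>

// 3x3 box blur over a contiguous image; the border is skipped for brevity.
static void blurKernel(std::vector<int> &dst, const std::vector<int> &src,
                       int height, int width) {
  for (int i = 1; i < height - 1; i++)
    for (int j = 1; j < width - 1; j++) {
      int sum = 0;
      for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
          sum += src[(i + di) * width + (j + dj)];
      dst[i * width + j] = sum / 9;
    }
}

static void BM_Blur(benchmark::State &state) {
  const int n = static_cast<int>(state.range(0));
  std::vector<int> src(n * n, 1), dst(n * n, 0);
  for (auto _ : state) {
    blurKernel(dst, src, n, n);
    benchmark::DoNotOptimize(dst.data());
  }
}
// Sweep working-set sizes, including ones larger than a typical last-level cache.
BENCHMARK(BM_Blur)->Arg(256)->Arg(1024)->Arg(4096);
BENCHMARK_MAIN();
```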



================
Comment at: SingleSource/Benchmarks/ImageProcessing/blur/blur.cpp:62
+    for (int i=0; i<height; i++){
+        img2dblur[i] = (int*)malloc(width*sizeof(int));
+        for (int j=0; j<width; j++) {
----------------
Please use C-style arrays (int[WIDTH][HEIGHT]) or C99 variable-length arrays (VLAs) instead of an array of pointers ("jagged array").

It is difficult for Polly, or any other optimizer, to ensure that none of these pointers alias each other.
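
For illustration, a contiguous alternative could look like the following sketch (it relies on the GNU/Clang VLA extension in C++, or on the sizes being compile-time constants; it is not part of the patch):

```
#include <cstdlib>

// Assumes height and width are already known here.
void exampleAllocation(int height, int width) {
  // One contiguous allocation for the whole image; rows cannot alias.
  int (*img2dblur)[width] =
      (int (*)[width])std::malloc(sizeof(int[height][width]));
  for (int i = 0; i < height; i++)
    for (int j = 0; j < width; j++)
      img2dblur[i][j] = 0;
  // ... use img2dblur[i][j] ...
  std::free(img2dblur);  // a single free releases the whole image
}
```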


Repository:
  rT test-suite

https://reviews.llvm.org/D46735




