Get rid of the 'max' and 'min' conditions

Wed Aug 26 01:02:16 PDT 2015

On 08/26/2015 07:34 AM, Roman Gareev wrote:
> Hi Tobias,
>
> I’ve attached a patch, which could be used to get rid of Max
> conditions in ScheduleTreeOptimizer::optimizeBand. What do you think
> about it?

Nice. That seems to work and even gives me some speedups.

Without patch:
============================================================
$time polly-clang linear-algebra/kernels/gemm/gemm.c -O3 -mllvm -polly -mllvm -debug-only=polly-ast -I utilities/ -mllvm -polly-vectorizer=polly -DPOLYBENCH_TIME -march=native utilities/polybench.c

     // 1st level tiling - Tiles
     #pragma known-parallel
     for (int c0 = 0; c0 <= floord(ni - 1, 32); c0 += 1)
       for (int c1 = 0; c1 <= floord(nj - 1, 32); c1 += 1)
         for (int c2 = 0; c2 <= floord(nk - 1, 32); c2 += 1) {
           // 1st level tiling - Points
           for (int c3 = 0; c3 <= min(31, ni - 32 * c0 - 1); c3 += 1)
             for (int c4 = 0; c4 <= min(7, -8 * c1 + (nj - 1) / 4); c4 += 1)
               for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
                 #pragma simd
                 for (int c6 = 0; c6 <= min(3, nj - 32 * c1 - 4 * c4 - 1); c6 += 1)
                   Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
         }

else
     {  /* original code */ }

real	0m1.715s
user	0m1.684s
sys	0m0.032s
$./a.out
1.305441
$./a.out
1.278323
$./a.out
1.217736
============================================================

With patch:
============================================================
$time polly-clang linear-algebra/kernels/gemm/gemm.c -O3 -mllvm -polly -mllvm -debug-only=polly-ast -I utilities/ -mllvm -polly-vectorizer=polly -DPOLYBENCH_TIME -march=native utilities/polybench.c

     // 1st level tiling - Tiles
     #pragma known-parallel
     for (int c0 = 0; c0 <= floord(ni - 1, 32); c0 += 1)
       for (int c1 = 0; c1 <= floord(nj - 1, 32); c1 += 1)
         for (int c2 = 0; c2 <= floord(nk - 1, 32); c2 += 1) {
           // 1st level tiling - Points
           for (int c3 = 0; c3 <= min(31, ni - 32 * c0 - 1); c3 += 1)
             for (int c4 = 0; c4 <= min(7, -8 * c1 + (nj - 1) / 4); c4 += 1)
               for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1) {
                 if (nj >= 32 * c1 + 4 * c4 + 4) {
                   #pragma simd
                   for (int c6 = 0; c6 <= 3; c6 += 1)
                     Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
                 } else
                   #pragma simd
                   for (int c6 = 0; c6 < nj - 32 * c1 - 4 * c4; c6 += 1)
                     Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
               }
         }

else
     {  /* original code */ }

real	0m1.710s
user	0m1.666s
sys	0m0.045s
$./a.out
0.840746
$./a.out
0.918923
$./a.out
0.921482
============================================================

Now, looking at this code, I wonder if it would not make sense to do the separation
higher up in the tree. Specifically, do the isolation for the 1st level tiles. This should also
remove the min/max condition, but will add the condition further up in the tree. This could
be faster by itself and would also benefit register tiling and similar transformations.
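
To make the idea concrete, here is a minimal one-dimensional sketch in C. It is not Polly output: the function names and the helper min_int are hypothetical, and the real gemm nest is three-dimensional. The first function keeps a min() bound in the point loop, as in the ASTs above; the second isolates the full tiles at the tile level, so the hot point loop has a constant trip count and only the remainder is handled separately.

```c
/* Hypothetical helper standing in for the min() that appears in the AST. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled sum with a min() bound in the point loop (tile size 32). */
long sum_tiled_min(const int *a, int n) {
  long s = 0;
  for (int c0 = 0; 32 * c0 < n; c0 += 1)
    for (int c3 = 0; c3 <= min_int(31, n - 32 * c0 - 1); c3 += 1)
      s += a[32 * c0 + c3];
  return s;
}

/* Same iteration space, with the full tiles isolated at the tile level. */
long sum_tiled_isolated(const int *a, int n) {
  long s = 0;
  int full = n > 0 ? n / 32 : 0;          /* number of complete tiles */
  for (int c0 = 0; c0 < full; c0 += 1)
    for (int c3 = 0; c3 <= 31; c3 += 1)   /* constant trip count, no min() */
      s += a[32 * c0 + c3];
  for (int i = 32 * full; i < n; i += 1)  /* at most one partial tile */
    s += a[i];
  return s;
}
```

Both functions visit exactly the same elements; the difference is only where the boundary handling sits. Whether the isolated form is actually faster for gemm would have to be measured.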

Now, this suggestion comes just from looking at the code. Some LLVM transformations
(e.g. loop unswitching, or dead-code removal after inlining) may in some cases eliminate
this condition. I tried your patch with -fno-inline and, surprisingly to me, it produces
even faster code:

$./a.out
0.469482
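
For reference, the loop unswitching mentioned above can be pictured with a small hand-written C sketch. This is an illustration only, not a claim about what LLVM does to the Polly output: the function names are hypothetical, "base" stands in for 32 * c1 + 4 * c4, and the statement body is replaced by a counter. The point is that the condition introduced by the patch does not depend on c5, so it can in principle be tested once outside the c5 loop.

```c
/* Condition tested inside the c5 loop, mirroring the shape of the patched AST. */
long count_branch_inside(int nj, int nk, int base) {
  long n = 0;
  for (int c5 = 0; c5 < nk; c5 += 1) {
    if (nj >= base + 4) {
      for (int c6 = 0; c6 <= 3; c6 += 1)         /* full vector of 4 */
        n += 1;
    } else {
      for (int c6 = 0; c6 < nj - base; c6 += 1)  /* remainder iterations */
        n += 1;
    }
  }
  return n;
}

/* Hand-unswitched version: the c5-invariant test is hoisted out of the loop. */
long count_branch_hoisted(int nj, int nk, int base) {
  long n = 0;
  if (nj >= base + 4) {
    for (int c5 = 0; c5 < nk; c5 += 1)
      for (int c6 = 0; c6 <= 3; c6 += 1)
        n += 1;
  } else {
    for (int c5 = 0; c5 < nk; c5 += 1)
      for (int c6 = 0; c6 < nj - base; c6 += 1)
        n += 1;
  }
  return n;
}
```

Both versions execute the same number of statement instances; the hoisted one simply evaluates the branch once per (c1, c4) pair instead of once per c5 iteration.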

Best,
Tobias