Get rid of the 'max' and 'min' conditions
Tobias Grosser via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 26 01:02:16 PDT 2015
On 08/26/2015 07:34 AM, Roman Gareev wrote:
> Hi Tobias,
>
> I’ve attached a patch, which could be used to get rid of Max
> conditions in ScheduleTreeOptimizer::optimizeBand. What do you think
> about it?
Nice. That seems to work and even gives me some speedups.
Without patch
============================================================
$time polly-clang linear-algebra/kernels/gemm/gemm.c -O3 -mllvm -polly -mllvm -debug-only=polly-ast -I utilities/ -mllvm -polly-vectorizer=polly -DPOLYBENCH_TIME -march=native utilities/polybench.c
// 1st level tiling - Tiles
#pragma known-parallel
for (int c0 = 0; c0 <= floord(ni - 1, 32); c0 += 1)
  for (int c1 = 0; c1 <= floord(nj - 1, 32); c1 += 1)
    for (int c2 = 0; c2 <= floord(nk - 1, 32); c2 += 1) {
      // 1st level tiling - Points
      for (int c3 = 0; c3 <= min(31, ni - 32 * c0 - 1); c3 += 1)
        for (int c4 = 0; c4 <= min(7, -8 * c1 + (nj - 1) / 4); c4 += 1)
          for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
            #pragma simd
            for (int c6 = 0; c6 <= min(3, nj - 32 * c1 - 4 * c4 - 1); c6 += 1)
              Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
    }
else
  { /* original code */ }
real 0m1.715s
user 0m1.684s
sys 0m0.032s
$./a.out
1.305441
$./a.out
1.278323
$./a.out
1.217736
============================================================
With patch:
============================================================
$time polly-clang linear-algebra/kernels/gemm/gemm.c -O3 -mllvm -polly -mllvm -debug-only=polly-ast -I utilities/ -mllvm -polly-vectorizer=polly -DPOLYBENCH_TIME -march=native utilities/polybench.c
// 1st level tiling - Tiles
#pragma known-parallel
for (int c0 = 0; c0 <= floord(ni - 1, 32); c0 += 1)
  for (int c1 = 0; c1 <= floord(nj - 1, 32); c1 += 1)
    for (int c2 = 0; c2 <= floord(nk - 1, 32); c2 += 1) {
      // 1st level tiling - Points
      for (int c3 = 0; c3 <= min(31, ni - 32 * c0 - 1); c3 += 1)
        for (int c4 = 0; c4 <= min(7, -8 * c1 + (nj - 1) / 4); c4 += 1)
          for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1) {
            if (nj >= 32 * c1 + 4 * c4 + 4) {
              #pragma simd
              for (int c6 = 0; c6 <= 3; c6 += 1)
                Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
            } else
              #pragma simd
              for (int c6 = 0; c6 < nj - 32 * c1 - 4 * c4; c6 += 1)
                Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
          }
    }
else
  { /* original code */ }
real 0m1.710s
user 0m1.666s
sys 0m0.045s
$./a.out
0.840746
$./a.out
0.918923
$./a.out
0.921482
============================================================
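To make the difference between the two dumps concrete, here is a minimal C sketch of just the innermost c6 loop in both forms. It is only an illustration, not Polly's generated code: the function names (min_int, sum_with_min, sum_separated, count_checked_pairs) are hypothetical, and the statement body is replaced by summing the second index so the two variants can be compared.

```c
#include <assert.h>

/* Hypothetical helper standing in for min() in the AST dump. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* "Without patch" form: the upper bound goes through min() on every entry.
   Returns the sum of the second index (32*c1 + 4*c4 + c6) over all visited c6. */
long sum_with_min(int nj, int c1, int c4) {
  long sum = 0;
  for (int c6 = 0; c6 <= min_int(3, nj - 32 * c1 - 4 * c4 - 1); c6 += 1)
    sum += 32 * c1 + 4 * c4 + c6;
  return sum;
}

/* "With patch" form: a full 4-wide strip gets a constant trip count,
   and only the partial strip keeps a bound that depends on nj. */
long sum_separated(int nj, int c1, int c4) {
  long sum = 0;
  if (nj >= 32 * c1 + 4 * c4 + 4) {
    for (int c6 = 0; c6 <= 3; c6 += 1)
      sum += 32 * c1 + 4 * c4 + c6;
  } else {
    for (int c6 = 0; c6 < nj - 32 * c1 - 4 * c4; c6 += 1)
      sum += 32 * c1 + 4 * c4 + c6;
  }
  return sum;
}

/* Walks the same (c1, c4) ranges as the dumps for a given nj >= 1, asserts
   that both forms agree, and returns the number of pairs checked. */
int count_checked_pairs(int nj) {
  int pairs = 0;
  assert(nj >= 1);
  for (int c1 = 0; c1 <= (nj - 1) / 32; c1 += 1)
    for (int c4 = 0; c4 <= min_int(7, -8 * c1 + (nj - 1) / 4); c4 += 1) {
      assert(sum_with_min(nj, c1, c4) == sum_separated(nj, c1, c4));
      pairs += 1;
    }
  return pairs;
}
```

The constant-bound branch is the one a vectorizer can handle without a remainder check, which is presumably where the runtime difference above comes from.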
Now, looking at this code, I wonder if it would not make sense to do the separation
higher up in the tree. Specifically, do the isolation for 1st level tiles. This should also
remove the min/max conditions, but it adds the condition further up in the tree. This could
be faster by itself and will also benefit register tiling and similar transformations.
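As a rough sketch of what isolating at the tile level means, the following C code contrasts a tiled 1-D traversal that clamps every tile with min() against one that handles all full tiles first and the remainder separately. This is a simplified illustration under my own assumptions (one dimension, a plain array sum as the statement body, hypothetical names sum_tiled_min and sum_tiled_isolated), not the schedule Polly would produce.

```c
#include <assert.h>

/* Tile size taken from the AST dumps above. */
enum { TILE = 32 };

/* Every tile clamps its point loop, like min(31, n - 32 * c0 - 1). */
long sum_tiled_min(const int *a, int n) {
  long sum = 0;
  for (int c0 = 0; c0 * TILE < n; c0 += 1) {
    int last = n - TILE * c0 - 1;
    if (last > TILE - 1)
      last = TILE - 1;
    for (int c3 = 0; c3 <= last; c3 += 1)
      sum += a[TILE * c0 + c3];
  }
  return sum;
}

/* Full tiles are isolated: their point loop has a constant trip count.
   The single partial tile, if any, is handled by a separate loop. */
long sum_tiled_isolated(const int *a, int n) {
  long sum = 0;
  assert(n >= 0);
  int full_tiles = n / TILE;
  for (int c0 = 0; c0 < full_tiles; c0 += 1)
    for (int c3 = 0; c3 < TILE; c3 += 1)
      sum += a[TILE * c0 + c3];
  for (int i = TILE * full_tiles; i < n; i += 1)
    sum += a[i];
  return sum;
}
```

With the split made at this level, everything nested inside a full tile sees constant bounds, so further tiling or unrolling of the point loops needs no extra remainder handling there.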
Now, this suggestion is just from looking at the code. In some cases LLVM transformations
(e.g. loop unswitching, or dead-code removal after inlining) may already eliminate this condition.
I tried your patch with -fno-inline and, surprisingly to me, it produces even faster code:
$./a.out
0.469482
Best,
Tobias