[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence
Star Tan
tanmx_star at yeah.net
Thu Aug 8 01:29:11 PDT 2013
Hi all,
I have summarized the top 10 compiler passes for Polly when compiling LLVM test-ssuite. Results can be viewed on:
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
Based on the comparison between "clang -O3" and "polly -O3" listed on:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14
we can see Polly's compile-time overhead is mainly resulted by some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. Especially, I notice that the PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:
nestedloop: 41.4% (Polly - Calculate dependence)
salsa20: 98.5% (Polly - Calculate dependence)
seidel-2d: 72.1% (Polly - Calculate dependence)
multiplies: 54.3% (Poly - Calculate dependence)
Puzzle: 22.8% (Poly - Calculate dependence)
As a result, it is critical to improve the PollyDependence pass. I have previously committed a patch file to split start value from SCEVAddRecExpr (r187728), which can improve the PollyDependence pass for some benchmarks as follows:
http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
Obviously, it is not enough.
I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of total compile-time. The key code of salsa20 benchmark consists of a serial of array operations:
#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
for (i = 20;i > 0;i -= 2) {
x[ 4] ^= R(x[ 0]+x[12], 7); x[ 8] ^= R(x[ 4]+x[ 0], 9);
x[12] ^= R(x[ 8]+x[ 4],13); x[ 0] ^= R(x[12]+x[ 8],18);
....
x[14] ^= R(x[13]+x[12],13); x[15] ^= R(x[14]+x[13],18);
}
which would be translated into a serial of affine functions:
ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[12] };
ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
MustWriteAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[0] };
...
Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite as follows:
Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <= 15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) o!
r (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_fo!
r_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0
<= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
Write: ...
As we can see, the reason why PollyDependence leads to expensive compile-time lies in the very complex Read/Write/MayWrite structure. In fact, we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite before feeding them into ISL analysis:
diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp index 9f918f3..39c3fb6 100644 --- a/lib/Analysis/Dependences.cpp +++ b/lib/Analysis/Dependences.cpp @@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) { collectInfo(S, &Read, &Write, &MayWrite, &Schedule); + Read = isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); + MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read: " << Read << "\n"
With this patch file, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for benchmarks, such as "nestedloop", is still very high.
My plan is to continue investigating why PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?
Thanks,
Star Tan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130808/145c2f20/attachment.html>
More information about the llvm-dev
mailing list