[LLVMdev] [Polly] Summary of some expensive compiler passes, especially PollyDependence

Thu Aug 8 01:29:11 PDT 2013

Hi all, 

I have summarized the top 10 compiler passes for Polly when compiling LLVM test-ssuite. Results can be viewed on:
    https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between "clang -O3" and "polly -O3" listed on:
    http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

we can see Polly's compile-time overhead is mainly resulted by some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. Especially, I notice that the PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:
    nestedloop:  41.4% (Polly - Calculate dependence)
    salsa20:       98.5% (Polly - Calculate dependence)
    seidel-2d:     72.1% (Polly - Calculate dependence)
    multiplies:     54.3% (Poly -  Calculate dependence)
    Puzzle:         22.8% (Poly -  Calculate dependence)

As a result, it is critical to improve the PollyDependence pass.  I have previously committed a patch file to split start value from SCEVAddRecExpr (r187728), which can improve the PollyDependence pass for some benchmarks as follows:
     http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
Obviously, it is not enough.

I have investigated  the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of total compile-time. The key code of salsa20 benchmark consists of a serial of array operations:
  #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
  for (i = 20;i > 0;i -= 2) {
    x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
    x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
    ....
    x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
  }
which would be translated into a serial of affine functions:
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[12] };
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
      MustWriteAccess :=    { Stmt_for_body5[i0] -> MemRef_x[4] };
      ReadAccess :=    { Stmt_for_body5[i0] -> MemRef_x[0] };
      ...
Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite as follows:

      Read: { Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <= 15; Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) o!
 r (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_fo!
 r_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 
<= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }
      Write: ...

As we can see, the reason why PollyDependence leads to expensive compile-time lies in the very complex Read/Write/MayWrite structure.  In fact, we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite before feeding them into ISL analysis:

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp index 9f918f3..39c3fb6 100644 --- a/lib/Analysis/Dependences.cpp +++ b/lib/Analysis/Dependences.cpp @@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) { collectInfo(S, &Read, &Write, &MayWrite, &Schedule); + Read = isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); + MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read: " << Read << "\n"

With this patch file, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for benchmarks, such as "nestedloop", is still very high.  

My plan is to continue investigating why PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?

Thanks,
Star Tan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130808/145c2f20/attachment.html>