[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Thu Aug 15 06:12:53 PDT 2013

Codeprepare and independent blocks are introducing these loads and stores.
These are prepasses that polly runs prior to building the dependence graph
to transform scalar dependences into data dependences.
Ether was working on eliminating the rewrite of scalar dependences.

On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote:
> Hi all,
>
> I have investigated the 6X extra compile-time overhead when Polly compiles
> the simple nestedloop benchmark in LLVM-testsuite.
> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).
> Preliminary results show that such compile-time overhead is resulted by the
> complicated polly-dependence analysis. However, the key seems to be the
> polly-prepare pass, which introduces a large number of store instructions,
> and thus significantly complicating the polly-dependence pass.
>
> Let me show the results with a tiny code example:
>
> int main(int argc, char *argv[]) {
>   int n = ((argc == 2) ? atoi(argv[1]) : 46);
>   int a, b, x=0;
>   for (a=0; a<n; a++)
>     for (b=0; b<n; b++)
>         x++;
>   printf("%d\n", x);
>   return(0);
> }
>
> The basic LLVM IR code (produced by "clang -O1") is shown as:
>
> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
> ; Function Attrs: nounwind uwtable
> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
> entry:
>   %cmp = icmp eq i32 %argc, 2
>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
> cond.end:
>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>   %0 = load i8** %arrayidx, align 8
>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*,
> ...)*)(i8* %0) #3
>   %cmp117 = icmp sgt i32 %call, 0
>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
> for.cond2.preheader.lr.ph:
>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
>   %cmp314 = icmp sgt i32 %cond22, 0
>   br label %for.cond2.preheader
> for.cond2.preheader:
>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa,
> %for.inc6 ]
>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
>   br i1 %cmp314, label %for.body4, label %for.inc6
> for.body4:
>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
>   %inc = add nsw i32 %x.116, 1
>   %inc5 = add nsw i32 %b.015, 1
>   %cmp3 = icmp slt i32 %inc5, %cond22
>   br i1 %cmp3, label %for.body4, label %for.inc6
> for.inc6:
>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4
> ]
>   %inc7 = add nsw i32 %a.018, 1
>   %cmp1 = icmp slt i32 %inc7, %cond22
>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
> for.end8:
>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4
> x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
>   ret i32 0
> }
> declare i32 @atoi(...)
> declare i32 @printf(i8* nocapture readonly, ...)
>
> Such code is very simple and there is no memory instruction at all, so the
> polly-dependence pass runs very fast on this code. Unfortunately, when we
> run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare
> pass would introduce a large number of store instructions like this:
>
> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
> entry:
>   %cond22.reg2mem = alloca i32
>   %x.019.reg2mem = alloca i32
>   %x.1.lcssa.reg2mem = alloca i32
>   %x.1.lcssa.lcssa.reg2mem = alloca i32
>   %x.0.lcssa.reg2mem = alloca i32
>   br label %entry.split
> entry.split:
>   %cmp = icmp eq i32 %argc, 2
>   store i32 46, i32* %cond22.reg2mem
>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
> cond.end:
>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>   %0 = load i8** %arrayidx, align 8
>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*,
> ...)*)(i8* %0)
>   %cmp117 = icmp sgt i32 %call, 0
>   store i32 0, i32* %x.0.lcssa.reg2mem
>   store i32 %call, i32* %cond22.reg2mem
>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>   ...
>
> These store instructions significantly complicate the "polly-dependence"
> pass, and thus leading to high compile-time overhead.
>
> I have noticed that such memory instructions are finally simplified to
> scalar operations by the SROA pass, so one possible way to reduce such
> compile-time overhead is to move the SROA pass ahead of polly-dependence
> analysis.
>
> Can anyone give me some hints that why the polly-prepare pass introduces
> such memory instructions? Is it possible to move the SROA pass ahead of
> polly-dependence analysis?
>
> Thanks,
> Star Tan
>
> --
> You received this message because you are subscribed to the Google Groups
> "Polly Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to polly-dev+unsubscribe at googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.