[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Thu Aug 15 03:32:57 PDT 2013

Hi all,

I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in LLVM-testsuite. (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that such compile-time overhead is resulted by the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions, and thus significantly complicating the polly-dependence pass.

Let me show the results with a tiny code example:

int main(int argc, char *argv[]) {
  int n = ((argc == 2) ? atoi(argv[1]) : 46);
  int a, b, x=0;
  for (a=0; a<n; a++)
    for (b=0; b<n; b++)
        x++;
  printf("%d\n", x);
  return(0);
}

The basic LLVM IR code (produced by "clang -O1") is shown as:

@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
; Function Attrs: nounwind uwtable
define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
  %cmp = icmp eq i32 %argc, 2
  br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:                                        
  %arrayidx = getelementptr inbounds i8** %argv, i64 1
  %0 = load i8** %arrayidx, align 8
  %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
  %cmp117 = icmp sgt i32 %call, 0
  br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
for.cond2.preheader.lr.ph:                    
  %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
  %cmp314 = icmp sgt i32 %cond22, 0
  br label %for.cond2.preheader
for.cond2.preheader:                              
  %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
  %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
  br i1 %cmp314, label %for.body4, label %for.inc6
for.body4:                                        
  %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
  %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
  %inc = add nsw i32 %x.116, 1
  %inc5 = add nsw i32 %b.015, 1
  %cmp3 = icmp slt i32 %inc5, %cond22
  br i1 %cmp3, label %for.body4, label %for.inc6
for.inc6:                                         
  %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
  %inc7 = add nsw i32 %a.018, 1
  %cmp1 = icmp slt i32 %inc7, %cond22
  br i1 %cmp1, label %for.cond2.preheader, label %for.end8
for.end8:                                        
  %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
  %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
  ret i32 0
}
declare i32 @atoi(...)
declare i32 @printf(i8* nocapture readonly, ...)

Such code is very simple and there is no memory instruction at all, so the polly-dependence pass runs very fast on this code. Unfortunately, when we run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare pass would introduce a large number of store instructions like this:

define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
  %cond22.reg2mem = alloca i32
  %x.019.reg2mem = alloca i32
  %x.1.lcssa.reg2mem = alloca i32
  %x.1.lcssa.lcssa.reg2mem = alloca i32
  %x.0.lcssa.reg2mem = alloca i32
  br label %entry.split
entry.split:        
  %cmp = icmp eq i32 %argc, 2
  store i32 46, i32* %cond22.reg2mem
  br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:                            
  %arrayidx = getelementptr inbounds i8** %argv, i64 1
  %0 = load i8** %arrayidx, align 8
  %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)
  %cmp117 = icmp sgt i32 %call, 0
  store i32 0, i32* %x.0.lcssa.reg2mem
  store i32 %call, i32* %cond22.reg2mem
  br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
  ...

These store instructions significantly complicate the "polly-dependence" pass, and thus leading to high compile-time overhead.

I have noticed that such memory instructions are finally simplified to scalar operations by the SROA pass, so one possible way to reduce such compile-time overhead is to move the SROA pass ahead of polly-dependence analysis.

Can anyone give me some hints that why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of polly-dependence analysis?

Thanks,
Star Tan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130815/e379df67/attachment.html>