[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Star Tan
tanmx_star at yeah.net
Thu Aug 15 03:32:57 PDT 2013
Hi all,
I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in LLVM-testsuite. (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28). Preliminary results show that such compile-time overhead is resulted by the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions, and thus significantly complicating the polly-dependence pass.
Let me show the results with a tiny code example:
int main(int argc, char *argv[]) {
int n = ((argc == 2) ? atoi(argv[1]) : 46);
int a, b, x=0;
for (a=0; a<n; a++)
for (b=0; b<n; b++)
x++;
printf("%d\n", x);
return(0);
}
The basic LLVM IR code (produced by "clang -O1") is shown as:
@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
; Function Attrs: nounwind uwtable
define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
%cmp = icmp eq i32 %argc, 2
br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:
%arrayidx = getelementptr inbounds i8** %argv, i64 1
%0 = load i8** %arrayidx, align 8
%call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
%cmp117 = icmp sgt i32 %call, 0
br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
for.cond2.preheader.lr.ph:
%cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
%cmp314 = icmp sgt i32 %cond22, 0
br label %for.cond2.preheader
for.cond2.preheader:
%x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
%a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
br i1 %cmp314, label %for.body4, label %for.inc6
for.body4:
%x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
%b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
%inc = add nsw i32 %x.116, 1
%inc5 = add nsw i32 %b.015, 1
%cmp3 = icmp slt i32 %inc5, %cond22
br i1 %cmp3, label %for.body4, label %for.inc6
for.inc6:
%x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
%inc7 = add nsw i32 %a.018, 1
%cmp1 = icmp slt i32 %inc7, %cond22
br i1 %cmp1, label %for.cond2.preheader, label %for.end8
for.end8:
%x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
%call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
ret i32 0
}
declare i32 @atoi(...)
declare i32 @printf(i8* nocapture readonly, ...)
Such code is very simple and there is no memory instruction at all, so the polly-dependence pass runs very fast on this code. Unfortunately, when we run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare pass would introduce a large number of store instructions like this:
define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
entry:
%cond22.reg2mem = alloca i32
%x.019.reg2mem = alloca i32
%x.1.lcssa.reg2mem = alloca i32
%x.1.lcssa.lcssa.reg2mem = alloca i32
%x.0.lcssa.reg2mem = alloca i32
br label %entry.split
entry.split:
%cmp = icmp eq i32 %argc, 2
store i32 46, i32* %cond22.reg2mem
br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
cond.end:
%arrayidx = getelementptr inbounds i8** %argv, i64 1
%0 = load i8** %arrayidx, align 8
%call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)
%cmp117 = icmp sgt i32 %call, 0
store i32 0, i32* %x.0.lcssa.reg2mem
store i32 %call, i32* %cond22.reg2mem
br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
...
These store instructions significantly complicate the "polly-dependence" pass, and thus leading to high compile-time overhead.
I have noticed that such memory instructions are finally simplified to scalar operations by the SROA pass, so one possible way to reduce such compile-time overhead is to move the SROA pass ahead of polly-dependence analysis.
Can anyone give me some hints that why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of polly-dependence analysis?
Thanks,
Star Tan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130815/e379df67/attachment.html>
More information about the llvm-dev
mailing list