<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><div>Hi all,</div><div><br></div><div>I have investigated the 6X extra compile-time overhead when Polly compiles the simple nestedloop benchmark in LLVM-testsuite. (<a href="http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28" style="font-size: 14px; line-height: 1.7;">http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28</a><span style="font-size: 14px; line-height: 1.7;">). Preliminary results show that such compile-time overhead is resulted by the complicated polly-dependence analysis. However, the key seems to be the polly-prepare pass, which introduces a large number of store instructions, and thus significantly complicating the polly-dependence pass.</span></div><div><br></div><div>Let me show the results with a tiny code example:</div><div><br></div><div>int main(int argc, char *argv[]) {</div><div> int n = ((argc == 2) ? atoi(argv[1]) :!
46);</div><div> int a, b, x=0;</div><div> for (a=0; a<n; a++)</div><div> for (b=0; b<n; b++)</div><div> x++;</div><div> printf("%d\n", x);</div><div> return(0);</div><div>}</div><div><br></div><div>The basic LLVM IR code (produced by "clang -O1") is shown as:</div><div><br></div><div>@.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1</div><div>; Function Attrs: nounwind uwtable</div><div>define i32 @main(i32 %argc, i8** nocapture readonly %argv) {</div><div>entry:</div><div> %cmp = icmp eq i32 %argc, 2</div><div> br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph</div><div>cond.end: </div><div> %arrayidx = getelementptr inbounds i8** %argv, i64 1</div><div> %0 = load i8** %arrayidx, align 8</div><div> %call = t!
ail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)
(i8* %0) #3</div><div> %cmp117 = icmp sgt i32 %call, 0</div><div> br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8</div><div>for.cond2.preheader.lr.ph: </div><div> %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]</div><div> %cmp314 = icmp sgt i32 %cond22, 0</div><div> br label %for.cond2.preheader</div><div>for.cond2.preheader: </div><div> %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]</div><div> %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]</div><div> br i1 %cmp314, label %for.body4, label %for.inc6</div><div>for.body4: </div><div> %x.1!
16 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]</div><div> %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]</div><div> %inc = add nsw i32 %x.116, 1</div><div> %inc5 = add nsw i32 %b.015, 1</div><div> %cmp3 = icmp slt i32 %inc5, %cond22</div><div> br i1 %cmp3, label %for.body4, label %for.inc6</div><div>for.inc6: </div><div> %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]</div><div> %inc7 = add nsw i32 %a.018, 1</div><div> %cmp1 = icmp slt i32 %inc7, %cond22</div><div> br i1 %cmp1, label %for.cond2.preheader, label %for.end8</div><div>for.end8: </div><div> %x.0.lcssa = ph!
i i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]</div><div> %ca
ll9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3</div><div> ret i32 0</div><div>}</div><div>declare i32 @atoi(...)</div><div>declare i32 @printf(i8* nocapture readonly, ...)</div><div><br></div><div>Such code is very simple and there is no memory instruction at all, so the polly-dependence pass runs very fast on this code. Unfortunately, when we run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare pass would introduce a large number of store instructions like this:</div><div><br></div><div>define i32 @main(i32 %argc, i8** nocapture readonly %argv) {</div><div>entry:</div><div> %cond22.reg2mem = alloca i32</div><div> %x.019.reg2mem = alloca i32</div><div> %x.1.lcssa.reg2mem = alloca i32</div><div> %x.1.lcssa.lcssa.reg2mem = alloca i32</div><div> %x.0.lcssa.reg2mem = alloca i32</div><div> br label %entry.split</div><div>entry.split: &nbs!
p; </div><div> %cmp = icmp eq i32 %argc, 2</div><div> store i32 46, i32* %cond22.reg2mem</div><div> br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph</div><div>cond.end: </div><div> %arrayidx = getelementptr inbounds i8** %argv, i64 1</div><div> %0 = load i8** %arrayidx, align 8</div><div> %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)</div><div> %cmp117 = icmp sgt i32 %call, 0</div><div> store i32 0, i32* %x.0.lcssa.reg2mem</div><div> store i32 %call, i32* %cond22.reg2mem</div><div> br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8</div><div> ...</div><div> </div><div>These store instructions significantly complicate the "polly-dependence" pass, and thus leading to high compile-time overhead.</div><div><br></div><div>I hav!
e noticed that such memory instructions are finally simplified to scal
ar operations by the SROA pass, so one possible way to reduce such compile-time overhead is to move <span style="font-size: 14px; line-height: 1.7;">the SROA pass ahead of </span><span style="font-size: 14px; line-height: 1.7;">polly-dependence analysis.</span></div><div><br></div><div>Can anyone give me some hints that why the polly-prepare pass introduces such memory instructions? Is it possible to move the SROA pass ahead of polly-dependence analysis?</div><div><br></div><div>Thanks,</div><div>Star Tan</div></div>