Hi Sebpop,

Thanks for your explanation.

I noticed that Polly eventually runs the SROA pass, which transforms these load/store instructions back into scalar operations. Is it possible to run such a pass before the polly-dependence analysis? (I sketch the opt invocation I have in mind in the P.S. at the end of this mail.)

Star Tan

At 2013-08-15 21:12:53, "Sebastian Pop" <sebpop@gmail.com> wrote:
>Codeprepare and independent blocks are introducing these loads and stores.
>These are prepasses that Polly runs prior to building the dependence graph
>to transform scalar dependences into data dependences.
>Ether was working on eliminating the rewrite of scalar dependences.
>
>On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star@yeah.net> wrote:
>> Hi all,
>>
>> I have investigated the 6X extra compile-time overhead when Polly compiles
>> the simple nestedloop benchmark in the LLVM test-suite
>> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).
>> Preliminary results show that this overhead comes from the complicated
>> polly-dependence analysis. However, the root cause seems to be the
>> polly-prepare pass, which introduces a large number of store instructions
>> that significantly complicate the polly-dependence pass.
>>
>> Let me show the results with a tiny code example:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char *argv[]) {
>>   int n = ((argc == 2) ? atoi(argv[1]) : 46);
>>   int a, b, x = 0;
>>   for (a = 0; a < n; a++)
>>     for (b = 0; b < n; b++)
>>       x++;
>>   printf("%d\n", x);
>>   return 0;
>> }
>>
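>> For reference, the IR below can be regenerated with an invocation along
>> these lines (the file name is only illustrative):
>>
>>   clang -O1 -S -emit-llvm nestedloop.c -o nestedloop.ll
>>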
>> The basic LLVM IR code (produced by "clang -O1") is shown below:
>>
>> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
>>
>> ; Function Attrs: nounwind uwtable
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cmp = icmp eq i32 %argc, 2
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>>
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0) #3
>>   %cmp117 = icmp sgt i32 %call, 0
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>>
>> for.cond2.preheader.lr.ph:
>>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
>>   %cmp314 = icmp sgt i32 %cond22, 0
>>   br label %for.cond2.preheader
>>
>> for.cond2.preheader:
>>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa, %for.inc6 ]
>>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
>>   br i1 %cmp314, label %for.body4, label %for.inc6
>>
>> for.body4:
>>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
>>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
>>   %inc = add nsw i32 %x.116, 1
>>   %inc5 = add nsw i32 %b.015, 1
>>   %cmp3 = icmp slt i32 %inc5, %cond22
>>   br i1 %cmp3, label %for.body4, label %for.inc6
>>
>> for.inc6:
>>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4 ]
>>   %inc7 = add nsw i32 %a.018, 1
>>   %cmp1 = icmp slt i32 %inc7, %cond22
>>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
>>
>> for.end8:
>>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
>>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
>>   ret i32 0
>> }
>>
>> declare i32 @atoi(...)
>> declare i32 @printf(i8* nocapture readonly, ...)
>>
>> This code is very simple and contains no memory instructions at all, so the
>> polly-dependence pass runs very fast on it. Unfortunately, when we run
>> "opt -load LLVMPolly.so" on this IR, the polly-prepare pass introduces a
>> large number of store instructions like this:
>>
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cond22.reg2mem = alloca i32
>>   %x.019.reg2mem = alloca i32
>>   %x.1.lcssa.reg2mem = alloca i32
>>   %x.1.lcssa.lcssa.reg2mem = alloca i32
>>   %x.0.lcssa.reg2mem = alloca i32
>>   br label %entry.split
>>
>> entry.split:
>>   %cmp = icmp eq i32 %argc, 2
>>   store i32 46, i32* %cond22.reg2mem
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>>
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*, ...)*)(i8* %0)
>>   %cmp117 = icmp sgt i32 %call, 0
>>   store i32 0, i32* %x.0.lcssa.reg2mem
>>   store i32 %call, i32* %cond22.reg2mem
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>> ...
>>
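>> For reference, an invocation along these lines should reproduce the dump
>> above (I am assuming the prepasses are exposed to opt as -polly-prepare
>> and -polly-independent; the file name is only illustrative):
>>
>>   opt -load LLVMPolly.so -polly-prepare -polly-independent -S nestedloop.ll
>>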
>> These store instructions significantly complicate the "polly-dependence"
>> pass and thus lead to the high compile-time overhead.
>>
>> I have noticed that these memory instructions are eventually promoted back
>> to scalar operations by the SROA pass, so one possible way to reduce the
>> compile-time overhead is to move the SROA pass ahead of the
>> polly-dependence analysis.
>>
>> Can anyone give me some hints as to why the polly-prepare pass introduces
>> such memory instructions? Is it possible to move the SROA pass ahead of
>> the polly-dependence analysis?
>>
>> Thanks,
>> Star Tan
>>
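P.S. To make the question concrete, the pass ordering I have in mind is roughly the following. This is only a sketch: I am assuming SROA is exposed to opt as -sroa, that the Polly passes keep their -polly-prepare and -polly-dependences names, and that the file name is again illustrative.

  # current behaviour: the prepasses demote scalars to memory, and the
  # dependence analysis then has to reason about all the introduced stores
  opt -load LLVMPolly.so -basicaa -polly-prepare -polly-dependences -analyze nestedloop.ll

  # proposed: let SROA promote the allocas back to scalars before the
  # dependence analysis runs
  opt -load LLVMPolly.so -basicaa -polly-prepare -sroa -polly-dependences -analyze nestedloop.ll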