<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial">I see. Thank you.<div>Do you have any ideas to reduce such expensive dependence analysis caused by those extra load/store instructions?</div><div>Any suggestion would be appreciated. <span style="font-size: 14px; line-height: 1.7;">Thanks~</span></div><div><div><div><pre>At 2013-08-16 09:38:51,"Sebastian Pop" <sebpop@gmail.com> wrote:

>I do not think that running SROA before polly is a good idea:

>it would defeat the purpose of the code preparation passes that

>polly intentionally schedules for the data dependence analysis.

>If you remove the data references before polly runs, you would

>miss them in the dependence graph: that could lead to incorrect

>transforms.

>

>

>On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star@yeah.net> wrote:

>> Hi Sebpop,

>>

>> Thanks for your explanation.

>>

>> I noticed that Polly would finally run the SROA pass to transform these

>> load/store instructions into scalar operations.  Is it possible to run such

>> a pass before polly-dependence analysis?

>>

>> Star Tan

>>

>> At 2013-08-15 21:12:53,"Sebastian Pop" <sebpop@gmail.com> wrote:

>>>Codeprepare and independent blocks are introducing these loads and stores.

>>>These are prepasses that polly runs prior to building the dependence graph

>>>to transform scalar dependences into data dependences.

>>>Ether was working on eliminating the rewrite of scalar dependences.

>>>

>>>On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star@yeah.net> wrote:

>>>> Hi all,

>>>>

>>>> I have investigated the 6X extra compile-time overhead when Polly

>>>> compiles

>>>> the simple nestedloop benchmark in LLVM-testsuite.

>>>>

>>>> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).

>>>> Preliminary results show that such compile-time overhead is resulted by

>>>> the

>>>> complicated polly-dependence analysis. However, the key seems to be the

>>>> polly-prepare pass, which introduces a large number of store

>>>> instructions,

>>>> and thus significantly complicating the polly-dependence pass.

>>>>

>>>> Let me show the results with a tiny code example:

>>>>

>>>> int main(int argc, char *argv[]) {

>>>>   int n = ((argc == 2) ? atoi(argv[1]) : 46);

>>>>   int a, b, x=0;

>>>>   for (a=0; a<n; a++)

>>>>     for (b=0; b<n; b++)

>>>>         x++;

>>>>   printf("%d\n", x);

>>>>   return(0);

>>>> }

>>>>

>>>> The basic LLVM IR code (produced by "clang -O1") is shown as:

>>>>

>>>> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1

>>>> ; Function Attrs: nounwind uwtable

>>>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {

>>>> entry:

>>>>   %cmp = icmp eq i32 %argc, 2

>>>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph

>>>> cond.end:

>>>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1

>>>>   %0 = load i8** %arrayidx, align 8

>>>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32

>>>> (i8*,

>>>> ...)*)(i8* %0) #3

>>>>   %cmp117 = icmp sgt i32 %call, 0

>>>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8

>>>> for.cond2.preheader.lr.ph:

>>>>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]

>>>>   %cmp314 = icmp sgt i32 %cond22, 0

>>>>   br label %for.cond2.preheader

>>>> for.cond2.preheader:

>>>>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa,

>>>> %for.inc6 ]

>>>>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6

>>>> ]

>>>>   br i1 %cmp314, label %for.body4, label %for.inc6

>>>> for.body4:

>>>>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]

>>>>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]

>>>>   %inc = add nsw i32 %x.116, 1

>>>>   %inc5 = add nsw i32 %b.015, 1

>>>>   %cmp3 = icmp slt i32 %inc5, %cond22

>>>>   br i1 %cmp3, label %for.body4, label %for.inc6

>>>> for.inc6:

>>>>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc,

>>>> %for.body4

>>>> ]

>>>>   %inc7 = add nsw i32 %a.018, 1

>>>>   %cmp1 = icmp slt i32 %inc7, %cond22

>>>>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8

>>>> for.end8:

>>>>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]

>>>>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds

>>>> ([4

>>>> x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3

>>>>   ret i32 0

>>>> }

>>>> declare i32 @atoi(...)

>>>> declare i32 @printf(i8* nocapture readonly, ...)

>>>>

>>>> Such code is very simple and there is no memory instruction at all, so

>>>> the

>>>> polly-dependence pass runs very fast on this code. Unfortunately, when we

>>>> run "opt -load LLVMPolly.so" for this basic LLVM IR code, the

>>>> polly-prepare

>>>> pass would introduce a large number of store instructions like this:

>>>>

>>>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {

>>>> entry:

>>>>   %cond22.reg2mem = alloca i32

>>>>   %x.019.reg2mem = alloca i32

>>>>   %x.1.lcssa.reg2mem = alloca i32

>>>>   %x.1.lcssa.lcssa.reg2mem = alloca i32

>>>>   %x.0.lcssa.reg2mem = alloca i32

>>>>   br label %entry.split

>>>> entry.split:

>>>>   %cmp = icmp eq i32 %argc, 2

>>>>   store i32 46, i32* %cond22.reg2mem

>>>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph

>>>> cond.end:

>>>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1

>>>>   %0 = load i8** %arrayidx, align 8

>>>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32

>>>> (i8*,

>>>> ...)*)(i8* %0)

>>>>   %cmp117 = icmp sgt i32 %call, 0

>>>>   store i32 0, i32* %x.0.lcssa.reg2mem

>>>>   store i32 %call, i32* %cond22.reg2mem

>>>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8

>>>>   ...

>>>>

>>>> These store instructions significantly complicate the "polly-dependence"

>>>> pass, and thus leading to high compile-time overhead.

>>>>

>>>> I have noticed that such memory instructions are finally simplified to

>>>> scalar operations by the SROA pass, so one possible way to reduce such

>>>> compile-time overhead is to move the SROA pass ahead of polly-dependence

>>>> analysis.

>>>>

>>>> Can anyone give me some hints that why the polly-prepare pass introduces

>>>> such memory instructions? Is it possible to move the SROA pass ahead of

>>>> polly-dependence analysis?

>>>>

>>>> Thanks,

>>>> Star Tan

>>>>

>>>> --

>>>> You received this message because you are subscribed to the Google Groups

>>>> "Polly Development" group.

>>>> To unsubscribe from this group and stop receiving emails from it, send an

>>>> email to polly-dev+unsubscribe@googlegroups.com.

>>>> For more options, visit https://groups.google.com/groups/opt_out.

</pre></div></div></div></div>