[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Thu Aug 15 18:48:12 PDT 2013

I see. Thank you.
Do you have any ideas to reduce such expensive dependence analysis caused by those extra load/store instructions?
Any suggestion would be appreciated. Thanks~
At 2013-08-16 09:38:51,"Sebastian Pop" <sebpop at gmail.com> wrote:
>I do not think that running SROA before polly is a good idea:
>it would defeat the purpose of the code preparation passes that
>polly intentionally schedules for the data dependence analysis.
>If you remove the data references before polly runs, you would
>miss them in the dependence graph: that could lead to incorrect
>transforms.
>
>
>On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star at yeah.net> wrote:
>> Hi Sebpop,
>>
>> Thanks for your explanation.
>>
>> I noticed that Polly would finally run the SROA pass to transform these
>> load/store instructions into scalar operations.  Is it possible to run such
>> a pass before polly-dependence analysis?
>>
>> Star Tan
>>
>> At 2013-08-15 21:12:53,"Sebastian Pop" <sebpop at gmail.com> wrote:
>>>Codeprepare and independent blocks are introducing these loads and stores.
>>>These are prepasses that polly runs prior to building the dependence graph
>>>to transform scalar dependences into data dependences.
>>>Ether was working on eliminating the rewrite of scalar dependences.
>>>
>>>On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote:
>>>> Hi all,
>>>>
>>>> I have investigated the 6X extra compile-time overhead when Polly
>>>> compiles
>>>> the simple nestedloop benchmark in LLVM-testsuite.
>>>>
>>>> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).
>>>> Preliminary results show that such compile-time overhead is resulted by
>>>> the
>>>> complicated polly-dependence analysis. However, the key seems to be the
>>>> polly-prepare pass, which introduces a large number of store
>>>> instructions,
>>>> and thus significantly complicating the polly-dependence pass.
>>>>
>>>> Let me show the results with a tiny code example:
>>>>
>>>> int main(int argc, char *argv[]) {
>>>>   int n = ((argc == 2) ? atoi(argv[1]) : 46);
>>>>   int a, b, x=0;
>>>>   for (a=0; a<n; a++)
>>>>     for (b=0; b<n; b++)
>>>>         x++;
>>>>   printf("%d\n", x);
>>>>   return(0);
>>>> }
>>>>
>>>> The basic LLVM IR code (produced by "clang -O1") is shown as:
>>>>
>>>> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
>>>> ; Function Attrs: nounwind uwtable
>>>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>>>> entry:
>>>>   %cmp = icmp eq i32 %argc, 2
>>>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>>>> cond.end:
>>>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>>>   %0 = load i8** %arrayidx, align 8
>>>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32
>>>> (i8*,
>>>> ...)*)(i8* %0) #3
>>>>   %cmp117 = icmp sgt i32 %call, 0
>>>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>>>> for.cond2.preheader.lr.ph:
>>>>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
>>>>   %cmp314 = icmp sgt i32 %cond22, 0
>>>>   br label %for.cond2.preheader
>>>> for.cond2.preheader:
>>>>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa,
>>>> %for.inc6 ]
>>>>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6
>>>> ]
>>>>   br i1 %cmp314, label %for.body4, label %for.inc6
>>>> for.body4:
>>>>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
>>>>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
>>>>   %inc = add nsw i32 %x.116, 1
>>>>   %inc5 = add nsw i32 %b.015, 1
>>>>   %cmp3 = icmp slt i32 %inc5, %cond22
>>>>   br i1 %cmp3, label %for.body4, label %for.inc6
>>>> for.inc6:
>>>>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc,
>>>> %for.body4
>>>> ]
>>>>   %inc7 = add nsw i32 %a.018, 1
>>>>   %cmp1 = icmp slt i32 %inc7, %cond22
>>>>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
>>>> for.end8:
>>>>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
>>>>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds
>>>> ([4
>>>> x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
>>>>   ret i32 0
>>>> }
>>>> declare i32 @atoi(...)
>>>> declare i32 @printf(i8* nocapture readonly, ...)
>>>>
>>>> Such code is very simple and there is no memory instruction at all, so
>>>> the
>>>> polly-dependence pass runs very fast on this code. Unfortunately, when we
>>>> run "opt -load LLVMPolly.so" for this basic LLVM IR code, the
>>>> polly-prepare
>>>> pass would introduce a large number of store instructions like this:
>>>>
>>>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>>>> entry:
>>>>   %cond22.reg2mem = alloca i32
>>>>   %x.019.reg2mem = alloca i32
>>>>   %x.1.lcssa.reg2mem = alloca i32
>>>>   %x.1.lcssa.lcssa.reg2mem = alloca i32
>>>>   %x.0.lcssa.reg2mem = alloca i32
>>>>   br label %entry.split
>>>> entry.split:
>>>>   %cmp = icmp eq i32 %argc, 2
>>>>   store i32 46, i32* %cond22.reg2mem
>>>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>>>> cond.end:
>>>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>>>   %0 = load i8** %arrayidx, align 8
>>>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32
>>>> (i8*,
>>>> ...)*)(i8* %0)
>>>>   %cmp117 = icmp sgt i32 %call, 0
>>>>   store i32 0, i32* %x.0.lcssa.reg2mem
>>>>   store i32 %call, i32* %cond22.reg2mem
>>>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>>>>   ...
>>>>
>>>> These store instructions significantly complicate the "polly-dependence"
>>>> pass, and thus leading to high compile-time overhead.
>>>>
>>>> I have noticed that such memory instructions are finally simplified to
>>>> scalar operations by the SROA pass, so one possible way to reduce such
>>>> compile-time overhead is to move the SROA pass ahead of polly-dependence
>>>> analysis.
>>>>
>>>> Can anyone give me some hints that why the polly-prepare pass introduces
>>>> such memory instructions? Is it possible to move the SROA pass ahead of
>>>> polly-dependence analysis?
>>>>
>>>> Thanks,
>>>> Star Tan
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "Polly Development" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an
>>>> email to polly-dev+unsubscribe at googlegroups.com.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130816/43c17919/attachment.html>