[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops

Star Tan tanmx_star at yeah.net
Thu Aug 15 17:28:03 PDT 2013


Hi Sebpop,

Thanks for your explanation.  
I noticed that Polly would finally run the SROA pass to transform these load/store instructions into scalar operations.  Is it possible to run such a pass before polly-dependence analysis?
Star Tan
At 2013-08-15 21:12:53,"Sebastian Pop" <sebpop at gmail.com> wrote:
>Codeprepare and independent blocks are introducing these loads and stores.
>These are prepasses that polly runs prior to building the dependence graph
>to transform scalar dependences into data dependences.
>Ether was working on eliminating the rewrite of scalar dependences.
>
>On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote:
>> Hi all,
>>
>> I have investigated the 6X extra compile-time overhead when Polly compiles
>> the simple nestedloop benchmark in LLVM-testsuite.
>> (http://188.40.87.11:8000/db_default/v4/nts/31?compare_to=28&baseline=28).
>> Preliminary results show that such compile-time overhead is resulted by the
>> complicated polly-dependence analysis. However, the key seems to be the
>> polly-prepare pass, which introduces a large number of store instructions,
>> and thus significantly complicating the polly-dependence pass.
>>
>> Let me show the results with a tiny code example:
>>
>> int main(int argc, char *argv[]) {
>>   int n = ((argc == 2) ? atoi(argv[1]) : 46);
>>   int a, b, x=0;
>>   for (a=0; a<n; a++)
>>     for (b=0; b<n; b++)
>>         x++;
>>   printf("%d\n", x);
>>   return(0);
>> }
>>
>> The basic LLVM IR code (produced by "clang -O1") is shown as:
>>
>> @.str = private unnamed_addr constant [4 x i8] c"%d\0A\00", align 1
>> ; Function Attrs: nounwind uwtable
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cmp = icmp eq i32 %argc, 2
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*,
>> ...)*)(i8* %0) #3
>>   %cmp117 = icmp sgt i32 %call, 0
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>> for.cond2.preheader.lr.ph:
>>   %cond22 = phi i32 [ %call, %cond.end ], [ 46, %entry ]
>>   %cmp314 = icmp sgt i32 %cond22, 0
>>   br label %for.cond2.preheader
>> for.cond2.preheader:
>>   %x.019 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %x.1.lcssa,
>> %for.inc6 ]
>>   %a.018 = phi i32 [ 0, %for.cond2.preheader.lr.ph ], [ %inc7, %for.inc6 ]
>>   br i1 %cmp314, label %for.body4, label %for.inc6
>> for.body4:
>>   %x.116 = phi i32 [ %inc, %for.body4 ], [ %x.019, %for.cond2.preheader ]
>>   %b.015 = phi i32 [ %inc5, %for.body4 ], [ 0, %for.cond2.preheader ]
>>   %inc = add nsw i32 %x.116, 1
>>   %inc5 = add nsw i32 %b.015, 1
>>   %cmp3 = icmp slt i32 %inc5, %cond22
>>   br i1 %cmp3, label %for.body4, label %for.inc6
>> for.inc6:
>>   %x.1.lcssa = phi i32 [ %x.019, %for.cond2.preheader ], [ %inc, %for.body4
>> ]
>>   %inc7 = add nsw i32 %a.018, 1
>>   %cmp1 = icmp slt i32 %inc7, %cond22
>>   br i1 %cmp1, label %for.cond2.preheader, label %for.end8
>> for.end8:
>>   %x.0.lcssa = phi i32 [ 0, %cond.end ], [ %x.1.lcssa, %for.inc6 ]
>>   %call9 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4
>> x i8]* @.str, i64 0, i64 0), i32 %x.0.lcssa) #3
>>   ret i32 0
>> }
>> declare i32 @atoi(...)
>> declare i32 @printf(i8* nocapture readonly, ...)
>>
>> Such code is very simple and there is no memory instruction at all, so the
>> polly-dependence pass runs very fast on this code. Unfortunately, when we
>> run "opt -load LLVMPolly.so" for this basic LLVM IR code, the polly-prepare
>> pass would introduce a large number of store instructions like this:
>>
>> define i32 @main(i32 %argc, i8** nocapture readonly %argv) {
>> entry:
>>   %cond22.reg2mem = alloca i32
>>   %x.019.reg2mem = alloca i32
>>   %x.1.lcssa.reg2mem = alloca i32
>>   %x.1.lcssa.lcssa.reg2mem = alloca i32
>>   %x.0.lcssa.reg2mem = alloca i32
>>   br label %entry.split
>> entry.split:
>>   %cmp = icmp eq i32 %argc, 2
>>   store i32 46, i32* %cond22.reg2mem
>>   br i1 %cmp, label %cond.end, label %for.cond2.preheader.lr.ph
>> cond.end:
>>   %arrayidx = getelementptr inbounds i8** %argv, i64 1
>>   %0 = load i8** %arrayidx, align 8
>>   %call = tail call i32 (i8*, ...)* bitcast (i32 (...)* @atoi to i32 (i8*,
>> ...)*)(i8* %0)
>>   %cmp117 = icmp sgt i32 %call, 0
>>   store i32 0, i32* %x.0.lcssa.reg2mem
>>   store i32 %call, i32* %cond22.reg2mem
>>   br i1 %cmp117, label %for.cond2.preheader.lr.ph, label %for.end8
>>   ...
>>
>> These store instructions significantly complicate the "polly-dependence"
>> pass, and thus leading to high compile-time overhead.
>>
>> I have noticed that such memory instructions are finally simplified to
>> scalar operations by the SROA pass, so one possible way to reduce such
>> compile-time overhead is to move the SROA pass ahead of polly-dependence
>> analysis.
>>
>> Can anyone give me some hints that why the polly-prepare pass introduces
>> such memory instructions? Is it possible to move the SROA pass ahead of
>> polly-dependence analysis?
>>
>> Thanks,
>> Star Tan
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Polly Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to polly-dev+unsubscribe at googlegroups.com.
>> For more options, visit https://groups.google.com/groups/opt_out.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130816/29a129e4/attachment.html>


More information about the llvm-dev mailing list