[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Mon Jan 23 21:53:29 PST 2017

> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> All targets are likely affected in some way by the icmp+shl fold introduced with r292492. It's a basic pattern that occurs in lots of code. Did you see any perf wins on your targets with this commit?
> 
> Sadly, it is also likely that many (all?) targets are negatively impacted on the particular test (SingleSource/Benchmarks/Shootout/sieve) that you have pointed out here because the IR is now decidedly worse.
> 
> IMO, we should not revert the commit because it exposed shortcomings in the optimizer. It's an "obvious" fold/canonicalization, and the related 'nuw' variant of this fold has existed in trunk since:
> https://reviews.llvm.org/rL285729 <https://reviews.llvm.org/rL285729>
> 
> We need to dissect what analysis/folds are missing to restore the IR to the better form that existed before, but this is probably going to be a long process because we treat min/max like an optimization fence. 

If this is gonna be a long process to recover, this looks like something to be reverted in the 4.0 branch (unless I missed that there is a correctness fix involved?).

— 
Mehdi

> 
> 
> 
> On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote:
> Confirm there is no change in IR if the hack is disabled in the sources.
> 
> David wrote that these instructions are created by SCEV.
> 
> Are other targets affected by the changes, e.g. X86?
> 
>  
> 
> Kind regards,
> Evgeny Astigeevich
> Senior Compiler Engineer
> Compilation Tools
> ARM
> 
>  
> 
> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] 
> Sent: Sunday, January 22, 2017 10:45 PM
> 
> 
> To: Evgeny Astigeevich
> Cc: llvm-dev; nd
> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
> 
>  
> 
> I tried an experiment to remove the integer min/max bailouts from InstCombine, and it doesn't appear to change the IR in the attachment, so I doubt there's going to be any improvement.
> 
> If I haven't messed up this example, this is amazing:
> https://godbolt.org/g/yzoxeY <https://godbolt.org/g/yzoxeY>
>  
> 
> On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote:
> 
> Thank you for information.
> 
> I’ll build clang without the hack and re-run the benchmark tomorrow.
> 
>  
> 
> -Evgeny
> 
>  
> 
> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] 
> Sent: Sunday, January 22, 2017 8:00 PM
> To: Evgeny Astigeevich
> Cc: llvm-dev; nd
> 
> 
> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
> 
>  
> 
> > Do you mean to remove the hack in InstCombiner::visitICmpInst()?
> 
>  
> 
> Yes. Although (this just came up in D28625 too) we might need to remove multiple versions of that in order to unlock optimization:
> 
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338>
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470>
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803>
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409>
> 
> Similar for FP:
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780>
> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376 <https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376>
>  
> 
> On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote:
> 
> Hi Sanjay,
> 
>  
> 
> The benchmark source file: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup <http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup>
> Clang options used to produce the initial IR: clang -DNDEBUG  -O3 -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG   -w -Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux
> 
> Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll >& sieve.log
> 
>  
> 
> I used the IR (in attached sieve.zip) created with the r292487 version.
> 
> The attached sieve contains the output of ‘-print-before-all -print-after-all’  for r292487 and rL292492.
> 
>  
> 
> > If it's possible, can you remove that check locally, rebuild,
> 
> > and try the benchmark again on your system? I'd love to know
> 
> > if that change alone would solve the problem.
> 
>  
> 
> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
> 
>  
> 
> Kind regards,
> Evgeny Astigeevich
> Senior Compiler Engineer
> Compilation Tools
> ARM
> 
>  
> 
> From: Sanjay Patel [mailto:spatel at rotateright.com <mailto:spatel at rotateright.com>] 
> Sent: Friday, January 20, 2017 6:16 PM
> To: Evgeny Astigeevich
> Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com <mailto:t.p.northover at gmail.com>; hfinkel at anl.gov <mailto:hfinkel at anl.gov>
> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
> 
>  
> 
> Thanks for letting me know about this problem!
> 
> There's no 'shl nsw' visible in the earlier (r292487) code, so it would be better to see exactly what the IR looks like before that added transform fires.
> 
> 
> But I see a red flag:
>   %smax = select i1 %11, i64 %10, i64 8193
> 
> The new icmp transform allowed us to create an smax, but we have this hack in InstCombiner::visitICmpInst():
> 
>   // Test if the ICmpInst instruction is used exclusively by a select as
>   // part of a minimum or maximum operation. If so, refrain from doing
>   // any other folding. This helps out other analyses which understand
>   // non-obfuscated minimum and maximum idioms, such as ScalarEvolution
>   // and CodeGen. And in this case, at least one of the comparison
>   // operands has at least one user besides the compare (the select),
>   // which would often largely negate the benefit of folding anyway.
> 
> ...so that prevented folding the icmp into the earlier math.
> 
> I am actively working on trying to get rid of that bail-out by improving min/max value tracking and icmp/select folding. In fact, we might be able to remove it right now, but I don't know the history of that code or what cases it was supposed to help.
> 
>  
> 
> If it's possible, can you remove that check locally, rebuild, and try the benchmark again on your system? I'd love to know if that change alone would solve the problem.
> 
>  
> 
>  
> 
> On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com <mailto:Evgeny.Astigeevich at arm.com>> wrote:
> 
> Hi,
> 
> We found that today's 17.30%/11.37% performance regressions in LNT SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64 and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7 (http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen <http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen>) are caused by changes [rL292492] in InstCombine:
> 
> https://reviews.llvm.org/D28406 <https://reviews.llvm.org/D28406> "[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1"
> 
> The Loop Vectorizer generates code with more instructions:
> 
> ==== Loop Vectorizer from rL292492  ====
> for.body5:                                        ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ]
>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ]
>   %2 = add i64 %indvar, 2
>   %3 = shl i64 %indvar, 1
>   %4 = add i64 %3, 4
>   %5 = add i64 %indvar, 2
>   %6 = shl i64 %indvar, 1
>   %7 = add i64 %6, 4
>   %8 = add i64 %indvar, 2
>   %9 = mul i64 %indvar, 3
>   %10 = add i64 %9, 6
>   %11 = icmp sgt i64 %10, 8193
>   %smax = select i1 %11, i64 %10, i64 8193
>   %12 = mul i64 %indvar, -2
>   %13 = add i64 %12, -5
>   %14 = add i64 %smax, %13
>   %15 = add i64 %indvar, 2
>   %16 = udiv i64 %14, %15
>   %17 = add i64 %16, 1
>   %tobool7 = icmp eq i8 %1, 0
>   br i1 %tobool7, label %for.inc16, label %if.then
> ================================
> 
> The code generated by the Loop Vectorizer before the changes:
> 
> ==== Loop Vectorizer from rL292487 ====
> for.body5:                                        ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ]
>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ]
>   %2 = add i64 %indvar, 2
>   %3 = shl i64 %indvar, 1
>   %4 = add i64 %3, 4
>   %5 = add i64 %indvar, 2
>   %6 = shl i64 %indvar, 1
>   %7 = add i64 %6, 4
>   %8 = add i64 %indvar, 2
>   %9 = mul i64 %indvar, -2
>   %10 = add i64 %9, 8188
>   %11 = add i64 %indvar, 2
>   %12 = udiv i64 %10, %11
>   %13 = add i64 %12, 1
>   %tobool7 = icmp eq i8 %1, 0
>   br i1 %tobool7, label %for.inc16, label %if.then
> ================================
> 
> I have not investigated yet why the behaviour of the Vectorizer is changed.
> 
> Kind regards,
> Evgeny Astigeevich
> Senior Compiler Engineer
> Compilation Tools
> ARM
> 
>  
> 
>  
> 
>  
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170123/556d49f2/attachment.html>