[llvm-dev] [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines

Sanjay Patel via llvm-dev llvm-dev at lists.llvm.org
Tue Jan 24 08:55:07 PST 2017


On Tue, Jan 24, 2017 at 9:24 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Jan 24, 2017, at 7:18 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
>
>
> On Mon, Jan 23, 2017 at 10:53 PM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>>
>> On Jan 23, 2017, at 3:48 PM, Sanjay Patel via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> All targets are likely affected in some way by the icmp+shl fold
>> introduced with r292492. It's a basic pattern that occurs in lots of code.
>> Did you see any perf wins on your targets with this commit?
>>
>> Sadly, it is also likely that many (all?) targets are negatively impacted
>> on the particular test (SingleSource/Benchmarks/Shootout/sieve) that you
>> have pointed out here because the IR is now decidedly worse.
>>
>> IMO, we should not revert the commit just because it exposed shortcomings
>> in the optimizer. It's an "obvious" fold/canonicalization, and the related
>> 'nuw' variant of this fold has existed in trunk since:
>> https://reviews.llvm.org/rL285729
>>
>> We need to dissect what analysis/folds are missing to restore the IR to
>> the better form that existed before, but this is probably going to be a
>> long process because we treat min/max like an optimization fence.
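
For concreteness, here is a minimal sketch of the r292492 fold, hand-reduced
rather than taken from the sieve IR (the function name is made up); running
something like this through "opt -instcombine -S" should fold the shift into
the compare constant:

  ; before: compare a left-shifted value against a constant
  define i1 @shl_nsw_cmp(i64 %x) {
    %shl = shl nsw i64 %x, 1
    %cmp = icmp sgt i64 %shl, 36
    ret i1 %cmp
  }

  ; after InstCombine (icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1):
  define i1 @shl_nsw_cmp(i64 %x) {
    %cmp = icmp sgt i64 %x, 18    ; 36 >> 1 == 18
    ret i1 %cmp
  }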
>>
>>
>> If this is gonna be a long process to recover, this looks like something
>> to be reverted in the 4.0 branch (unless I missed that there is a
>> correctness fix involved?).
>>
>>
> Nope - this is just about perf, not correctness. Of course, the intent was
> that this transform should only improve perf, so I wonder if we can pin any
> other perf changes on this commit.
>
> I'm new to using the LNT site, but this should be the full set of results
> for the A53 machine in question with a baseline (r292491) before this patch
> and current (r292522) :
> http://llvm.org/perf/db_default/v4/nts/107364
>
> If these are reliable results, we have 2 perf wins (puzzle, gramschmidt)
> on the A53 machine. How do we determine the importance of the sieve
> benchmark vs. the rest of the suite?
>
> An x86 machine doesn't show any regressions from this change:
> http://llvm.org/perf/db_default/v4/nts/107353
>
> Are there target-scope-based guidelines for when something is bad enough
> to revert?
>
>
> I don’t think we have any guidelines.
>
> I think my suggestion was more about other regressions that we would
> discover after the release; it was more of a "maturity" call: if we only
> notice a problem with the commit right before the release, it may not have
> been in tree long enough to get enough scrutiny.
>

That makes sense. I have no stake in any particular branch, so I have no
objection to reverting from the release branch if that's what people would
like to do. My preference is to keep it in trunk though, because it should
be a win in theory, and reverting there would make it harder to find and
debug problems like this one.





>
> (Also, I thought this thread included a compile-time regression, but on
> re-reading it doesn't.)
>
>> Mehdi
>
>
>
> Also, we've absolutely destroyed perf (-48%) on the sieve benchmark on
> that A53 target since the baseline (r256803). So there are multiple things
> to fix before we can truly recover?
>
> Regardless of whether we revert or not, I am looking at how to claw back
> the IR from the r292492 regression. Here's one step towards that:
> https://reviews.llvm.org/D29053
>
> If we get lucky, we may be able to sidestep the min/max problem by folding
> harder before we reach that point in the optimization pipeline.
>
>
>
>
>>>> Mehdi
>>
>> On Mon, Jan 23, 2017 at 11:13 AM, Evgeny Astigeevich <Evgeny.Astigeevich at arm.com> wrote:
>>
>>> Confirmed: there is no change in the IR if the hack is disabled in the sources.
>>>
>>> David wrote that these instructions are created by SCEV.
>>>
>>> Are other targets affected by the changes, e.g. X86?
>>>
>>>
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>>
>>>
>>> From: Sanjay Patel [mailto:spatel at rotateright.com]
>>> Sent: Sunday, January 22, 2017 10:45 PM
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; nd
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>>
>>>
>>>
>>> I tried an experiment to remove the integer min/max bailouts from
>>> InstCombine, and it doesn't appear to change the IR in the attachment, so I
>>> doubt there's going to be any improvement.
>>>
>>> If I haven't messed up this example, this is amazing:
>>> https://godbolt.org/g/yzoxeY
>>>
>>>
>>>
>>> On Sun, Jan 22, 2017 at 1:06 PM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Thank you for information.
>>>
>>> I’ll build clang without the hack and re-run the benchmark tomorrow.
>>>
>>>
>>>
>>> -Evgeny
>>>
>>>
>>>
>>> From: Sanjay Patel [mailto:spatel at rotateright.com]
>>> Sent: Sunday, January 22, 2017 8:00 PM
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; nd
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>>
>>>
>>>
>>> > Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>>>
>>>
>>>
>>> Yes. Although (this just came up in D28625 too) we might need to remove
>>> multiple versions of that in order to unlock optimization:
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4338
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L470
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstructionCombining.cpp#L803
>>>
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp#L409
>>>
>>>
>>> Similar for FP:
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCompares.cpp#L4780
>>> https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineCasts.cpp#L1376
>>>
>>>
>>>
>>> On Sun, Jan 22, 2017 at 12:40 PM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Hi Sanjay,
>>>
>>>
>>>
>>> The benchmark source file: http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Shootout/sieve.c?view=markup
>>>
>>> Clang options used to produce the initial IR: clang -DNDEBUG -O3 -DNDEBUG -mcpu=cortex-a53 -fomit-frame-pointer -O3 -DNDEBUG -w -Werror=date-time -c sieve.c -S -emit-llvm -mllvm -disable-llvm-optzns --target=aarch64-arm-linux
>>>
>>> Opt options: opt -O3 -o /dev/null -print-before-all -print-after-all sieve.ll >& sieve.log
>>>
>>>
>>>
>>> I used the IR (in attached sieve.zip) created with the r292487 version.
>>>
>>> The attached sieve contains the output of ‘-print-before-all
>>> -print-after-all’  for r292487 and rL292492.
>>>
>>>
>>>
>>> > If it's possible, can you remove that check locally, rebuild,
>>>
>>> > and try the benchmark again on your system? I'd love to know
>>>
>>> > if that change alone would solve the problem.
>>>
>>>
>>>
>>> Do you mean to remove the hack in InstCombiner::visitICmpInst()?
>>>
>>>
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>>
>>>
>>> From: Sanjay Patel [mailto:spatel at rotateright.com]
>>> Sent: Friday, January 20, 2017 6:16 PM
>>> To: Evgeny Astigeevich
>>> Cc: llvm-dev; Renato Golin; t.p.northover at gmail.com; hfinkel at anl.gov
>>> Subject: Re: [InstCombine] rL292492 affected LoopVectorizer and caused 17.30%/11.37% perf regressions on Cortex-A53/Cortex-A15 LNT machines
>>>
>>>
>>>
>>> Thanks for letting me know about this problem!
>>>
>>> There's no 'shl nsw' visible in the earlier (r292487) code, so it would
>>> be better to see exactly what the IR looks like before that added transform
>>> fires.
>>>
>>>
>>> But I see a red flag:
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>>
>>> The new icmp transform allowed us to create an smax, but we have this
>>> hack in InstCombiner::visitICmpInst():
>>>
>>>   // Test if the ICmpInst instruction is used exclusively by a select as
>>>   // part of a minimum or maximum operation. If so, refrain from doing
>>>   // any other folding. This helps out other analyses which understand
>>>   // non-obfuscated minimum and maximum idioms, such as ScalarEvolution
>>>   // and CodeGen. And in this case, at least one of the comparison
>>>   // operands has at least one user besides the compare (the select),
>>>   // which would often largely negate the benefit of folding anyway.
>>>
>>> ...so that prevented folding the icmp into the earlier math.
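
For reference, the shape that the bail-out is trying to keep recognizable is
the plain select-of-icmp min/max idiom (a hand-written sketch, not taken from
the sieve IR):

  define i64 @smax_idiom(i64 %a, i64 %b) {
    %cmp = icmp sgt i64 %a, %b
    %max = select i1 %cmp, i64 %a, i64 %b   ; smax(%a, %b)
    ret i64 %max
  }

If the icmp gets folded into the surrounding math, the select's condition no
longer compares the select's own operands, and the smax pattern match in
analyses like ScalarEvolution and CodeGen can fail.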
>>>
>>> I am actively working on trying to get rid of that bail-out by improving
>>> min/max value tracking and icmp/select folding. In fact, we might be able
>>> to remove it right now, but I don't know the history of that code or what
>>> cases it was supposed to help.
>>>
>>>
>>>
>>> If it's possible, can you remove that check locally, rebuild, and try
>>> the benchmark again on your system? I'd love to know if that change alone
>>> would solve the problem.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:11 AM, Evgeny Astigeevich <
>>> Evgeny.Astigeevich at arm.com> wrote:
>>>
>>> Hi,
>>>
>>> We found that today's 17.30%/11.37% performance regressions in LNT
>>> SingleSource/Benchmarks/Shootout/sieve on LNT-AArch64-A53-O3__clang_DEV__aarch64
>>> and LNT-Thumb2v7-A15-O3__clang_DEV__thumbv7
>>> (http://llvm.org/perf/db_default/v4/nts/daily_report/2017/1/20?filter-machine-regex=aarch64%7Carm%7Cthumb%7Cgreen)
>>> are caused by the changes in [rL292492] to InstCombine:
>>>
>>> https://reviews.llvm.org/D28406 "[InstCombine] icmp sgt (shl nsw X, C1), C0 --> icmp sgt X, C0 >> C1"
>>>
>>> The Loop Vectorizer generates code with more instructions:
>>>
>>> ==== Loop Vectorizer from rL292492  ====
>>> for.body5:                                        ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, 3
>>>   %10 = add i64 %9, 6
>>>   %11 = icmp sgt i64 %10, 8193
>>>   %smax = select i1 %11, i64 %10, i64 8193
>>>   %12 = mul i64 %indvar, -2
>>>   %13 = add i64 %12, -5
>>>   %14 = add i64 %smax, %13
>>>   %15 = add i64 %indvar, 2
>>>   %16 = udiv i64 %14, %15
>>>   %17 = add i64 %16, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ================================
>>>
>>> The code generated by the Loop Vectorizer before the changes:
>>>
>>> ==== Loop Vectorizer from rL292487 ====
>>> for.body5:                                        ; preds = %for.inc16.for.body5_crit_edge, %for.cond.preheader
>>>   %indvar = phi i64 [ %indvar.next, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>>>   %1 = phi i8 [ %.pre, %for.inc16.for.body5_crit_edge ], [ 1, %for.cond.preheader ]
>>>   %count.122 = phi i32 [ %count.2, %for.inc16.for.body5_crit_edge ], [ 0, %for.cond.preheader ]
>>>   %i.119 = phi i64 [ %inc17, %for.inc16.for.body5_crit_edge ], [ 2, %for.cond.preheader ]
>>>   %2 = add i64 %indvar, 2
>>>   %3 = shl i64 %indvar, 1
>>>   %4 = add i64 %3, 4
>>>   %5 = add i64 %indvar, 2
>>>   %6 = shl i64 %indvar, 1
>>>   %7 = add i64 %6, 4
>>>   %8 = add i64 %indvar, 2
>>>   %9 = mul i64 %indvar, -2
>>>   %10 = add i64 %9, 8188
>>>   %11 = add i64 %indvar, 2
>>>   %12 = udiv i64 %10, %11
>>>   %13 = add i64 %12, 1
>>>   %tobool7 = icmp eq i8 %1, 0
>>>   br i1 %tobool7, label %for.inc16, label %if.then
>>> ================================
>>>
>>> I have not yet investigated why the behaviour of the Vectorizer changed.
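
(Reading the two dumps above algebraically, if I have it right, the extra
instructions are the trip-count expression, which David noted comes from SCEV,
now carrying an smax:

  r292492:  ((smax(3*indvar + 6, 8193) - 2*indvar - 5) /u (indvar + 2)) + 1
  r292487:  ((8188 - 2*indvar) /u (indvar + 2)) + 1

The two agree whenever 3*indvar + 6 <= 8193, since then the smax evaluates to
8193 and 8193 - 2*indvar - 5 == 8188 - 2*indvar. So the difference appears to
be that SCEV can no longer prove the smax away, not that the loop bound itself
changed.)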
>>>
>>> Kind regards,
>>> Evgeny Astigeevich
>>> Senior Compiler Engineer
>>> Compilation Tools
>>> ARM
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>