[PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 8 09:16:42 PDT 2017


Evgeny, thanks for confirming the issue and submitting the patch!

Wei.

On Tue, Aug 8, 2017 at 7:20 AM, Evgeny Astigeevich
<Evgeny.Astigeevich at arm.com> wrote:
> I submitted https://reviews.llvm.org/D36467 which fixes the issue.
>
> I tested it, and the performance is back. We even get an additional ~6%
> improvement on the benchmarks. We also found another LSR-related issue caused
> by this bug in ARMTargetLowering::isLegalAddressingMode; the patch fixed it
> as well.
>
>
> Thank you, Wei, for identifying the root cause.
>
>
> -Evgeny
>
> ________________________________
> From: Evgeny Astigeevich
> Sent: Monday, August 7, 2017 6:18:08 PM
>
> To: Wei Mi
> Cc: Quentin Colombet;
> reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny Stupachenko;
> Xinliang David Li; Sanjoy Das; Matthias Braun; Dehao Chen; llvm-commits;
> Michael Zolotukhin; nd
> Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering
> non-optimal formulae with the same ScaledReg and Scale.
>
>
> I've created a bug report: https://bugs.llvm.org/show_bug.cgi?id=34106
>
>
> The fix is in progress.
>
>
> Thanks,
>
> Evgeny
>
> ________________________________
> From: Evgeny Astigeevich
> Sent: Monday, August 7, 2017 1:27:15 PM
> To: Wei Mi
> Cc: Quentin Colombet;
> reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny Stupachenko;
> Xinliang David Li; Sanjoy Das; Matthias Braun; Dehao Chen; llvm-commits;
> Michael Zolotukhin; nd
> Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering
> non-optimal formulae with the same ScaledReg and Scale.
>
>
> Hi Wei,
>
>
> Thank you for the patch.
>
> I think you have spotted a bug in ARMTargetLowering::isLegalAddressingMode.
>
> Thumb1 addressing modes do not support scaling, so false must be returned if
> AM.Scale is not zero. For Thumb1, however, ARMTargetLowering::isLegalAddressingMode
> currently returns false only when Scale equals 1 and returns true for other values.
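>
> In code form, the rule described above is roughly the following guard in
> ARMTargetLowering::isLegalAddressingMode. This is a sketch of the idea only,
> not the fix being tested here (which was later posted as
> https://reviews.llvm.org/D36467, see the top of this thread):
>
>       // Sketch: Thumb1 has no scaled-register addressing modes, so reject
>       // any non-zero AM.Scale before the generic Scale < 0 normalization
>       // (quoted further down in this thread) is reached.
>       if (Subtarget->isThumb1Only() && AM.Scale != 0)
>         return false;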
>
> I am testing a fix for the issue. I'll also raise a bug report about it.
>
>
> Thanks,
>
> Evgeny
>
> ________________________________
> From: Wei Mi <wmi at google.com>
> Sent: Friday, August 4, 2017 7:27:47 PM
> To: Evgeny Astigeevich
> Cc: Quentin Colombet;
> reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny Stupachenko;
> Xinliang David Li; Sanjoy Das; Matthias Braun; Dehao Chen; llvm-commits;
> Michael Zolotukhin; nd
> Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering
> non-optimal formulae with the same ScaledReg and Scale.
>
> On Fri, Aug 4, 2017 at 11:16 AM, Evgeny Astigeevich
> <Evgeny.Astigeevich at arm.com> wrote:
>> I compared the generated code with and without this check:
>>
>>
>> --- check is ON ---
>>
>> @ BB#0:                                 @ %entry
>>         cmp     r2, #0
>>         moveq   pc, lr
>>         add     r0, r0, r2, lsl #2
>>         mov     r12, #0
>>         mov     r3, #0
>> .LBB1_1:                                @ %bb
>>                                         @ =>This Inner Loop Header: Depth=1
>>         str     r1, [r0, -r3, lsl #2]
>>         str     r3, [r12]
>>         add     r3, r3, #1
>>         cmp     r2, r3
>>         bne     .LBB1_1
>>
>> --- check is OFF ---
>>
>> @ BB#0:                                 @ %entry
>>         cmp     r2, #0
>>         moveq   pc, lr
>>         mov     r12, #0
>>         mov     r3, #0
>> .LBB1_1:                                @ %bb
>>                                         @ =>This Inner Loop Header: Depth=1
>>         str     r1, [r0, r2, lsl #2]
>>         subs    r2, r2, #1
>>         str     r3, [r12]
>>         add     r3, r3, #1
>>         bne     .LBB1_1
>>
>> The main difference in the 'check is OFF' version is that we need two
>> registers to keep the values. As '-Rm' cannot be used in Thumb code, that
>> form does not work for v6m.
>>
>>
>> Thanks,
>>
>> Evgeny
>
> Thanks, Evgeny. So can we add a subtarget check around this Scale < 0 check?
>
> Index: lib/Target/ARM/ARMISelLowering.cpp
> ===================================================================
> --- lib/Target/ARM/ARMISelLowering.cpp (revision 309240)
> +++ lib/Target/ARM/ARMISelLowering.cpp (working copy)
> @@ -12411,7 +12411,7 @@ bool ARMTargetLowering::isLegalAddressin
>      case MVT::i1:
>      case MVT::i8:
>      case MVT::i32:
> -      if (Scale < 0) Scale = -Scale;
> +      if (!Subtarget->isThumb() && Scale < 0) Scale = -Scale;
>        if (Scale == 1)
>          return true;
>        // r + r << imm
>
> Wei.
>
>>
>>
>> ________________________________
>> From: Evgeny Astigeevich
>> Sent: Friday, August 4, 2017 6:42:53 PM
>> To: Wei Mi; Quentin Colombet
>> Cc: reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny
>> Stupachenko; Xinliang David Li; Sanjoy Das; Matthias Braun; Dehao Chen;
>> llvm-commits; Michael Zolotukhin; nd
>>
>> Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering
>> non-optimal formulae with the same ScaledReg and Scale.
>>
>>
>> Hi Wei,
>>
>>
>> The fix will cause a failure of CodeGen/ARM/arm-negative-stride.ll. The
>> check was introduced in r35859 (10 years ago):
>>
>>
>> ------------------------------------------------------------------------
>> r35859 | lattner | 2007-04-10 04:48:29 +0100 (Tue, 10 Apr 2007) | 2 lines
>>
>> restore support for negative strides
>>
>> ------------------------------------------------------------------------
>>
>> The test uses the triple arm-eabi, so code is generated for arm7tdmi
>> (armv4).
>> Maybe we need a case for the M-profile. I'll try to figure it out.
>>
>> Thanks,
>> Evgeny Astigeevich
>> The ARM Compiler Optimization team leader
>>
>> ________________________________
>> From: Wei Mi <wmi at google.com>
>> Sent: Friday, August 4, 2017 5:23:47 PM
>> To: Quentin Colombet
>> Cc: reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny
>> Stupachenko; Xinliang David Li; Sanjoy Das; Evgeny Astigeevich; Matthias
>> Braun; Dehao Chen; llvm-commits; Michael Zolotukhin
>> Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering
>> non-optimal formulae with the same ScaledReg and Scale.
>>
>> On Thu, Aug 3, 2017 at 6:08 PM, Quentin Colombet <qcolombet at apple.com>
>> wrote:
>>>
>>>> On Aug 3, 2017, at 4:46 PM, Wei Mi <wmi at google.com> wrote:
>>>>
>>>> I found that isAMCompletelyFolded returns true on ARM for a formula like
>>>> reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) +
>>>> imm(4). Does that make sense?
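>>>>
>>>> For reference, that formula boils down to roughly the following
>>>> addressing-mode query (a sketch, with the field values read off the
>>>> formula text and TLI, DL and Int32Ty as placeholders; it is not a
>>>> verbatim excerpt from LSR):
>>>>
>>>>   TargetLowering::AddrMode AM;
>>>>   AM.BaseGV     = nullptr;
>>>>   AM.BaseOffs   = 4;     // the imm(4) part
>>>>   AM.HasBaseReg = true;  // reg(%c1.0104.us.i.i)
>>>>   AM.Scale      = -1;    // the -1*reg({0,+,-4}<nsw>...) part
>>>>   // On ARM this is answered by ARMTargetLowering::isLegalAddressingMode,
>>>>   // which contains the Scale < 0 handling shown in the patch below.
>>>>   bool Folds = TLI.isLegalAddressingMode(DL, AM, Int32Ty, /*AddrSpace=*/0);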
>>>
>>>
>>> That sounds wrong.
>>
>> With the following fix, the regression in test.ll goes away, but I
>> am not familiar with ARM, so I am not sure whether allowing a negative
>> scale has any other use. Quentin, does the fix look correct to you?
>>
>> Index: lib/Target/ARM/ARMISelLowering.cpp
>> ===================================================================
>> --- lib/Target/ARM/ARMISelLowering.cpp (revision 309240)
>> +++ lib/Target/ARM/ARMISelLowering.cpp (working copy)
>> @@ -12411,7 +12411,6 @@ bool ARMTargetLowering::isLegalAddressin
>>      case MVT::i1:
>>      case MVT::i8:
>>      case MVT::i32:
>> -      if (Scale < 0) Scale = -Scale;
>>        if (Scale == 1)
>>          return true;
>>        // r + r << imm
>>
>>
>>
>>>
>>>> The fact is we cannot fold it into a
>>>> load/store, and this leads to the additional instructions that cause the
>>>> performance regressions.
>>>>
>>>>
>>>> On Thu, Aug 3, 2017 at 11:12 AM, Wei Mi via Phabricator
>>>> <reviews at reviews.llvm.org> wrote:
>>>>> wmi added a comment.
>>>>>
>>>>> In https://reviews.llvm.org/D34583#830286, @eastig wrote:
>>>>>
>>>>>> F4171725: test.ll <https://reviews.llvm.org/F4171725>
>>>>>>
>>>>>> F4171724: test.good.ll <https://reviews.llvm.org/F4171724>
>>>>>>
>>>>>> F4171723: test.bad.ll <https://reviews.llvm.org/F4171723>
>>>>>>
>>>>>> This patch caused regressions from 5% to 23% in two of our internal
>>>>>> benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced
>>>>>> from the benchmarks; I used LLVM revision 309830. 'test.good.ll' is the result
>>>>>> when filtering is disabled, and 'test.bad.ll' is the result when filtering is
>>>>>> enabled.
>>>>>> Comparing them, I can see that this optimization changes how an
>>>>>> induction variable is updated. Originally it is incremented from 0 to 256;
>>>>>> the optimization changes this into decrementing from 0 to -256. This
>>>>>> induction variable is also used as a memory offset, so to preserve the
>>>>>> semantics a conversion of the induction variable from a negative value to a
>>>>>> positive value is inserted. This is lowered to additional instructions, which
>>>>>> causes the performance regressions.
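>>>>>>
>>>>>> In source terms the change is roughly the following (a hand-written
>>>>>> illustration of the description above, with p and v as placeholders; it
>>>>>> is not taken from test.ll):
>>>>>>
>>>>>>   // Before: the induction variable counts up from 0 to 256 and is used
>>>>>>   // directly as the memory offset.
>>>>>>   for (int i = 0; i != 256; ++i)
>>>>>>     p[i] = v;
>>>>>>
>>>>>>   // After: it counts down from 0 to -256, so each access has to negate
>>>>>>   // it first, which lowers to additional instructions.
>>>>>>   for (int i = 0; i != -256; --i)
>>>>>>     p[-i] = v;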
>>>>>>
>>>>>> Could you please have a look at this issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Evgeny Astigeevich
>>>>>> The ARM Compiler Optimization team leader
>>>>>
>>>>>
>>>>> Hi Evgeny,
>>>>>
>>>>> Thanks for providing the testcase.
>>>>>
>>>>> It looks like an existing issue in LSR cost evaluation that is exposed by
>>>>> the patch. Comparing the traces obtained with -debug-only=loop-reduce, all
>>>>> the candidates chosen by LSR without filtering are still in the candidate
>>>>> set after the filtering patch is added. However, the filtering patch
>>>>> provides some additional candidates that are interesting to the LSR cost
>>>>> model, and LSR chooses a different set of candidates in the final result,
>>>>> one it thinks is better (one less base add) but actually is not. We can see
>>>>> that in the traces:
>>>>>
>>>>> LSR without the filtering patch:
>>>>> The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:
>>>>>
>>>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>>>    reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>    reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>>>    reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>    reg(%c0.0103.us.i.i)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>    reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)
>>>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>    reg(%c1.0104.us.i.i)
>>>>>
>>>>> LSR with the filtering patch:
>>>>> The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:
>>>>>
>>>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>>>    reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)
>>>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>    reg(%c0.0103.us.i.i)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>    reg(%c1.0104.us.i.i)
>>>>>
>>>>> The real problem is that LSR has no idea about the cost of obtaining a
>>>>> negative value. It thinks 4*reg({0,+,-1}) and -4*reg({0,+,-1}) have the
>>>>> same cost, as the two uses below show (see also the sketch after them):
>>>>>
>>>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
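>>>>>
>>>>> With the Scale < 0 normalization shown in the patches earlier in this
>>>>> thread, the legality hook cannot even see the sign. A minimal sketch of
>>>>> that effect (hypothetical helper, not the LLVM source):
>>>>>
>>>>>   static bool scaleLooksLegalForI32(int Scale) {
>>>>>     if (Scale < 0) Scale = -Scale;   // -4 is folded to 4 here
>>>>>     if (Scale == 1)
>>>>>       return true;
>>>>>     // Stands in for the "r + r << imm" check that follows in the real code.
>>>>>     return (Scale & (Scale - 1)) == 0;
>>>>>   }
>>>>>   // scaleLooksLegalForI32(4) == scaleLooksLegalForI32(-4) == true, so the
>>>>>   // two formulae are rated identically even though the negative form needs
>>>>>   // extra instructions on this target.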
>>>>>
>>>>> I will think about how to fix it.
>>>>>
>>>>> Wei.
>>>>>
>>>>>
>>>>> Repository:
>>>>>  rL LLVM
>>>>>
>>>>> https://reviews.llvm.org/D34583
>>>>>
>>>>>
>>>>>
>>>
