[PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.

Evgeny Astigeevich via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 4 11:16:10 PDT 2017


I compared the generated code with and without this check:


--- check is ON ---

@ BB#0:                                 @ %entry
        cmp     r2, #0
        moveq   pc, lr
        add     r0, r0, r2, lsl #2
        mov     r12, #0
        mov     r3, #0
.LBB1_1:                                @ %bb
                                        @ =>This Inner Loop Header: Depth=1
        str     r1, [r0, -r3, lsl #2]
        str     r3, [r12]
        add     r3, r3, #1
        cmp     r2, r3
        bne     .LBB1_1

--- check is OFF ---

@ BB#0:                                 @ %entry
        cmp     r2, #0
        moveq   pc, lr
        mov     r12, #0
        mov     r3, #0
.LBB1_1:                                @ %bb
                                        @ =>This Inner Loop Header: Depth=1
        str     r1, [r0, r2, lsl #2]
        subs    r2, r2, #1
        str     r3, [r12]
        add     r3, r3, #1
        bne     .LBB1_1

The main difference in the 'check is OFF' version is that two registers are needed to keep the loop values. Note that the '-Rm' negative register offset form cannot be used in Thumb code, so it does not work for ARMv6-M.
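For reference, a loop of roughly the following shape (a hypothetical reconstruction for illustration, not the actual reduced test source) is the kind of code where the negative register offset form pays off: the store address can be formed as base + n*4 minus a scaled up-counter, which is exactly the "str r1, [r0, -r3, lsl #2]" pattern above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical reconstruction of the kind of loop being compiled here:
// a value is stored through a pointer walked with a negative stride
// while a second counter increments. The address base + n*4 - i*4 maps
// directly onto the "-r3, lsl #2" negative register offset form.
void storeWithNegativeStride(int *base, int val, unsigned n) {
  for (unsigned i = 0; i < n; ++i)
    base[n - 1 - i] = val; // address = base + n*4 - (i+1)*4
}
```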


Thanks,

Evgeny


________________________________
From: Evgeny Astigeevich
Sent: Friday, August 4, 2017 6:42:53 PM
To: Wei Mi; Quentin Colombet
Cc: reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny Stupachenko; Xinliang David Li; Sanjoy Das; Matthias Braun; Dehao Chen; llvm-commits; Michael Zolotukhin; nd
Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.


Hi Wei,


The fix causes a failure in CodeGen/ARM/arm-negative-stride.ll. The check was introduced in r35859 (10 years ago):

------------------------------------------------------------------------
r35859 | lattner | 2007-04-10 04:48:29 +0100 (Tue, 10 Apr 2007) | 2 lines

restore support for negative strides

------------------------------------------------------------------------

The test uses the triple arm-eabi, so code is generated for ARM7TDMI (ARMv4). Maybe we need a separate case for the M-profile. I'll try to figure it out.

Thanks,
Evgeny Astigeevich
The ARM Compiler Optimization team leader

________________________________
From: Wei Mi <wmi at google.com>
Sent: Friday, August 4, 2017 5:23:47 PM
To: Quentin Colombet
Cc: reviews+D34583+public+14017146d8617607 at reviews.llvm.org; Evgeny Stupachenko; Xinliang David Li; Sanjoy Das; Evgeny Astigeevich; Matthias Braun; Dehao Chen; llvm-commits; Michael Zolotukhin
Subject: Re: [PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.

On Thu, Aug 3, 2017 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>> On Aug 3, 2017, at 4:46 PM, Wei Mi <wmi at google.com> wrote:
>>
>> I found isAMCompletelyFolded return true for formula like
>> reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) +
>> imm(4) on ARM. Does it make sense?
>
>
> That sounds wrong.

With the following fix, the regression in test.ll is fixed, but I am
not familiar with ARM, so I am not sure whether allowing a negative
scale has any other use. Quentin, does the fix look correct to you?

Index: lib/Target/ARM/ARMISelLowering.cpp
===================================================================
--- lib/Target/ARM/ARMISelLowering.cpp (revision 309240)
+++ lib/Target/ARM/ARMISelLowering.cpp (working copy)
@@ -12411,7 +12411,6 @@ bool ARMTargetLowering::isLegalAddressin
     case MVT::i1:
     case MVT::i8:
     case MVT::i32:
-      if (Scale < 0) Scale = -Scale;
       if (Scale == 1)
         return true;
       // r + r << imm
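The effect of removing that line can be sketched with a standalone helper (a hypothetical simplification for illustration, not the real ARMTargetLowering API): before the fix a negative scale was silently folded to its absolute value, so a formula like -1*reg(...) was reported as foldable into the addressing mode; after the fix it is rejected.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone sketch of the i8/i32 scale check in
// isLegalAddressingMode. With `withFix == false` the removed line
// `if (Scale < 0) Scale = -Scale;` is emulated, so a negative scale
// is treated like its absolute value.
bool isLegalScaleSketch(int64_t Scale, bool withFix) {
  if (!withFix && Scale < 0)
    Scale = -Scale;  // old behavior: fold -S to S before checking
  if (Scale < 0)
    return false;    // new behavior: negative scales are not legal
  if (Scale == 1)
    return true;     // r + r
  // r + r << imm: only small power-of-two scales encode in the
  // shifted-register operand form (illustrative subset).
  return Scale == 2 || Scale == 4 || Scale == 8;
}
```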



>
>> The fact is we cannot fold it into
>> load/store and it leads to the additional instructions causing
>> performance regressions.
>>
>>
>> On Thu, Aug 3, 2017 at 11:12 AM, Wei Mi via Phabricator
>> <reviews at reviews.llvm.org> wrote:
>>> wmi added a comment.
>>>
>>> In https://reviews.llvm.org/D34583#830286, @eastig wrote:
>>>
>>>> F4171725: test.ll <https://reviews.llvm.org/F4171725>
>>>>
>>>> F4171724: test.good.ll <https://reviews.llvm.org/F4171724>
>>>>
>>>> F4171723: test.bad.ll <https://reviews.llvm.org/F4171723>
>>>>
>>>> This patch caused regressions from 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result with filtering disabled; 'test.bad.ll' is the result with filtering enabled.
>>>> Comparing them, I can see that this optimization changes how the induction variable is updated. Originally it is incremented from 0 to 256; the optimization changes this into decrementing from 0 to -256. The induction variable is also used as an offset into memory, so to preserve the semantics a conversion of the induction variable from a negative to a positive value is inserted. This is lowered to additional instructions, which causes the performance regressions.
>>>>
>>>> Could you please have a look at this issue?
>>>>
>>>> Thanks,
>>>> Evgeny Astigeevich
>>>> The ARM Compiler Optimization team leader
>>>
>>>
>>> Hi Evgeny,
>>>
>>> Thanks for providing the testcase.
>>>
>>> It looks like an existing issue in the LSR cost evaluation, exposed by the patch. Comparing the traces with -debug-only=loop-reduce, all the candidates chosen by LSR without filtering are still kept in the candidate set after the filtering patch is applied. However, the filtering patch provides some additional candidates for the LSR cost model to choose from, and LSR chooses a different set of candidates in the final result which it thinks is better (one fewer base add) but actually is not. We can see that in the trace:
>>>
>>> LSR without the filtering patch:
>>> The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:
>>>
>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>    reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>    reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>    reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>    reg(%c0.0103.us.i.i)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>    reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)
>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>    reg(%c1.0104.us.i.i)
>>>
>>> LSR with the filtering patch:
>>> The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:
>>>
>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>    reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)
>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>    reg(%c0.0103.us.i.i)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>    reg(%c1.0104.us.i.i)
>>>
>>> The real problem is that LSR has no idea about the cost of materializing a negative value. It thinks 4*reg({0,+,-1}) and -4*reg({0,+,-1}) have the same cost.
>>>
>>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>
>>> I will think about how to fix it.
>>>
>>> Wei.
>>>
>>>
>>> Repository:
>>>  rL LLVM
>>>
>>> https://reviews.llvm.org/D34583
>>>
>>>
>>>
>
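One possible direction for the fix Wei mentions is to make the cost model penalize formulae with a negative scale on targets where the negated register operand is unavailable. A minimal hypothetical sketch (this is not LSR's actual Cost structure or API, just an illustration of the idea):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: charge an extra cost unit for a negative scale,
// since materializing the negated value needs an extra instruction on
// targets (e.g. Thumb) that lack a '-Rm' negative register operand.
struct FormulaSketch {
  int64_t Scale; // e.g. 4 for 4*reg(...), -4 for -4*reg(...)
};

unsigned scaleCost(const FormulaSketch &F) {
  // 4*reg({0,+,-1}) folds into the addressing mode for free;
  // -4*reg({0,+,-1}) would need an extra negation.
  return F.Scale < 0 ? 1 : 0;
}
```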