[PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.

Wei Mi via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 4 11:00:52 PDT 2017


On Fri, Aug 4, 2017 at 10:24 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>> On Aug 4, 2017, at 9:23 AM, Wei Mi <wmi at google.com> wrote:
>>
>> On Thu, Aug 3, 2017 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>
>>>> On Aug 3, 2017, at 4:46 PM, Wei Mi <wmi at google.com> wrote:
>>>>
>>>> I found isAMCompletelyFolded return true for formula like
>>>> reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) +
>>>> imm(4) on ARM. Does it make sense?
>>>
>>>
>>> That sounds wrong.
>>
>> With the following fix, the regression in test.ll is fixed, but I
>> am not familiar with ARM, so I am not sure whether allowing a
>> negative scale has any other use. Quentin, does the fix look
>> correct to you?
>>
>> Index: lib/Target/ARM/ARMISelLowering.cpp
>> ===================================================================
>> --- lib/Target/ARM/ARMISelLowering.cpp (revision 309240)
>> +++ lib/Target/ARM/ARMISelLowering.cpp (working copy)
>> @@ -12411,7 +12411,6 @@ bool ARMTargetLowering::isLegalAddressin
>>     case MVT::i1:
>>     case MVT::i8:
>>     case MVT::i32:
>> -      if (Scale < 0) Scale = -Scale;
>
> I’d need to have more context, but that looks reasonable.
>
> There might be an additional thing to fix, because the fact that we have both an immediate and a register for the offset should already tell us this cannot be completely folded. That being said, I haven’t checked the call site, and it is possible we just ask about the register part of the formula (that would be strange, though).

(gdb) p F.dump()
reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) + imm(4)

(gdb) p isAMCompletelyFolded(TTI, LU, F)
$2 = true

(gdb) p F
$3 = (const (anonymous namespace)::Formula &) @0x7c40b0: {BaseGV =
0x0, BaseOffset = 0, HasBaseReg = true, Scale = -1,
  BaseRegs = {<llvm::SmallVectorImpl<llvm::SCEV const*>> =
{<llvm::SmallVectorTemplateBase<llvm::SCEV const*, true>> =
{<llvm::SmallVectorTemplateCommon<llvm::SCEV const*, void>> =
{<llvm::SmallVectorBase> = {BeginX = 0x7c40e8, EndX = 0x7c40f0,
CapacityX = 0x7c4108}, FirstEl = {<llvm::AlignedCharArray<8ul, 8ul>> =
{
              buffer = "\360Ix\000\000\000\000"}, <No data fields>}},
<No data fields>}, <No data fields>}, Storage = {InlineElts =
{{<llvm::AlignedCharArray<8ul, 8ul>> = {
            buffer = "\000\000\000\000\000\000\000"}, <No data
fields>}, {<llvm::AlignedCharArray<8ul, 8ul>> = {buffer =
"\000\000\000\000\000\000\000"}, <No data fields>},
        {<llvm::AlignedCharArray<8ul, 8ul>> = {buffer =
"\000\000\000\000\000\000\000"}, <No data fields>}}}}, ScaledReg =
0x7b0aa0, UnfoldedOffset = 4}

F.BaseOffset is 0; the imm(4) is put into UnfoldedOffset, and its
cost is considered separately.
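For context, here is a hedged standalone sketch of why removing the `if (Scale < 0) Scale = -Scale;` line changes the answer. This is a hypothetical model of only the Scale handling in the MVT::i1/i8/i32 case of ARMTargetLowering::isLegalAddressingMode (the real method also checks BaseOffs, BaseGV, and the other value types), so the function name and the exact power-of-two check are assumptions, not the real LLVM code:

```cpp
// Hypothetical standalone sketch (NOT the real LLVM code) of the Scale
// handling for MVT::i1/i8/i32 in ARMTargetLowering::isLegalAddressingMode,
// with the proposed fix applied: the `if (Scale < 0) Scale = -Scale;`
// line has been removed, so negative scales are no longer canonicalized
// to positive ones before the legality checks.
static bool isScaleLegalSketch(int Scale) {
  if (Scale == 0)
    return true;  // r + imm: no scaled register involved
  if (Scale == 1)
    return true;  // r + r
  // r + r << imm: assumed power-of-two check, mirroring isPowerOf2_32.
  if (Scale > 0 && (Scale & (Scale - 1)) == 0)
    return true;
  // With the fix, Scale = -1 (as in the formula above) lands here.
  return false;
}
```

With the `Scale = -Scale` line still in place, Scale = -1 would be rewritten to 1 and accepted, which is how a formula like the one above could claim to be completely folded even though the negative scale cannot actually be folded into the load/store.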

Wei.

>
>
>>       if (Scale == 1)
>>         return true;
>>       // r + r << imm
>>
>>
>>
>>>
>>>> The fact is we cannot fold it into
>>>> load/store and it leads to the additional instructions causing
>>>> performance regressions.
>>>>
>>>>
>>>> On Thu, Aug 3, 2017 at 11:12 AM, Wei Mi via Phabricator
>>>> <reviews at reviews.llvm.org> wrote:
>>>>> wmi added a comment.
>>>>>
>>>>> In https://reviews.llvm.org/D34583#830286, @eastig wrote:
>>>>>
>>>>>> F4171725: test.ll <https://reviews.llvm.org/F4171725>
>>>>>>
>>>>>> F4171724: test.good.ll <https://reviews.llvm.org/F4171724>
>>>>>>
>>>>>> F4171723: test.bad.ll <https://reviews.llvm.org/F4171723>
>>>>>>
>>>>>> This patch caused regressions of 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result with filtering disabled; 'test.bad.ll' is the result with filtering enabled.
>>>>>> Comparing them, I can see that this optimization changes how the induction variable is updated. Originally it is incremented from 0 to 256; the optimization changes this to decrementing from 0 to -256. The induction variable is also used as an offset into memory, so to preserve the semantics, a conversion of the induction variable from a negative value to a positive value is inserted. This is lowered to additional instructions, which causes the performance regressions.
>>>>>>
>>>>>> Could you please have a look at this issue?
>>>>>>
>>>>>> Thanks,
>>>>>> Evgeny Astigeevich
>>>>>> The ARM Compiler Optimization team leader
>>>>>
>>>>>
>>>>> Hi Evgeny,
>>>>>
>>>>> Thanks for providing the testcase.
>>>>>
>>>>> It looks like an existing issue in LSR cost evaluation that is exposed by the patch. Comparing the traces produced with -debug-only=loop-reduce, all the candidates chosen by LSR without filtering are still kept in the candidate set after the filtering patch is applied. However, the filtering patch surfaces some additional candidates that are interesting to the LSR cost model, and LSR then chooses a different set of candidates for the final result, one it considers better (one fewer base add) but which is actually worse. We can see that in the trace:
>>>>>
>>>>> LSR without the filtering patch:
>>>>> The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:
>>>>>
>>>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>   reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>>>   reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>   reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>>>   reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>   reg(%c0.0103.us.i.i)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>   reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)
>>>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>   reg(%c1.0104.us.i.i)
>>>>>
>>>>> LSR with the filtering patch:
>>>>> The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:
>>>>>
>>>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>   reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>>>>   reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>   reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>>>>   reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)
>>>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>   reg(%c0.0103.us.i.i)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>   reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>>>>   reg(%c1.0104.us.i.i)
>>>>>
>>>>> The real problem is that LSR has no idea about the cost of materializing a negative value. It thinks 4*reg({0,+,1}) and -4*reg({0,+,-1}) have the same cost.
>>>>>
>>>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>>>>   reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>>>>   reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>>>>
>>>>> I will think about how to fix it.
>>>>>
>>>>> Wei.
>>>>>
>>>>>
>>>>> Repository:
>>>>> rL LLVM
>>>>>
>>>>> https://reviews.llvm.org/D34583
>>>>>
>>>>>
>>>>>
>>>
>

