[PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.

Thu Aug 3 18:08:46 PDT 2017

> On Aug 3, 2017, at 4:46 PM, Wei Mi <wmi at google.com> wrote:
> 
> I found that isAMCompletelyFolded returns true for a formula like
> reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) +
> imm(4) on ARM. Does it make sense?

That sounds wrong.

> The fact is we cannot fold it into
> load/store and it leads to the additional instructions causing
> performance regressions.
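
To make the concern concrete, here is a minimal sketch of the kind of check one would expect to reject that formula. It is not LLVM's isAMCompletelyFolded; the `Formula` class and `folds_on_thumb1_like` function are hypothetical names, and the sketch assumes a Thumb1-like target with only two load/store forms: register plus register with no scaling, and register plus a small non-negative immediate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    """Toy stand-in for an LSR formula: base_reg + scale*scaled_reg + imm."""
    has_base_reg: bool
    scale: int  # 0 means there is no scaled register
    imm: int


def folds_on_thumb1_like(f: Formula, access_bytes: int = 4) -> bool:
    """Hypothetical legality check, NOT LLVM's isAMCompletelyFolded.

    Assumed addressing forms:
      [Rn, Rm]    base + register, scale exactly 1, no immediate
      [Rn, #imm]  base + immediate in 0..31*access_bytes, multiple of size
    """
    if not f.has_base_reg:
        return False
    if f.scale == 0:
        in_range = 0 <= f.imm <= 31 * access_bytes
        return in_range and f.imm % access_bytes == 0
    # A scaled register folds only as a plain, unscaled register offset.
    return f.scale == 1 and f.imm == 0


# reg(%c1) + -1*reg({0,+,-4}) + imm(4) from the report above
reported = Formula(has_base_reg=True, scale=-1, imm=4)
print(folds_on_thumb1_like(reported))
```

Under these assumptions the reported formula is rejected twice over: the scale is -1, and a register offset is combined with an immediate.
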
> 
> 
> On Thu, Aug 3, 2017 at 11:12 AM, Wei Mi via Phabricator
> <reviews at reviews.llvm.org> wrote:
>> wmi added a comment.
>> 
>> In https://reviews.llvm.org/D34583#830286, @eastig wrote:
>> 
>>> F4171725: test.ll <https://reviews.llvm.org/F4171725>
>>> 
>>> F4171724: test.good.ll <https://reviews.llvm.org/F4171724>
>>> 
>>> F4171723: test.bad.ll <https://reviews.llvm.org/F4171723>
>>> 
>>> This patch caused regressions of 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result when filtering is disabled; 'test.bad.ll' is the result when filtering is enabled.
>>> Comparing them, I can see that this optimization changes how an induction variable is updated. Originally it is incremented from 0 to 256. The optimization changes this into decrementing from 0 to -256. This induction variable is also used as a memory offset, so to preserve the semantics a conversion of the induction variable from a negative value to a positive value is inserted. This is lowered to additional instructions, which cause the performance regressions.
>>> 
>>> Could you please have a look at this issue?
>>> 
>>> Thanks,
>>> Evgeny Astigeevich
>>> The ARM Compiler Optimization team leader
>> 
>> 
>> Hi Evgeny,
>> 
>> Thanks for providing the testcase.
>> 
>> It looks like an existing issue in LSR cost evaluation that the patch exposes. Comparing the traces produced with -debug-only=loop-reduce, all the candidates chosen by LSR without filtering are still in the candidate set after the filtering patch is applied. However, the filtering patch provides some additional candidates that the LSR cost model finds attractive, and LSR chooses a different set of candidates for the final result, which it thinks is better (one fewer base add) but actually is not. We can see that in the trace:
>> 
>> LSR without the filtering patch:
>> The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:
>> 
>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>    reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>    reg(%c0.0103.us.i.i)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)
>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>    reg(%c1.0104.us.i.i)
>> 
>> LSR with the filtering patch:
>> The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:
>> 
>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>  LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32
>>    reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*
>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)
>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>    reg(%c0.0103.us.i.i)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*
>>    reg(%c1.0104.us.i.i)
>> 
>> The real problem is that LSR has no notion of the cost of producing a negated value. It thinks 4*reg({0,+,-1}) and -4*reg({0,+,-1}) have the same cost.
>> 
>>  LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*
>>    reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>>  LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*
>>    reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)
>> 
>> I will think about how to fix it.
>> 
>> Wei.
>> 
>> 
>> Repository:
>>  rL LLVM
>> 
>> https://reviews.llvm.org/D34583