<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<style type="text/css" style="">
<!--
p
{margin-top:0;
margin-bottom:0}
-->
</style>
<div dir="ltr">
<div id="x_divtagdefaultwrapper" dir="ltr" style="font-size:11pt; color:#000000; font-family:Calibri,Helvetica,sans-serif">
<p>Hi Wei,</p>
<p><br>
</p>
<p>The fix will cause <span>CodeGen/ARM/arm-negative-stride.ll to fail. The check was introduced in r35859 (10 years ago):</span></p>
<p><span></span></p>
<div><br>
</div>
<div></div>
<div>------------------------------------------------------------------------</div>
<div>r35859 | lattner | 2007-04-10 04:48:29 +0100 (Tue, 10 Apr 2007) | 2 lines</div>
<div><br>
</div>
<div>restore support for negative strides</div>
<div><br>
</div>
<div>------------------------------------------------------------------------</div>
<div><br>
</div>
<div>The test has the triple <span>arm-eabi, so code is generated for <span>arm7tdmi (ARMv4). Maybe we need a separate case for the M-profile. I'll try to figure it out.</span></span></div>
<div><span><span><br>
</span></span></div>
<div><span><span>Thanks,</span></span></div>
<div><span style="font-family:"Segoe UI","Segoe UI Emoji","Segoe UI Symbol",Lato,"Helvetica Neue",Helvetica,Arial,sans-serif; font-size:13px">Evgeny Astigeevich</span><br>
</div>
<span style="font-family:"Segoe UI","Segoe UI Emoji","Segoe UI Symbol",Lato,"Helvetica Neue",Helvetica,Arial,sans-serif; font-size:13px">The ARM Compiler Optimization team leader</span><br>
<p></p>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="x_divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Wei Mi <wmi@google.com><br>
<b>Sent:</b> Friday, August 4, 2017 5:23:47 PM<br>
<b>To:</b> Quentin Colombet<br>
<b>Cc:</b> reviews+D34583+public+14017146d8617607@reviews.llvm.org; Evgeny Stupachenko; Xinliang David Li; Sanjoy Das; Evgeny Astigeevich; Matthias Braun; Dehao Chen; llvm-commits; Michael Zolotukhin<br>
<b>Subject:</b> Re: [PATCH] D34583: [LSR] Narrow search space by filtering non-optimal formulae with the same ScaledReg and Scale.</font>
<div> </div>
</div>
</div>
<font size="2"><span style="font-size:10pt;">
<div class="PlainText">On Thu, Aug 3, 2017 at 6:08 PM, Quentin Colombet <qcolombet@apple.com> wrote:<br>
><br>
>> On Aug 3, 2017, at 4:46 PM, Wei Mi <wmi@google.com> wrote:<br>
>><br>
>> I found isAMCompletelyFolded return true for formula like<br>
>> reg(%c1.0104.us.i.i) + -1*reg({0,+,-4}<nsw><%for.body8.us.i.i>) +<br>
>> imm(4) on ARM. Does it make sense?<br>
><br>
><br>
> That sounds wrong.<br>
<br>
The following fix resolves the regression in test.ll, but I am not<br>
familiar with ARM, so I am not sure whether allowing a negative scale<br>
has other uses. Quentin, does the fix look correct to you?<br>
<br>
Index: lib/Target/ARM/ARMISelLowering.cpp<br>
===================================================================<br>
--- lib/Target/ARM/ARMISelLowering.cpp (revision 309240)<br>
+++ lib/Target/ARM/ARMISelLowering.cpp (working copy)<br>
@@ -12411,7 +12411,6 @@ bool ARMTargetLowering::isLegalAddressin<br>
case MVT::i1:<br>
case MVT::i8:<br>
case MVT::i32:<br>
- if (Scale < 0) Scale = -Scale;<br>
if (Scale == 1)<br>
return true;<br>
// r + r << imm<br>
<br>
<br>
<br>
><br>
>> The fact is we cannot fold it into<br>
>> load/store and it leads to the additional instructions causing<br>
>> performance regressions.<br>
>><br>
>><br>
>> On Thu, Aug 3, 2017 at 11:12 AM, Wei Mi via Phabricator<br>
>> <reviews@reviews.llvm.org> wrote:<br>
>>> wmi added a comment.<br>
>>><br>
>>> In <a href="https://reviews.llvm.org/D34583#830286">https://reviews.llvm.org/D34583#830286</a>, @eastig wrote:<br>
>>><br>
>>>> F4171725: test.ll <<a href="https://reviews.llvm.org/F4171725">https://reviews.llvm.org/F4171725</a>><br>
>>>><br>
>>>> F4171724: test.good.ll <<a href="https://reviews.llvm.org/F4171724">https://reviews.llvm.org/F4171724</a>><br>
>>>><br>
>>>> F4171723: test.bad.ll <<a href="https://reviews.llvm.org/F4171723">https://reviews.llvm.org/F4171723</a>><br>
>>>><br>
>>>> This patch caused regressions from 5% to 23% in two of our internal benchmarks on Cortex-M23 and Cortex-M0+. I attached test.ll, which is reduced from the benchmarks. I used LLVM revision 309830. 'test.good.ll' is the result when filtering is disabled; 'test.bad.ll'
is the result when filtering is enabled.<br>
>>>> Comparing them, I can see that this optimization changes how an induction variable is updated. Originally it is incremented from 0 to 256; the optimization changes this into decrementing from 0 to -256. This induction variable is also used as an offset
into memory. So, to preserve the semantics, a conversion of the induction variable from a negative value to a positive value is inserted. This is lowered to additional instructions, which causes the performance regressions.<br>
>>>><br>
>>>> Could you please have a look at this issue?<br>
>>>><br>
>>>> Thanks,<br>
>>>> Evgeny Astigeevich<br>
>>>> The ARM Compiler Optimization team leader<br>
>>><br>
>>><br>
>>> Hi Evgeny,<br>
>>><br>
>>> Thanks for providing the testcase.<br>
>>><br>
>>> It looks like an existing issue in the LSR cost evaluation, exposed by the patch. Comparing the traces (with -debug-only=loop-reduce), all the candidates chosen by LSR without filtering are still kept in the candidate set after the filtering patch is applied.
However, the filtering patch surfaces some additional candidates for the LSR cost model to consider, and LSR then chooses a different final set of candidates that it thinks is better (one fewer base add) but actually is not. We can see that in the traces:<br>
>>><br>
>>> LSR without the filtering patch:<br>
>>> The chosen solution requires 5 regs, with addrec cost 2, plus 2 base adds, plus 4 imm cost, plus 1 setup cost:<br>
>>><br>
>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*<br>
>>> reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + 1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32<br>
>>> reg(256) + -1*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*<br>
>>> reg(%c0.0103.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*<br>
>>> reg({(-4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*<br>
>>> reg(%c0.0103.us.i.i)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*<br>
>>> reg(%c1.0104.us.i.i) + 4*reg({0,+,1}<nuw><nsw><%for.body8.us.i.i>) + imm(4)<br>
>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*<br>
>>> reg(%c1.0104.us.i.i)<br>
>>><br>
>>> LSR with the filtering patch:<br>
>>> The chosen solution requires 5 regs, with addrec cost 2, plus 1 base add, plus 4 imm cost, plus 1 setup cost:<br>
>>><br>
>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*<br>
>>> reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32<br>
>>> reg(-256) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*<br>
>>> reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0,4}, widest fixup type: i32*<br>
>>> reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>) + imm(-8)<br>
>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*<br>
>>> reg(%c0.0103.us.i.i)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*<br>
>>> reg({(4 + %c1.0104.us.i.i)<nsw>,+,4}<nsw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Special, Offsets={0}, all-fixups-outside-loop, widest fixup type: i32*<br>
>>> reg(%c1.0104.us.i.i)<br>
>>><br>
>>> The real problem is that LSR has no idea about the cost of materializing a negative value. It thinks 4*reg({0,+,-1}) and -4*reg({0,+,-1}) have the same cost.<br>
>>><br>
>>> LSR Use: Kind=Address of i8 in addrspace(0), Offsets={0}, widest fixup type: i8*<br>
>>> reg({%ptr1,+,256}<%for.cond6.preheader.us.i.i>) + -1*reg({0,+,-1}<nw><%for.body8.us.i.i>)<br>
>>> LSR Use: Kind=Address of i32 in addrspace(0), Offsets={0}, widest fixup type: i32*<br>
>>> reg(%c0.0103.us.i.i) + -4*reg({0,+,-1}<nw><%for.body8.us.i.i>)<br>
>>><br>
>>> I will think about how to fix it.<br>
>>><br>
>>> Wei.<br>
>>><br>
>>><br>
>>> Repository:<br>
>>> rL LLVM<br>
>>><br>
>>> <a href="https://reviews.llvm.org/D34583">https://reviews.llvm.org/D34583</a><br>
>>><br>
>>><br>
>>><br>
><br>
</div>
</span></font>
</body>
</html>