<div dir="ltr"><div>>> We've also seen similar instances where multiple registers are used to compute very similar<br>
>> addresses (such as x+0 and x+4!) and this increases register pressure.<br>
<br>I don't have an ARM-enabled build of the tools to test with, but I suspect what I'm seeing here:<br><a href="http://llvm.org/bugs/show_bug.cgi?id=20134">http://llvm.org/bugs/show_bug.cgi?id=20134</a><br></div>
<br><div class="gmail_extra">...would also be bad on AArch64.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 25, 2014 at 8:58 PM, Manjunath DN <span dir="ltr"><<a href="mailto:manjunath.dn@gmail.com" target="_blank">manjunath.dn@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi James,</div><div>Thanks for your reply and hints on what can be done to optimize the AArch64 backend in LLVM.</div>
<div>We have a SPEC license and ARMv8 hardware, so I will start looking into it.</div>
<div>warm regards</div><div>Manjunath</div><div> </div></div><div class="gmail_extra"><div><div class="h5"><br><br><div class="gmail_quote">On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <span dir="ltr"><<a href="mailto:james.molloy@arm.com" target="_blank">james.molloy@arm.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Manjunath,<br>
<br>
At the time of writing that status we had only done our initial analysis.<br>
This was done without real hardware and attempted to identify poor code<br>
sequences but we were unable to quantify how much effect this would actually<br>
have.<br>
<br>
Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an<br>
internal development platform.<br>
<br>
For SPEC, we are between 10% and 0% behind GCC on 9 benchmarks, and 25%<br>
ahead on one benchmark. Most benchmarks are less than 5% behind GCC.<br>
<br>
Because of the licensing of SPEC, I have to be quite restricted in what I<br>
say and I can't give any numbers - sorry about that.<br>
<br>
We are focussing on Cortex-A57, and the things we've identified so far are:<br>
* The CSEL instruction behaves worse than the equivalent branch structure<br>
in at least one benchmark. In an out-of-order core, select-like instructions<br>
are going to be slower than their branched equivalent if the branch is<br>
predictable, because CSEL has two dependencies.<br>
<br>
* Redundant calculations inside if conditions. We've seen:<br>
1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of a[x]<br>
and c[y] are repeated, when they are common. We've also seen similar<br>
instances where multiple registers are used to compute very similar<br>
addresses (such as x+0 and x+4!) and this increases register pressure.<br>
2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of<br>
'a' against zero is done twice, when the flag results of the first<br>
comparison could be used for the second comparison.<br>
<br>
* For a loop such as "for (i = 0; i < n; ++i)<br>
{do_something_with(&x[i]);}", GCC is using &x[i] as the loop induction<br>
variable where LLVM uses i and performs the calculation &x[i] on every<br>
iteration. This only creates one more add instruction but the loop we see it<br>
in only has 5 or so instructions.<br>
<br>
* The inline heuristics are way behind GCC's. If we crank the inline<br>
threshold up to 1000, we can remove a 6.5% performance regression from one<br>
benchmark entirely.<br>
<br>
* We're generating (due to SLP vectorizer and a DAG combine) loads into Q<br>
registers when merging consecutive loads. This is bad, because there are no<br>
callee-saved Q registers! So if the live range crosses a function call, it<br>
will have to be immediately spilled again. This can be easily fixed by<br>
using load-pair instructions instead. I have a patch to fix this.<br>
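<br>
As an illustration of the loop induction variable item above, here is a<br>
minimal C sketch of the two loop shapes. The type and function names are<br>
hypothetical stand-ins (nothing here is taken from SPEC), and it assumes<br>
n is non-negative. It shows the source-level equivalent of each form, not<br>
the actual output of either compiler.<br>
<pre>
```c
/* Hypothetical element type and callee; stand-ins for the x[] and
   do_something_with() mentioned in the loop item above. */
struct elem { int value; };

static void do_something_with(struct elem *e) { e->value += 1; }

/* Index form: i is the induction variable, and &x[i] is recomputed
   from the base and the index on every iteration. */
void walk_by_index(struct elem *x, int n) {
    for (int i = 0; i < n; ++i)
        do_something_with(&x[i]);
}

/* Pointer form: the address itself is the induction variable, so each
   iteration needs one increment and one compare against the end.
   Assumes n >= 0. */
void walk_by_pointer(struct elem *x, int n) {
    struct elem *end = x + n;
    for (struct elem *p = x; p < end; ++p)
        do_something_with(p);
}
```
</pre>
Both functions visit the same elements in the same order; the difference<br>
is only in what gets incremented each time round the loop.<br>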
<br>
The list above is non-exhaustive and only contains things that we think may<br>
affect multiple benchmarks or real-world code.<br>
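<br>
For the first redundant-calculation item in the list, the intended shape<br>
can be sketched at the source level as follows. The struct layouts and<br>
function names are assumptions made up for illustration; only the field<br>
names b, d, e and f come from the quoted expression.<br>
<pre>
```c
/* Hypothetical record layouts; the field names b, d, e, f follow the
   expression quoted in the list above. */
struct rec_a { int b; int e; };
struct rec_c { int d; int f; };

/* As written in the source: a[x] and c[y] each appear twice. */
int test_direct(const struct rec_a *a, const struct rec_c *c, int x, int y) {
    return a[x].b < c[y].d || a[x].e > c[y].f;
}

/* Element addresses computed once; both fields are then reached as
   small constant offsets from the same base pointer. */
int test_hoisted(const struct rec_a *a, const struct rec_c *c, int x, int y) {
    const struct rec_a *ax = &a[x];
    const struct rec_c *cy = &c[y];
    return ax->b < cy->d || ax->e > cy->f;
}
```
</pre>
The two functions return the same result for every input; the second one<br>
just spells out the common address calculation that one base register per<br>
element would cover.<br>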
<br>
I've also noticed:<br>
* Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,<br>
[..]" pairs, which is less than ideal on A53. If we switched to emitting<br>
"LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline<br>
memcpy performance on A53. A57 seems to deal well with the LDR q sequence.<br>
<br>
I'm sorry I'm unable to provide code samples for most of the issues found so<br>
far - this is an artefact of them having come from SPEC. Trivial examples do<br>
not always show the same behaviour, and as we're still investigating we<br>
haven't yet been able to reduce most of these to an anonymisable testcase.<br>
<br>
Hope this helps, but doubt it does,<br>
<br>
James<br>
<br>
> -----Original Message-----<br>
> From: <a href="mailto:llvmdev-bounces@cs.uiuc.edu" target="_blank">llvmdev-bounces@cs.uiuc.edu</a> [mailto:<a href="mailto:llvmdev-bounces@cs.uiuc.edu" target="_blank">llvmdev-bounces@cs.uiuc.edu</a>] On<br>
> Behalf Of Manjunath N<br>
> Sent: 24 June 2014 10:45<br>
> To: <a href="mailto:llvmdev@cs.uiuc.edu" target="_blank">llvmdev@cs.uiuc.edu</a><br>
> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend<br>
><br>
><br>
><br>
> Eric Christopher <echristo <at> <a href="http://gmail.com" target="_blank">gmail.com</a>> writes:<br>
><br>
> ><br>
> > > The big pain issues I see merging from ARM64 to AArch64 are:<br>
> > > 1. Apple have created a fairly complete scheduling model already for<br>
> > > ARM64, and we'd have to merge the partial? model in AArch64 and theirs. We<br>
> > > risk regressing performance on Apple's targets here, and we can't determine<br>
> > > ourselves whether we have or not. This is not ideal.<br>
> > > 2. Porting over the DAG-to-DAG optimizations and any other<br>
> > > optimizations that rely on the tablegen layout will be very tricky.<br>
> > > 3. The conditional compare pass is fairly comprehensive - we'd have to<br>
> > > port that over or rewrite it and that would be a lot of work.<br>
> > > 4. A very quick analysis last night indicated that ARM64 has<br>
> > > implemented just under half of the optimizations we discovered opportunities<br>
> > > for in SPEC and EEMBC. That's a fairly comprehensive number of<br>
> > > optimizations, and they won't all be easy to port.<br>
> Eric,<br>
> You mention that there are quite a few optimization opportunities in SPEC<br>
> 2000/EEMBC.<br>
> I am looking to optimize the AArch64 backend. Could you please let me know<br>
> which major optimizations are possible?<br>
><br>
><br>
><br>
> _______________________________________________<br>
> LLVM Developers mailing list<br>
> <a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>
> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
<br>
<br>
<br>
<br>
</blockquote></div><br><br clear="all"><br></div></div><span class=""><font color="#888888">-- <br><div>=========================================<br>warm regards,<br>Manjunath DN<br></div>
</font></span></div>
<br></blockquote></div><br><br clear="all"><br>-- <br>Sanjay Patel<br>RotateRight, LLC<br><a href="http://www.rotateright.com">http://www.rotateright.com</a>
</div></div>