<html><head><meta http-equiv="Content-Type" content="text/html charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">CC Jim and Tim<br><div><div>On Jul 16, 2014, at 7:46 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div dir="ltr">Hi Sergey,<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-family:arial,sans-serif;font-size:13px">Also, generated code doesn't always like "ldp; stp; ldp; stp", but can<br>
</span><span style="font-family:arial,sans-serif;font-size:13px">be either of form:<br></span><span style="font-family:arial,sans-serif;font-size:13px"> ldp<br></span><span style="font-family:arial,sans-serif;font-size:13px"> ...something...<br>
</span><span style="font-family:arial,sans-serif;font-size:13px"> stp<br></span><span style="font-family:arial,sans-serif;font-size:13px">or:<br></span><span style="font-family:arial,sans-serif;font-size:13px"> ldp<br></span><span style="font-family:arial,sans-serif;font-size:13px"> ldp<br>
</span><span style="font-family:arial,sans-serif;font-size:13px"> stp<br></span><span style="font-family:arial,sans-serif;font-size:13px"> stp<br></span><span style="font-family:arial,sans-serif;font-size:13px">Is it OK for performance?</span></blockquote>
<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">I'm afraid that would be a performance regression. That sequence requires twice the number of live registers. For performance it is important to have the ldp and stp properly interleaved, both on A57 and A53.</span></div>
<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">I also suspect that f128 logic was put in for Cyclone so Apple might get annoyed if you remove it ;)</span></div>
<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">Cheers,</span></div><div><span style="font-family:arial,sans-serif;font-size:13px"><br>
</span></div><div><span style="font-family:arial,sans-serif;font-size:13px">James</span></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 16 July 2014 15:33, Sergey Dmitrouk <span dir="ltr"><<a href="mailto:sdmitrouk@accesssoftek.com" target="_blank">sdmitrouk@accesssoftek.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi again,<br>
<br>
I was going through code looking what's wrong with generic load&store<br>
generation, but found the issue in test. Using "undef" as destination<br>
for "llvm.memcpy" as it was in other test is bad idea, code generator<br>
actually produces meaningless code in this case. Destination pointer<br>
is not advanced for store instructions (offset is just ignored), which<br>
prevents ldp/stp pass to squish sequences of ldr/str instructions.<br>
<br>
So I fixed the test and this time provide two patches: one for<br>
Cortex-A53 only and the other one for all subtargets. Take a look at<br>
the second one, I don't really like that it basically removes code for<br>
128-bit types, maybe there is a reason why it was put there.<br>
<br>
Also, generated code doesn't always like "ldp; stp; ldp; stp", but can<br>
be either of form:<br>
ldp<br>
...something...<br>
stp<br>
or:<br>
ldp<br>
ldp<br>
stp<br>
stp<br>
Is it OK for performance?<br>
<br>
Regards,<br>
Sergey<br>
<div class="HOEnZb"><div class="h5"><br>
On Wed, Jul 16, 2014 at 12:08:51AM +0300, Sergey Dmitrouk wrote:<br>
> Hi James,<br>
><br>
> Sorry, I missed that difference. It shouldn't be hard to do that ideal<br>
> sequence for all targets though. One should provide custom<br>
> implementation for AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which<br>
> was my initial way of doing this task. Code in that callback can generate<br>
> any instructions, so I'll try to do that.<br>
><br>
> The only issue is that SelectionDAG prefers its own load&store generator and<br>
> invokes it before the custom one. I don't know a good way to overcome<br>
> that at the moment. It's possible to set unreasonable limits for<br>
> instructions emitted by generic load&store implementation, which will cause<br>
> SelectionDAG to reject it, but it's a hack, which I'd prefer not to<br>
> implement. There should be a better way.<br>
><br>
> Thanks & Cheers,<br>
> Sergey<br>
><br>
> On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:<br>
> > Hi Sergey,<br>
> > Thanks for working on this! The answer is slightly more involved though, I<br>
> > think.<br>
> > As shown in your testcase, your change emits the sequence "ldr; str; ldr;<br>
> > str". The ideal expansion is "ldp; stp; ldp; stp;". That way we still do<br>
> > 128-bit loads and stores.<br>
> > In fact, our microarchitects have recommended (through internal channels)<br>
> > that the "ldp; stp" sequence be used for memcpy-like operations - this<br>
> > will give portable performance. Therefore, the change should also be made<br>
> > for at least A57. I'll let Tim or Jim comment on Cyclone.<br>
> > So to generate "ldp stp", the inline memcpy expander needs to generate<br>
> > "ldr; ldr; str; str;". The ldp/stp pass will then squish these together.<br>
> > A similar thing is done in the ARM target (which gets combined into LDRD<br>
> > or LDM), but it's ARM-only. I think some logic needs to me moved into the<br>
> > target-independent part of codegen.<br>
> > Cheers,<br>
> > James<br>
> ><br>
> > On 15 July 2014 09:15, Sergey Dmitrouk <<a href="mailto:sdmitrouk@accesssoftek.com">sdmitrouk@accesssoftek.com</a>> wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> > Basing on the following information from [this post][0] by James Molloy:<br>
> ><br>
> > A * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,<br>
> > A [..]" pairs, which is less than ideal on A53. If we switched to<br>
> > A emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30%<br>
> > A better inline memcpy performance on A53. A57 seems to deal well with<br>
> > A the LDR q sequence.<br>
> ><br>
> > I've made a patch (attached) that does this for Cortex-A53. A Please<br>
> > take a look at it.<br>
> ><br>
> > Best regards,<br>
> > Sergey<br>
> ><br>
> > 0: <a href="http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269" target="_blank">http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269</a><br>
> ><br>
> > _______________________________________________<br>
> > llvm-commits mailing list<br>
> > <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>
> > <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>
</div></div></blockquote></div><br></div>
_______________________________________________<br>llvm-commits mailing list<br><a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits<br></blockquote></div><br></body></html>