<div dir="ltr">Hi Sergey,<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-family:arial,sans-serif;font-size:13px">Also, generated code doesn't always like "ldp; stp; ldp; stp", but can<br>

</span><span style="font-family:arial,sans-serif;font-size:13px">be either of form:<br></span><span style="font-family:arial,sans-serif;font-size:13px">  ldp<br></span><span style="font-family:arial,sans-serif;font-size:13px">  ...something...<br>

</span><span style="font-family:arial,sans-serif;font-size:13px">  stp<br></span><span style="font-family:arial,sans-serif;font-size:13px">or:<br></span><span style="font-family:arial,sans-serif;font-size:13px">  ldp<br></span><span style="font-family:arial,sans-serif;font-size:13px">  ldp<br>

</span><span style="font-family:arial,sans-serif;font-size:13px">  stp<br></span><span style="font-family:arial,sans-serif;font-size:13px">  stp<br></span><span style="font-family:arial,sans-serif;font-size:13px">Is it OK for performance?</span></blockquote>

<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">I'm afraid that would be a performance regression. That sequence requires twice the number of live registers. For performance it is important to have the ldp and stp properly interleaved, both on A57 and A53.</span></div>

<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">I also suspect that f128 logic was put in for Cyclone so Apple might get annoyed if you remove it ;)</span></div>

<div><span style="font-family:arial,sans-serif;font-size:13px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13px">Cheers,</span></div><div><span style="font-family:arial,sans-serif;font-size:13px"><br>

</span></div><div><span style="font-family:arial,sans-serif;font-size:13px">James</span></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 16 July 2014 15:33, Sergey Dmitrouk <span dir="ltr"><<a href="mailto:sdmitrouk@accesssoftek.com" target="_blank">sdmitrouk@accesssoftek.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi again,<br>

<br>

I was going through code looking what's wrong with generic load&store<br>

generation, but found the issue in test.  Using "undef" as destination<br>

for "llvm.memcpy" as it was in other test is bad idea, code generator<br>

actually produces meaningless code in this case.  Destination pointer<br>

is not advanced for store instructions (offset is just ignored), which<br>

prevents ldp/stp pass to squish sequences of ldr/str instructions.<br>

<br>

So I fixed the test and this time provide two patches: one for<br>

Cortex-A53 only and the other one for all subtargets.  Take a look at<br>

the second one, I don't really like that it basically removes code for<br>

128-bit types, maybe there is a reason why it was put there.<br>

<br>

Also, generated code doesn't always like "ldp; stp; ldp; stp", but can<br>

be either of form:<br>

  ldp<br>

  ...something...<br>

  stp<br>

or:<br>

  ldp<br>

  ldp<br>

  stp<br>

  stp<br>

Is it OK for performance?<br>

<br>

Regards,<br>

Sergey<br>

<div class="HOEnZb"><div class="h5"><br>

On Wed, Jul 16, 2014 at 12:08:51AM +0300, Sergey Dmitrouk wrote:<br>

> Hi James,<br>

><br>

> Sorry, I missed that difference.  It shouldn't be hard to do that ideal<br>

> sequence for all targets though.  One should provide custom<br>

> implementation for AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which<br>

> was my initial way of doing this task.  Code in that callback can generate<br>

> any instructions, so I'll try to do that.<br>

><br>

> The only issue is that SelectionDAG prefers its own load&store generator and<br>

> invokes it before the custom one.  I don't know a good way to overcome<br>

> that at the moment.  It's possible to set unreasonable limits for<br>

> instructions emitted by generic load&store implementation, which will cause<br>

> SelectionDAG to reject it, but it's a hack, which I'd prefer not to<br>

> implement.  There should be a better way.<br>

><br>

> Thanks & Cheers,<br>

> Sergey<br>

><br>

> On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:<br>

> >    Hi Sergey,<br>

> >    Thanks for working on this! The answer is slightly more involved though, I<br>

> >    think.<br>

> >    As shown in your testcase, your change emits the sequence "ldr; str; ldr;<br>

> >    str". The ideal expansion is "ldp; stp; ldp; stp;". That way we still do<br>

> >    128-bit loads and stores.<br>

> >    In fact, our microarchitects have recommended (through internal channels)<br>

> >    that the "ldp; stp" sequence be used for memcpy-like operations - this<br>

> >    will give portable performance. Therefore, the change should also be made<br>

> >    for at least A57. I'll let Tim or Jim comment on Cyclone.<br>

> >    So to generate "ldp stp", the inline memcpy expander needs to generate<br>

> >    "ldr; ldr; str; str;". The ldp/stp pass will then squish these together.<br>

> >    A similar thing is done in the ARM target (which gets combined into LDRD<br>

> >    or LDM), but it's ARM-only. I think some logic needs to me moved into the<br>

> >    target-independent part of codegen.<br>

> >    Cheers,<br>

> >    James<br>

> ><br>

> >    On 15 July 2014 09:15, Sergey Dmitrouk <<a href="mailto:sdmitrouk@accesssoftek.com">sdmitrouk@accesssoftek.com</a>> wrote:<br>

> ><br>

> >      Hi,<br>

> ><br>

> >      Basing on the following information from [this post][0] by James Molloy:<br>

> ><br>

> >      A  * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,<br>

> >      A  [..]" pairs, which is less than ideal on A53. If we switched to<br>

> >      A  emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30%<br>

> >      A  better inline memcpy performance on A53. A57 seems to deal well with<br>

> >      A  the LDR q sequence.<br>

> ><br>

> >      I've made a patch (attached) that does this for Cortex-A53. A Please<br>

> >      take a look at it.<br>

> ><br>

> >      Best regards,<br>

> >      Sergey<br>

> ><br>

> >      0: <a href="http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269" target="_blank">http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269</a><br>

> ><br>

> >      _______________________________________________<br>

> >      llvm-commits mailing list<br>

> >      <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

> >      <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div><br></div>