[PATCH][AArch64] Use 8-byte load&store for inlined memcpy() on Cortex A53

Tue Jul 15 14:08:51 PDT 2014

Hi James,

Sorry, I missed that difference.  It shouldn't be hard to emit that ideal
sequence for all targets, though.  One way is to provide a custom
implementation of AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which
was my initial approach to this task.  Code in that callback can generate
any instructions, so I'll try that.
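
For illustration, here is a minimal sketch of the emission order James
describes ("ldr; ldr; str; str" per 16-byte block, so that a later pairing
pass can merge neighbours into ldp/stp).  It is a standalone model, not LLVM
code: MemOp, buildPairedCopy and mnemonics are hypothetical names, and it
assumes the copy size is a whole number of 16-byte blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one emitted memory operation; not an LLVM type.
struct MemOp {
  bool IsLoad;        // true for a load, false for a store
  unsigned Reg;       // index of the 8-byte register used
  std::size_t Offset; // byte offset from the base pointer
};

// Build the operation order for copying Bytes bytes in 8-byte pieces.
// Within each 16-byte block both loads come first, then both stores,
// which leaves adjacent loads (and stores) next to each other.
std::vector<MemOp> buildPairedCopy(std::size_t Bytes) {
  assert(Bytes % 16 == 0 && "sketch only handles whole 16-byte blocks");
  std::vector<MemOp> Ops;
  for (std::size_t Off = 0; Off < Bytes; Off += 16) {
    Ops.push_back({true, 0, Off});
    Ops.push_back({true, 1, Off + 8});
    Ops.push_back({false, 0, Off});
    Ops.push_back({false, 1, Off + 8});
  }
  return Ops;
}

// Render the sequence as mnemonics separated by "; ".
std::string mnemonics(const std::vector<MemOp> &Ops) {
  std::string S;
  for (const MemOp &Op : Ops) {
    if (!S.empty())
      S += "; ";
    S += Op.IsLoad ? "ldr" : "str";
  }
  return S;
}
```

For a 16-byte copy this yields the order "ldr; ldr; str; str" rather than
the interleaved "ldr; str; ldr; str" that the current patch produces.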

The only issue is that SelectionDAG prefers its own load/store generator and
invokes it before the custom one.  I don't know a good way to work around
that at the moment.  It's possible to set unreasonably low limits on the
instructions the generic load/store implementation may emit, which would
cause SelectionDAG to reject it, but that's a hack I'd prefer not to
implement.  There should be a better way.
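
To make the ordering problem concrete, here is a simplified standalone model
of the decision as I understand it.  It is not the actual SelectionDAG code:
Lowering, chooseLowering and its parameters are hypothetical names, and the
real logic takes more factors into account.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical outcomes of lowering a memcpy with a known constant size.
enum class Lowering { GenericLoadsAndStores, TargetHook, LibCall };

// Simplified model: the generic expansion is tried first and is accepted
// whenever it fits within the store limit.  Only if it is rejected does
// the target-specific hook get a chance; otherwise a library call is used.
Lowering chooseLowering(std::size_t Size, unsigned MaxStores,
                        std::size_t StoreWidth, bool TargetHasHook) {
  assert(StoreWidth != 0 && "store width must be non-zero");
  // Number of stores the generic expansion would need (rounded up).
  std::size_t Needed = (Size + StoreWidth - 1) / StoreWidth;
  if (Needed <= MaxStores)
    return Lowering::GenericLoadsAndStores;
  if (TargetHasHook)
    return Lowering::TargetHook;
  return Lowering::LibCall;
}
```

With a normal limit, chooseLowering(32, 4, 16, true) picks the generic
expansion and the hook is never consulted.  Dropping the limit to zero
forces the hook to run, which is exactly the hack I'd like to avoid.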

Thanks & Cheers,
Sergey

On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:
>    Hi Sergey,
>    Thanks for working on this! The answer is slightly more involved though, I
>    think.
>    As shown in your testcase, your change emits the sequence "ldr; str; ldr;
>    str". The ideal expansion is "ldp; stp; ldp; stp;". That way we still do
>    128-bit loads and stores.
>    In fact, our microarchitects have recommended (through internal channels)
>    that the "ldp; stp" sequence be used for memcpy-like operations - this
>    will give portable performance. Therefore, the change should also be made
>    for at least A57. I'll let Tim or Jim comment on Cyclone.
>    So to generate "ldp stp", the inline memcpy expander needs to generate
>    "ldr; ldr; str; str;". The ldp/stp pass will then squish these together.
>    A similar thing is done in the ARM target (which gets combined into LDRD
>    or LDM), but it's ARM-only. I think some logic needs to be moved into the
>    target-independent part of codegen.
>    Cheers,
>    James
> 
>    On 15 July 2014 09:15, Sergey Dmitrouk <sdmitrouk at accesssoftek.com> wrote:
> 
>      Hi,
> 
>      Based on the following information from [this post][0] by James Molloy:
> 
>        * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
>        [..]" pairs, which is less than ideal on A53. If we switched to
>        emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30%
>        better inline memcpy performance on A53. A57 seems to deal well with
>        the LDR q sequence.
> 
>      I've made a patch (attached) that does this for Cortex-A53. Please
>      take a look at it.
> 
>      Best regards,
>      Sergey
> 
>      0: http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269