[PATCH][AArch64] Use 8-byte load&store for inlined memcpy() on Cortex A53
Juergen Ributzka
juergen at apple.com
Wed Jul 16 09:50:17 PDT 2014
CC Jim and Tim
On Jul 16, 2014, at 7:46 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> Hi Sergey,
>
> Also, generated code doesn't always like "ldp; stp; ldp; stp", but can
> be either of form:
> ldp
> ...something...
> stp
> or:
> ldp
> ldp
> stp
> stp
> Is it OK for performance?
>
> I'm afraid that would be a performance regression. That sequence requires twice the number of live registers. For performance it is important to have the ldp and stp properly interleaved, both on A57 and A53.
>
> I also suspect that f128 logic was put in for Cyclone so Apple might get annoyed if you remove it ;)
>
> Cheers,
>
> James
>
>
> On 16 July 2014 15:33, Sergey Dmitrouk <sdmitrouk at accesssoftek.com> wrote:
> Hi again,
>
> I was going through code looking what's wrong with generic load&store
> generation, but found the issue in test. Using "undef" as destination
> for "llvm.memcpy" as it was in other test is bad idea, code generator
> actually produces meaningless code in this case. Destination pointer
> is not advanced for store instructions (offset is just ignored), which
> prevents ldp/stp pass to squish sequences of ldr/str instructions.
>
> So I fixed the test and this time provide two patches: one for
> Cortex-A53 only and the other one for all subtargets. Take a look at
> the second one, I don't really like that it basically removes code for
> 128-bit types, maybe there is a reason why it was put there.
>
> Also, generated code doesn't always like "ldp; stp; ldp; stp", but can
> be either of form:
> ldp
> ...something...
> stp
> or:
> ldp
> ldp
> stp
> stp
> Is it OK for performance?
>
> Regards,
> Sergey
>
> On Wed, Jul 16, 2014 at 12:08:51AM +0300, Sergey Dmitrouk wrote:
> > Hi James,
> >
> > Sorry, I missed that difference. It shouldn't be hard to do that ideal
> > sequence for all targets though. One should provide custom
> > implementation for AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which
> > was my initial way of doing this task. Code in that callback can generate
> > any instructions, so I'll try to do that.
> >
> > The only issue is that SelectionDAG prefers its own load&store generator and
> > invokes it before the custom one. I don't know a good way to overcome
> > that at the moment. It's possible to set unreasonable limits for
> > instructions emitted by generic load&store implementation, which will cause
> > SelectionDAG to reject it, but it's a hack, which I'd prefer not to
> > implement. There should be a better way.
> >
> > Thanks & Cheers,
> > Sergey
> >
> > On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:
> > > Hi Sergey,
> > > Thanks for working on this! The answer is slightly more involved though, I
> > > think.
> > > As shown in your testcase, your change emits the sequence "ldr; str; ldr;
> > > str". The ideal expansion is "ldp; stp; ldp; stp;". That way we still do
> > > 128-bit loads and stores.
> > > In fact, our microarchitects have recommended (through internal channels)
> > > that the "ldp; stp" sequence be used for memcpy-like operations - this
> > > will give portable performance. Therefore, the change should also be made
> > > for at least A57. I'll let Tim or Jim comment on Cyclone.
> > > So to generate "ldp stp", the inline memcpy expander needs to generate
> > > "ldr; ldr; str; str;". The ldp/stp pass will then squish these together.
> > > A similar thing is done in the ARM target (which gets combined into LDRD
> > > or LDM), but it's ARM-only. I think some logic needs to me moved into the
> > > target-independent part of codegen.
> > > Cheers,
> > > James
> > >
> > > On 15 July 2014 09:15, Sergey Dmitrouk <sdmitrouk at accesssoftek.com> wrote:
> > >
> > > Hi,
> > >
> > > Basing on the following information from [this post][0] by James Molloy:
> > >
> > > A * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
> > > A [..]" pairs, which is less than ideal on A53. If we switched to
> > > A emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30%
> > > A better inline memcpy performance on A53. A57 seems to deal well with
> > > A the LDR q sequence.
> > >
> > > I've made a patch (attached) that does this for Cortex-A53. A Please
> > > take a look at it.
> > >
> > > Best regards,
> > > Sergey
> > >
> > > 0: http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269
> > >
> > > _______________________________________________
> > > llvm-commits mailing list
> > > llvm-commits at cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140716/f575740c/attachment.html>
More information about the llvm-commits
mailing list