[PATCH][AArch64] Use 8-byte load&store for inlined memcpy() on Cortex A53

James Molloy james at jamesmolloy.co.uk
Fri Jul 18 09:01:15 PDT 2014


Hi,

We definitely don't want that for A57 or A53. It would incur a
cross-register-class copy which is not cheap.

Cheers,

James


On 18 July 2014 16:25, Stephen Canon <scanon at apple.com> wrote:

> James raises an interesting point with regard to register pressure.  There
> is *way* more available volatile register state in SIMD than there is in
> GPR under the 64-bit PCS, and code with heavy SIMD register pressure
> usually doesn’t involve much copying.  For small object copies (the sort
> that are likely to get inlined) the cost of a spill is a significant
> portion of the cost of a copy; I expect this to tip the balance *very
> slightly* in favor of using the SIMD registers on Cyclone (I’m a bit
> surprised if it doesn’t on A57 as well, though this is very definitely a
> second-order effect).
>
> – Steve
>
> On Jul 16, 2014, at 12:50 PM, Juergen Ributzka <juergen at apple.com> wrote:
>
> CC Jim and Tim
> On Jul 16, 2014, at 7:46 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>
> Hi Sergey,
>
>> Also, the generated code doesn't always look like "ldp; stp; ldp; stp", but can
>> be in either of these forms:
>>   ldp
>>   ...something...
>>   stp
>> or:
>>   ldp
>>   ldp
>>   stp
>>   stp
>> Is it OK for performance?
>
>
> I'm afraid that would be a performance regression. The grouped form ("ldp;
> ldp; stp; stp") requires twice as many live registers. For performance it is
> important to have the ldp and stp properly interleaved, on both A57 and A53.
>
> I also suspect that the f128 logic was put in for Cyclone, so Apple might get
> annoyed if you remove it ;)
>
> Cheers,
>
> James
>
>
> On 16 July 2014 15:33, Sergey Dmitrouk <sdmitrouk at accesssoftek.com> wrote:
>
>> Hi again,
>>
>> I was going through the code looking for what's wrong with the generic
>> load & store generation, but found that the issue is in the test.  Using
>> "undef" as the destination for "llvm.memcpy", as another test did, is a bad
>> idea; the code generator actually produces meaningless code in that case.
>> The destination pointer is not advanced for the store instructions (the
>> offset is simply ignored), which prevents the ldp/stp pass from squishing
>> sequences of ldr/str instructions.
>>
>> So I fixed the test, and this time I'm providing two patches: one for
>> Cortex-A53 only and another for all subtargets.  Please take a look at the
>> second one; I don't really like that it basically removes the code for
>> 128-bit types, and maybe there is a reason why it was put there.
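>>
>> For reference, if the code in question is the f128 special case in
>> AArch64TargetLowering::getOptimalMemOpType (the hook the generic memcpy
>> expander queries for a preferred access type), a Cortex-A53-only variant
>> might look roughly like the sketch below.  The subtarget predicate and the
>> guard conditions here are illustrative assumptions, not the actual patch:
>>
>>   EVT AArch64TargetLowering::getOptimalMemOpType(
>>       uint64_t Size, unsigned DstAlign, unsigned SrcAlign, bool IsMemset,
>>       bool ZeroMemset, bool MemcpyStrSrc, MachineFunction &MF) const {
>>     // Sketch only: on Cortex-A53 prefer 8-byte integer accesses so that
>>     // the AArch64 load/store optimiser can later pair them into ldp/stp;
>>     // keep the 128-bit (LDR q / STR q) path for other subtargets.
>>     if (!Subtarget->isCortexA53() && Size >= 16)  // isCortexA53() assumed
>>       return MVT::f128;
>>     if (Size >= 8)
>>       return MVT::i64;
>>     return MVT::Other;  // no preference; let the generic code choose
>>   }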
>>
>> Also, the generated code doesn't always look like "ldp; stp; ldp; stp", but can
>> be in either of these forms:
>>   ldp
>>   ...something...
>>   stp
>> or:
>>   ldp
>>   ldp
>>   stp
>>   stp
>> Is it OK for performance?
>>
>> Regards,
>> Sergey
>>
>> On Wed, Jul 16, 2014 at 12:08:51AM +0300, Sergey Dmitrouk wrote:
>> > Hi James,
>> >
>> > Sorry, I missed that difference.  It shouldn't be hard to produce that
>> > ideal sequence for all targets, though.  One should provide a custom
>> > implementation of AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which
>> > was my initial approach to this task.  The code in that callback can
>> > generate any instructions, so I'll try to do that.
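>> >
>> > A minimal sketch of such an override, assuming the 2014-era hook signature
>> > (the SelectionDAG helper signatures vary between LLVM revisions, and the
>> > size/alignment guards below are placeholders, not a tested implementation):
>> >
>> >   SDValue AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy(
>> >       SelectionDAG &DAG, SDLoc dl, SDValue Chain, SDValue Dst, SDValue Src,
>> >       SDValue Size, unsigned Align, bool isVolatile, bool AlwaysInline,
>> >       MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {
>> >     // Sketch only: handle small constant-size copies that are a multiple
>> >     // of 16 bytes by emitting i64 loads and stores; the AArch64
>> >     // load/store optimiser can then merge them into ldp/stp.
>> >     ConstantSDNode *C = dyn_cast<ConstantSDNode>(Size);
>> >     if (!C || isVolatile)
>> >       return SDValue();             // fall back to the default lowering
>> >     uint64_t Bytes = C->getZExtValue();
>> >     if (Bytes == 0 || Bytes % 16 != 0 || Bytes > 64)
>> >       return SDValue();
>> >
>> >     SmallVector<SDValue, 8> Stores;
>> >     for (uint64_t Off = 0; Off < Bytes; Off += 8) {
>> >       SDValue OffVal = DAG.getConstant(Off, MVT::i64);
>> >       SDValue SrcPtr = DAG.getNode(ISD::ADD, dl, MVT::i64, Src, OffVal);
>> >       SDValue DstPtr = DAG.getNode(ISD::ADD, dl, MVT::i64, Dst, OffVal);
>> >       SDValue Load = DAG.getLoad(MVT::i64, dl, Chain, SrcPtr,
>> >                                  SrcPtrInfo.getWithOffset(Off),
>> >                                  false, false, false, Align);
>> >       Stores.push_back(DAG.getStore(Load.getValue(1), dl, Load, DstPtr,
>> >                                     DstPtrInfo.getWithOffset(Off),
>> >                                     false, false, Align));
>> >     }
>> >     // memcpy guarantees no overlap, so each store only needs to depend on
>> >     // its own load; join all the store chains with a TokenFactor.
>> >     return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Stores);
>> >   }
>> >
>> > The scheduler and the ldp/stp pass would then decide the final interleaving.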
>> >
>> > The only issue is that SelectionDAG prefers its own load & store generator
>> > and invokes it before the custom one.  I don't know a good way to overcome
>> > that at the moment.  It's possible to set unreasonable limits on the number
>> > of instructions emitted by the generic load & store implementation, which
>> > would cause SelectionDAG to reject it, but that's a hack I'd prefer not to
>> > implement.  There should be a better way.
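>> >
>> > (The limits being referred to are presumably the MaxStoresPerMemcpy-style
>> > thresholds in TargetLoweringBase, so the hack would amount to something
>> > like the following in the AArch64TargetLowering constructor:
>> >
>> >   // Hack sketch only: make the generic inline expansion give up so that
>> >   // SelectionDAG falls through to EmitTargetCodeForMemcpy.
>> >   MaxStoresPerMemcpy = 0;
>> >   MaxStoresPerMemcpyOptSize = 0;
>> >
>> > which is indeed not something to commit.)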
>> >
>> > Thanks & Cheers,
>> > Sergey
>> >
>> > On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:
>> > >    Hi Sergey,
>> > >
>> > >    Thanks for working on this! The answer is slightly more involved,
>> > >    though, I think.
>> > >
>> > >    As shown in your testcase, your change emits the sequence "ldr; str;
>> > >    ldr; str".  The ideal expansion is "ldp; stp; ldp; stp".  That way we
>> > >    still do 128-bit loads and stores.
>> > >
>> > >    In fact, our microarchitects have recommended (through internal
>> > >    channels) that the "ldp; stp" sequence be used for memcpy-like
>> > >    operations - this will give portable performance.  Therefore, the
>> > >    change should also be made for at least A57.  I'll let Tim or Jim
>> > >    comment on Cyclone.
>> > >
>> > >    So to generate "ldp; stp", the inline memcpy expander needs to
>> > >    generate "ldr; ldr; str; str".  The ldp/stp pass will then squish
>> > >    these together.  A similar thing is done in the ARM target (where
>> > >    they get combined into LDRD or LDM), but it's ARM-only.  I think some
>> > >    logic needs to be moved into the target-independent part of codegen.
>> > >
>> > >    Cheers,
>> > >    James
>> > >
>> > >    On 15 July 2014 09:15, Sergey Dmitrouk <sdmitrouk at accesssoftek.com> wrote:
>> > >
>> > >      Hi,
>> > >
>> > >      Based on the following information from [this post][0] by James
>> > >      Molloy:
>> > >
>> > >        * Our inline memcpy expansion pass is emitting "LDR q0, [..];
>> > >          STR q0, [..]" pairs, which is less than ideal on A53.  If we
>> > >          switched to emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd
>> > >          get around 30% better inline memcpy performance on A53.  A57
>> > >          seems to deal well with the LDR q sequence.
>> > >
>> > >      I've made a patch (attached) that does this for Cortex-A53.  Please
>> > >      take a look at it.
>> > >
>> > >      Best regards,
>> > >      Sergey
>> > >
>> > >      0: http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269
>> > >