[PATCH][AArch64] Use 8-byte load&store for inlined memcpy() on Cortex A53

Sergey Dmitrouk sdmitrouk@accesssoftek.com
Wed Jul 16 07:33:56 PDT 2014


Hi again,

I was going through the code looking for what's wrong with the generic
load&store generation, but found the issue in the test instead.  Using
"undef" as the destination for "llvm.memcpy", as the other test did, is
a bad idea: the code generator actually produces meaningless code in
that case.  The destination pointer is not advanced for the store
instructions (the offset is just ignored), which prevents the ldp/stp
pass from squishing sequences of ldr/str instructions together.

So I fixed the test, and this time I provide two patches: one for
Cortex-A53 only and another one for all subtargets.  Please take a look
at the second one; I don't really like that it basically removes the
code for 128-bit types, and maybe there is a reason why it was put
there.

Also, the generated code doesn't always look like "ldp; stp; ldp; stp",
but can take either of these forms:
  ldp
  ...something...
  stp
or:
  ldp
  ldp
  stp
  stp
Is it OK for performance?

Regards,
Sergey

On Wed, Jul 16, 2014 at 12:08:51AM +0300, Sergey Dmitrouk wrote:
> Hi James,
> 
> Sorry, I missed that difference.  It shouldn't be hard to produce that
> ideal sequence for all targets, though.  One needs to provide a custom
> implementation of AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy, which
> was my initial approach to this task.  Code in that callback can generate
> any instructions, so I'll try to do that.
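
To elaborate on what I mean by a custom implementation: a rough sketch is
below.  This is not the attached patch; the signature is written from
memory of the TargetSelectionDAGInfo interface, so the exact parameters
are an assumption.  The only point is to emit plain i64 loads and stores
in a load-load-store-store order, so that the ldp/stp pass can pair them
afterwards:

  // Sketch only: emit i64 load/store pairs for small constant-size copies,
  // otherwise bail out and let the generic expansion or a libcall handle it.
  SDValue AArch64SelectionDAGInfo::EmitTargetCodeForMemcpy(
      SelectionDAG &DAG, SDLoc dl, SDValue Chain, SDValue Dst, SDValue Src,
      SDValue Size, unsigned Align, bool isVolatile, bool AlwaysInline,
      MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {
    ConstantSDNode *SizeC = dyn_cast<ConstantSDNode>(Size);
    if (!SizeC)
      return SDValue();
    uint64_t Bytes = SizeC->getZExtValue();
    // Keep the sketch simple: whole 16-byte chunks only, up to 64 bytes.
    if (Bytes == 0 || Bytes % 16 != 0 || Bytes > 64)
      return SDValue();

    EVT PtrVT = Src.getValueType();
    SmallVector<SDValue, 8> Stores;
    for (uint64_t Off = 0; Off != Bytes; Off += 16) {
      // Two adjacent i64 loads followed by two adjacent i64 stores; the
      // AArch64 load/store optimizer should later merge them into ldp/stp.
      // (Alignment is left as 0, i.e. the type's default, to keep it short.)
      SDValue Lo = DAG.getLoad(
          MVT::i64, dl, Chain,
          DAG.getNode(ISD::ADD, dl, PtrVT, Src, DAG.getConstant(Off, PtrVT)),
          SrcPtrInfo.getWithOffset(Off), isVolatile, false, false, 0);
      SDValue Hi = DAG.getLoad(
          MVT::i64, dl, Chain,
          DAG.getNode(ISD::ADD, dl, PtrVT, Src,
                      DAG.getConstant(Off + 8, PtrVT)),
          SrcPtrInfo.getWithOffset(Off + 8), isVolatile, false, false, 0);
      Stores.push_back(DAG.getStore(
          Lo.getValue(1), dl, Lo,
          DAG.getNode(ISD::ADD, dl, PtrVT, Dst, DAG.getConstant(Off, PtrVT)),
          DstPtrInfo.getWithOffset(Off), isVolatile, false, 0));
      Stores.push_back(DAG.getStore(
          Hi.getValue(1), dl, Hi,
          DAG.getNode(ISD::ADD, dl, PtrVT, Dst,
                      DAG.getConstant(Off + 8, PtrVT)),
          DstPtrInfo.getWithOffset(Off + 8), isVolatile, false, 0));
    }
    return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Stores);
  }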
> 
> The only issue is that SelectionDAG prefers its own load&store generator
> and invokes it before the custom one.  I don't know a good way to overcome
> that at the moment.  It's possible to set unreasonable limits for the
> instructions emitted by the generic load&store implementation, which would
> cause SelectionDAG to reject it and fall back to the custom hook, but
> that's a hack I'd prefer not to implement.  There should be a better way.
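
Concretely, the hack would be something along these lines in the
AArch64TargetLowering constructor (my reading of SelectionDAG::getMemcpy
is that a failed generic expansion falls through to the target hook and
then to the libcall, so treat that ordering as an assumption):

  // Hypothetical hack, not something I want to commit: a zero limit makes
  // the generic per-store expansion give up, so getMemcpy() falls through
  // to EmitTargetCodeForMemcpy (and, failing that, to the memcpy libcall).
  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 0;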
> 
> Thanks & Cheers,
> Sergey
> 
> On Tue, Jul 15, 2014 at 12:54:52PM -0700, James Molloy wrote:
> >    Hi Sergey,
> >    Thanks for working on this! The answer is slightly more involved though, I
> >    think.
> >    As shown in your testcase, your change emits the sequence "ldr; str; ldr;
> >    str". The ideal expansion is "ldp; stp; ldp; stp;". That way we still do
> >    128-bit loads and stores.
> >    In fact, our microarchitects have recommended (through internal channels)
> >    that the "ldp; stp" sequence be used for memcpy-like operations - this
> >    will give portable performance. Therefore, the change should also be made
> >    for at least A57. I'll let Tim or Jim comment on Cyclone.
> >    So to generate "ldp stp", the inline memcpy expander needs to generate
> >    "ldr; ldr; str; str;". The ldp/stp pass will then squish these together.
> >    A similar thing is done in the ARM target (which gets combined into LDRD
> >    or LDM), but it's ARM-only. I think some logic needs to be moved into the
> >    target-independent part of codegen.
> >    Cheers,
> >    James
> > 
> >    On 15 July 2014 09:15, Sergey Dmitrouk <sdmitrouk@accesssoftek.com> wrote:
> > 
> >      Hi,
> > 
> >      Based on the following information from [this post][0] by James Molloy:
> > 
> >        * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
> >          [..]" pairs, which is less than ideal on A53. If we switched to
> >          emitting "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30%
> >          better inline memcpy performance on A53. A57 seems to deal well with
> >          the LDR q sequence.
> > 
> >      I've made a patch (attached) that does this for Cortex-A53.  Please
> >      take a look at it.
> > 
> >      Best regards,
> >      Sergey
> > 
> >      0: http://article.gmane.org/gmane.comp.compilers.llvm.devel/74269
> > 
-------------- next part --------------
[AArch64] 64-bit load&store memcpy() on Cortex-A53

Cortex-A53 executes inlined memcpy faster with 64-bit registers than with
128-bit registers. Prefer 64-bit registers when inlining memcpy for
Cortex-A53.

diff --git a/lib/Target/AArch64/AArch64ISelLowering.cpp b/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7b77c59..9cb08b3 100644
--- a/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -367,8 +367,15 @@ AArch64TargetLowering::AArch64TargetLowering(TargetMachine &TM)
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
 
   MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
-  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
   MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;
+  // It's better to use 64-bit registers for memcpy inlining on Cortex-A53;
+  // 128-bit registers are optimal for other CPUs.  Thus the maximum number of
+  // stores for Cortex-A53 must be twice as large as for other subtargets.
+  if (Subtarget->isCortexA53()) {
+    MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 8;
+  } else {
+    MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
+  }
 
   setStackPointerRegisterToSaveRestore(AArch64::SP);
 
@@ -6155,17 +6162,22 @@ EVT AArch64TargetLowering::getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
                                                bool ZeroMemset,
                                                bool MemcpyStrSrc,
                                                MachineFunction &MF) const {
-  // Don't use AdvSIMD to implement 16-byte memset. It would have taken one
-  // instruction to materialize the v2i64 zero and one store (with restrictive
-  // addressing mode). Just do two i64 store of zero-registers.
-  bool Fast;
-  const Function *F = MF.getFunction();
-  if (Subtarget->hasFPARMv8() && !IsMemset && Size >= 16 &&
-      !F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
-                                       Attribute::NoImplicitFloat) &&
-      (memOpAlign(SrcAlign, DstAlign, 16) ||
-       (allowsUnalignedMemoryAccesses(MVT::f128, 0, &Fast) && Fast)))
-    return MVT::f128;
+  // Cortex-A53 executes inlined memcpy faster with 64-bit registers than
+  // with 128-bit registers.  Prefer 64-bit registers when inlining memcpy
+  // for Cortex-A53.
+  if (!Subtarget->isCortexA53() || IsMemset) {
+    // Don't use AdvSIMD to implement 16-byte memset. It would have taken one
+    // instruction to materialize the v2i64 zero and one store (with restrictive
+    // addressing mode). Just do two i64 store of zero-registers.
+    bool Fast;
+    const Function *F = MF.getFunction();
+    if (Subtarget->hasFPARMv8() && !IsMemset && Size >= 16 &&
+        !F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
+                                         Attribute::NoImplicitFloat) &&
+        (memOpAlign(SrcAlign, DstAlign, 16) ||
+         (allowsUnalignedMemoryAccesses(MVT::f128, 0, &Fast) && Fast)))
+      return MVT::f128;
+  }
 
   return Size >= 8 ? MVT::i64 : MVT::i32;
 }
diff --git a/lib/Target/AArch64/AArch64Subtarget.h b/lib/Target/AArch64/AArch64Subtarget.h
index 52124f6..3a865af 100644
--- a/lib/Target/AArch64/AArch64Subtarget.h
+++ b/lib/Target/AArch64/AArch64Subtarget.h
@@ -99,6 +99,8 @@ public:
 
   bool isTargetMachO() const { return TargetTriple.isOSBinFormatMachO(); }
 
+  bool isCortexA53() const { return CPUString == "cortex-a53"; }
+
   bool isCyclone() const { return CPUString == "cyclone"; }
 
   /// getMaxInlineSizeThreshold - Returns the maximum memset / memcpy size
diff --git a/test/CodeGen/AArch64/64bit-load-store-memcpy-on-cortex-a53.ll b/test/CodeGen/AArch64/64bit-load-store-memcpy-on-cortex-a53.ll
new file mode 100644
index 0000000..360e885
--- /dev/null
+++ b/test/CodeGen/AArch64/64bit-load-store-memcpy-on-cortex-a53.ll
@@ -0,0 +1,21 @@
+; RUN: llc < %s -march=aarch64 -mtriple=aarch64-linux-gnu -mcpu=cortex-a53 | FileCheck %s
+
+; marked as external to prevent possible optimizations
+@a = external global [4 x i32]
+@b = external global [4 x i32]
+
+define void @copy-16-bytes-with-8-byte-registers() {
+; CHECK-LABEL: @copy-16-bytes-with-8-byte-registers
+; CHECK: adrp
+; CHECK: add
+; CHECK: ldp
+; CHECK: adrp
+; CHECK: add
+; CHECK: stp
+; CHECK: ret
+entry:
+  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast ([4 x i32]* @a to i8*), i8* bitcast ([4 x i32]* @b to i8*), i64 16, i32 8, i1 false)
+  ret void
+}
+
+declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)
-------------- next part --------------
[AArch64] 64-bit pair load&store for inlined memcpy

It's optimal to use 64-bit registers with load/store pair instructions
for memcpy() inlining, rather than doing the same with regular
load/store instructions operating on 128-bit registers. Do not
use 128-bit types.

diff --git a/lib/Target/AArch64/AArch64ISelLowering.cpp b/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7b77c59..34c43e7 100644
--- a/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -367,7 +367,11 @@ AArch64TargetLowering::AArch64TargetLowering(TargetMachine &TM)
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
 
   MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
-  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
+  // It's optimal to use 64-bit registers with load/store pair instructions
+  // for memcpy() inlining, rather than regular load/store instructions
+  // operating on 128-bit registers.  Allow twice as many stores as for
+  // memmove().
+  MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 8;
   MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;
 
   setStackPointerRegisterToSaveRestore(AArch64::SP);
@@ -6155,18 +6159,9 @@ EVT AArch64TargetLowering::getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
                                                bool ZeroMemset,
                                                bool MemcpyStrSrc,
                                                MachineFunction &MF) const {
-  // Don't use AdvSIMD to implement 16-byte memset. It would have taken one
-  // instruction to materialize the v2i64 zero and one store (with restrictive
-  // addressing mode). Just do two i64 store of zero-registers.
-  bool Fast;
-  const Function *F = MF.getFunction();
-  if (Subtarget->hasFPARMv8() && !IsMemset && Size >= 16 &&
-      !F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
-                                       Attribute::NoImplicitFloat) &&
-      (memOpAlign(SrcAlign, DstAlign, 16) ||
-       (allowsUnalignedMemoryAccesses(MVT::f128, 0, &Fast) && Fast)))
-    return MVT::f128;
-
+  // It's optimal to use 64-bit registers with load/store pair instructions for
+  // memcpy() inlining, rather than doing the same with regular load/store
+  // instructions operating on 128-bit registers. Do not use 128-bit types.
   return Size >= 8 ? MVT::i64 : MVT::i32;
 }
 
diff --git a/test/CodeGen/AArch64/64bit-load-store-pairs-for-inlined-memcpy.ll b/test/CodeGen/AArch64/64bit-load-store-pairs-for-inlined-memcpy.ll
new file mode 100644
index 0000000..360e885
--- /dev/null
+++ b/test/CodeGen/AArch64/64bit-load-store-pairs-for-inlined-memcpy.ll
@@ -0,0 +1,21 @@
+; RUN: llc < %s -march=aarch64 -mtriple=aarch64-linux-gnu | FileCheck %s
+
+; marked as external to prevent possible optimizations
+@a = external global [4 x i32]
+@b = external global [4 x i32]
+
+define void @copy-16-bytes-with-8-byte-registers() {
+; CHECK-LABEL: @copy-16-bytes-with-8-byte-registers
+; CHECK: adrp
+; CHECK: add
+; CHECK: ldp
+; CHECK: adrp
+; CHECK: add
+; CHECK: stp
+; CHECK: ret
+entry:
+  tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast ([4 x i32]* @a to i8*), i8* bitcast ([4 x i32]* @b to i8*), i64 16, i32 8, i1 false)
+  ret void
+}
+
+declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)

