[PATCH] [AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers

Sergey Dmitrouk sdmitrouk at accesssoftek.com
Tue Nov 4 00:38:09 PST 2014


> It sounds much more like we're hitting a scheduler defect that we want to fix properly instead, unless it's a constraint that's just impossible to represent.

The scheduler seems to do its job correctly in the generic case, but it appears to be missing
information about instruction operands. Here it could ignore the latency of an `ldp` when
it is followed by an `stp` with the same operands.
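
For what it's worth, if the scheduler were taught about this, one conceivable place is a
target hook along the lines of `TargetSubtargetInfo::adjustSchedDependency`. The override
below is only a sketch of the idea, not part of this patch; AArch64Subtarget does not
implement it here, and the chosen forwarding latency is made up:

```
// Hypothetical sketch (inside AArch64Subtarget.cpp), not part of D6054:
// shrink the scheduling latency of an ldp whose results are only re-stored
// by an stp, assuming load-to-store forwarding hides most of the latency.
void AArch64Subtarget::adjustSchedDependency(SUnit *Def, SUnit *Use,
                                             SDep &Dep) const {
  if (Dep.getKind() != SDep::Data || !Def->getInstr() || !Use->getInstr())
    return;
  const MachineInstr *Ld = Def->getInstr();
  const MachineInstr *St = Use->getInstr();
  if (Ld->getOpcode() == AArch64::LDPXi && St->getOpcode() == AArch64::STPXi)
    Dep.setLatency(1); // assumed forwarding latency, purely illustrative
}
```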

> Perhaps some kind of forwarding from a load to a dependent store has been omitted?

I tried gluing and/or combining nodes in a lot of ways; the scheduler doesn't care about any
of these. Another way would be to introduce a pseudo-instruction and expand it after
scheduling, but that requires temporary registers, and it's too late to allocate registers
at that point.

Regards,
Sergey

================
Comment at: lib/Target/AArch64/AArch64ISelLowering.cpp:6615-6620
@@ -6606,12 +6614,8 @@
                                                MachineFunction &MF) const {
-  // Don't use AdvSIMD to implement 16-byte memset. It would have taken one
-  // instruction to materialize the v2i64 zero and one store (with restrictive
-  // addressing mode). Just do two i64 store of zero-registers.
-  bool Fast;
-  const Function *F = MF.getFunction();
-  if (Subtarget->hasFPARMv8() && !IsMemset && Size >= 16 &&
-      !F->getAttributes().hasAttribute(AttributeSet::FunctionIndex,
-                                       Attribute::NoImplicitFloat) &&
-      (memOpAlign(SrcAlign, DstAlign, 16) ||
-       (allowsMisalignedMemoryAccesses(MVT::f128, 0, 1, &Fast) && Fast)))
-    return MVT::f128;
+  // In general it's optimal to use 64-bit registers with load/store pair
+  // instructions for memcpy() inlining, rather than doing the same with regular
+  // load/store instructions operating on 128-bit registers. Do not use 128-bit
+  // types.
+
+  if (Subtarget->isCyclone()) {
+    // Don't use AdvSIMD to implement 16-byte memset. It would have taken one
----------------
t.p.northover wrote:
> How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future.
> 
> It also seems like it belongs in a completely separate patch to the interleaving one.
> How general is this? We should be writing for future cores as well as existing ones, and always preferring 64-bit operations seems like it'll be more and more of an oddity in future.

If I understand it correctly, the problem with 128-bit registers is that they are floating-point registers rather than general-purpose ones, so as long as there are no 128-bit GP registers, this should hold.

> It also seems like it belongs in a completely separate patch to the interleaving one.

The pass is there to allow better code generation for `memcpy()`, but the two can technically be separated (the pass can go first, maybe with slightly changed tests).
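
For reference, the overall shape of `getOptimalMemOpType()` after this change is roughly
the following. The quoted hunk above is truncated, so the non-Cyclone fall-through here is
my reading of the intent rather than a verbatim quote of the patch:

```
EVT AArch64TargetLowering::getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
                                               unsigned SrcAlign, bool IsMemset,
                                               bool ZeroMemset, bool MemcpyStrSrc,
                                               MachineFunction &MF) const {
  if (Subtarget->isCyclone()) {
    // Original behaviour: prefer one f128 (AdvSIMD) access for large,
    // sufficiently aligned copies, except for 16-byte memset and
    // NoImplicitFloat functions.
    // ... (unchanged logic elided)
  }
  // Otherwise prefer 64-bit GPR accesses; the load/store optimizer can then
  // merge them into ldp/stp pairs.
  bool Fast;
  if (Size >= 8 &&
      (memOpAlign(SrcAlign, DstAlign, 8) ||
       (allowsMisalignedMemoryAccesses(MVT::i64, 0, 1, &Fast) && Fast)))
    return MVT::i64;
  return MVT::Other;
}
```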

================
Comment at: lib/Target/AArch64/AArch64LoadStoreInterleave.cpp:242-245
@@ +241,6 @@
+// Checks if a set of load and store instructions can be safely reordered.
+static bool isSafeToReorder(MachineBasicBlock &MBB,
+                            const SmallVectorImpl<MachineInstr*> &Lds,
+                            const SmallVectorImpl<MachineInstr*> &Sts,
+                            const TargetRegisterInfo *TRI) {
+  if (Sts.empty() || Sts.size() != Lds.size())
----------------
t.p.northover wrote:
> This seems like a really fragile way to do this. It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.
> It's only ever going to work on a basic block with a single memcpy operation and no other loads/stores.

The condition isn't exactly that one, but it does have a similar constraint. The next revision, which works on pairs of instructions, should change this.
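
As a rough idea of what a per-pair check could look like (the helper name and shape are
hypothetical, not the code in the current revision): a store could be hoisted to sit right
after the load that defines its data registers as long as nothing in between writes memory,
has unmodeled side effects, or redefines the store's operands, with other loads of the same
`memcpy()` expansion tolerated because its source and destination cannot overlap.

```
// Hypothetical per-pair check (inside AArch64LoadStoreInterleave.cpp),
// not the code under review.
static bool isSafeToHoistStore(MachineBasicBlock::iterator Ld,
                               MachineBasicBlock::iterator St,
                               const SmallPtrSetImpl<MachineInstr *> &MemcpyLds,
                               const TargetRegisterInfo *TRI) {
  for (MachineBasicBlock::iterator I = std::next(Ld); I != St; ++I) {
    // Only the other loads of the same memcpy expansion may sit in between;
    // they cannot alias the store because memcpy operands do not overlap.
    if (I->mayLoad() && !MemcpyLds.count(&*I))
      return false;
    if (I->mayStore() || I->hasUnmodeledSideEffects())
      return false;
    // The store must not be moved above a redefinition of any register it
    // reads (its data or address operands).
    for (const MachineOperand &MO : St->operands())
      if (MO.isReg() && MO.isUse() && I->modifiesRegister(MO.getReg(), TRI))
        return false;
  }
  return true;
}
```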

http://reviews.llvm.org/D6054
