[PATCH] Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM.

Tue May 26 16:57:48 PDT 2015

================
Comment at: lib/Target/ARM/ARMISelLowering.cpp:7567
@@ +7566,3 @@
+                       SDNode *Node) {
+  bool isThumb2 = Subtarget->isThumb2();
+  const ARMBaseInstrInfo *TII = Subtarget->getInstrInfo();
----------------
john.brawn wrote:
> The assumption here is that !isThumb2 means ARM, but it could also mean Thumb1. This means for a Thumb1 target we emit invalid instructions. Either LowerMCOPY should handle Thumb1, or we shouldn't be turning memcpy into MCOPY for Thumb1.
Done

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:28
@@ -27,3 +27,3 @@
 SDValue
 ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(SelectionDAG &DAG, SDLoc dl,
                                              SDValue Chain,
----------------
john.brawn wrote:
> EmitTargetCodeForMemcpy is called by SelectionDAG::getMemcpy only when getMemcpyLoadsAndStores fails to generate a load/store sequence. ARMTargetLowering::ARMTargetLowering currently sets MaxStoresPerMemcpy to 4, so this function will only be triggered for memcpys of >16 bytes. If MCOPY gives better results than individual loads and stores then maybe that should be lowered to 0 so that this function is always used?
If the memcpy is short enough (and the target supports it) getMemcpyLoadsAndStores use the extension registers, which normally ends up being shorter due to less pressure on the general-purpose registers, so I left this as is in order to take advantage of that.

(Ideally I think we should have something like a VMCOPY pseudo-instruction that lowers to VLDM/VSTM, but this seems harder to map onto the register allocator's view of the world, as VLDM/VSTM take a consecutive register range.)

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:67-68
@@ +66,4 @@
+  while (EmittedNumMemOps < NumMemOps - 1) {
+    // Use up to MAX_LOADS_IN_LDM registers per mcopy. The mcopys will get
+    // lowered into ldm/stm later on.
+    unsigned NumRegs = std::min(MAX_LOADS_IN_LDM, NumMemOps - EmittedNumMemOps);
----------------
john.brawn wrote:
> If the number of words to be copied is not an exact multiple of MAX_LOADS_IN_LDM, then splitting it up in this way may not be the best idea. Consider, for example, a copy of 7 words. Splitting it into 6+1 means that the total registers that need to be available is 8 (source, dest, 6 reg list), but if we were to split it up as 3+4 then the total registers that need to be available is 6 (or maybe 5 if the source and dest are dead after the memcpy). That would reduce register pressure and in some cases allow fewer callee-saved registers to need to be saved.
> 
> Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator where it may have had more freedom in the case of individual loads and stores, but I don't know enough about LLVM's register allocation to know if that's actually true or not, or if it ever turns out to be a problem.
The code now calculates the number of LDM/STMs we need anyway, and divides the registers evenly among them.

> Of course that's the current behaviour as well, but lumping everything together into an MCOPY may make things harder for the register allocator where it may have had more freedom in the case of individual loads and stores, but I don't know enough about LLVM's register allocation to know if that's actually true or not, or if it ever turns out to be a problem.

I'd be surprised if it were a problem right now. The register allocator is basically solving the same problem except distributed over a smaller number of instructions.

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:87
@@ -96,3 +86,3 @@
 
-  // Issue loads / stores for the trailing (1 - 3) bytes.
+  // Issue loads / stores for the trailing (1 - 7) bytes.
   unsigned BytesLeftSave = BytesLeft;
----------------
john.brawn wrote:
> I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. some number of MCOPYs of MAX_LOADS_IN_LDM size are emitted then we don't want a 1-byte LDM so we have 7 bytes left.
> 
> Maybe it would be clearer if the calculation of BytesLeft were done in one go after the MCOPY generation instead of split around it. And possibly a more explanatory comment.
> I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7, i.e. some number of MCOPYs of MAX_LOADS_IN_LDM size are emitted then we don't want a 1-byte LDM so we have 7 bytes left.

Right. Now that we're distributing the registers among the MCOPYs, we save on overall register pressure when emitting the extra LDM/STM pair.

http://reviews.llvm.org/D9508

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/