[PATCH] Thumb2: Modify codegen for memcpy intrinsic to prefer LDM/STM.

John Brawn john.brawn at arm.com
Wed May 6 05:48:59 PDT 2015


This looks like a generally good idea to me, though there are some points that need to be looked at (in the inline comments).

If this were to be committed, it does make me wonder what should become of the load/store optimizer. Perhaps ARMPreAllocLoadStoreOpt should be made to introduce MCOPY? That's something for another day though.


================
Comment at: lib/Target/ARM/ARMISelLowering.cpp:7567
@@ +7566,3 @@
+                       SDNode *Node) {
+  bool isThumb2 = Subtarget->isThumb2();
+  const ARMBaseInstrInfo *TII = Subtarget->getInstrInfo();
----------------
The assumption here is that !isThumb2 means ARM, but it could also mean Thumb1, so for a Thumb1 target we would emit invalid (ARM-mode) instructions. Either LowerMCOPY should handle Thumb1, or we shouldn't be turning memcpy into MCOPY for Thumb1 in the first place.

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:28
@@ -27,3 +27,3 @@
 SDValue
 ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(SelectionDAG &DAG, SDLoc dl,
                                              SDValue Chain,
----------------
EmitTargetCodeForMemcpy is called by SelectionDAG::getMemcpy only when getMemcpyLoadsAndStores fails to generate a load/store sequence. ARMTargetLowering::ARMTargetLowering currently sets MaxStoresPerMemcpy to 4, so this function will only be triggered for memcpys of more than 16 bytes. If MCOPY gives better results than individual loads and stores, then maybe MaxStoresPerMemcpy should be lowered to 0 so that this function is always used?

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:67-68
@@ +66,4 @@
+  while (EmittedNumMemOps < NumMemOps - 1) {
+    // Use up to MAX_LOADS_IN_LDM registers per mcopy. The mcopys will get
+    // lowered into ldm/stm later on.
+    unsigned NumRegs = std::min(MAX_LOADS_IN_LDM, NumMemOps - EmittedNumMemOps);
----------------
If the number of words to be copied is not an exact multiple of MAX_LOADS_IN_LDM, then splitting it up in this way may not be the best idea. Consider, for example, a copy of 7 words. Splitting it into 6+1 means that the total number of registers that need to be available is 8 (source, dest, 6-register list), but if we were to split it up as 3+4 then only 6 registers are needed (or maybe 5 if the source and dest are dead after the memcpy). That would reduce register pressure and in some cases allow fewer callee-saved registers to need to be saved.

Of course that's the current behaviour as well. Lumping everything together into an MCOPY may also make things harder for the register allocator, which would have had more freedom in the case of individual loads and stores, but I don't know enough about LLVM's register allocation to say whether that's actually true, or whether it ever turns out to be a problem.

================
Comment at: lib/Target/ARM/ARMSelectionDAGInfo.cpp:87
@@ -96,3 +86,3 @@
 
-  // Issue loads / stores for the trailing (1 - 3) bytes.
+  // Issue loads / stores for the trailing (1 - 7) bytes.
   unsigned BytesLeftSave = BytesLeft;
----------------
I was a bit confused about why this is 1-7 and not 1-3 like before, but it looks like you can get more than 3 trailing bytes when (SizeVal % (MAX_LOADS_IN_LDM*4)) is between 4 and 7: some number of MCOPYs of MAX_LOADS_IN_LDM size are emitted and exactly one word remains, and since we don't want a single-register LDM that word joins the trailing bytes, leaving up to 7 of them.

Maybe it would be clearer if the calculation of BytesLeft were done in one go after the MCOPY generation instead of being split around it, possibly with a more explanatory comment.

http://reviews.llvm.org/D9508

More information about the llvm-commits mailing list