[PATCH] D26191: [ARM] Patch to improve memcpy lined assembly sequence.

Tue Nov 1 07:11:12 PDT 2016

rs created this revision.
rs added reviewers: rengolin, t.p.northover.
rs added a subscriber: llvm-commits.
Herald added subscribers: mgorny, aemerson.

A memcpy of 64 bytes or fewer (see ARMSubtarget::getMaxInlineSizeThreshold) can be inlined into a sequence of loads and stores; if the number of bytes is greater than 64, the memcpy library function is called instead.

For a copy where the number of bytes is a multiple of 4, memcpy inlining can load the source one word at a time using the ldr instruction, or several words at once using the ldm instruction, and then store the words to the destination using str, or several words at once using stm. In most cases the optimal sequence is the one that loads and stores multiple words with ldm/stm, since a single ldm/stm is faster than a series of one-word loads and stores.

When the number of bytes to copy isn't a multiple of 4, memcpy inlining ends up using ldrb/strb if the remainder is 1, ldrh/strh if the remainder is 2, and ldrb/strb plus ldrh/strh if the remainder is 3. Had the number of bytes been a multiple of 4, the backend would have been able to fold the final load/store into a preceding ldm/stm instruction, or just use an ldr/str. When the remainder is 1 or 2 bytes, the ldr/str takes the same time as the 1-byte ldrb/strb or the 2-byte ldrh/strh, but when the remainder is 3 bytes a single ldr/str is much better than the ldrb/strb/ldrh/strh sequence. Also, even if the remainder is only 1 or 2 bytes, when multiple words are being copied the backend may already have decided to collapse the loads/stores into a multi-word ldm/stm operation.
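
To make the remainder cases concrete, here is a simplified model of how a copy size splits into word, halfword and byte accesses. It is only a sketch: the names AccessPlan, planCopy and tailPairs are hypothetical, and the real backend may pick a different mix of instructions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model, not LLVM code: how many accesses of each width
// a copy of numBytes needs when the widest access is always tried first.
struct AccessPlan {
  std::size_t words;     // 4-byte ldr/str (or slots in an ldm/stm)
  std::size_t halfwords; // 2-byte ldrh/strh
  std::size_t bytes;     // 1-byte ldrb/strb
};

constexpr AccessPlan planCopy(std::size_t numBytes) {
  return AccessPlan{numBytes / 4, (numBytes % 4) / 2, numBytes % 2};
}

// Number of extra load/store pairs needed for the non-word tail.
constexpr std::size_t tailPairs(std::size_t numBytes) {
  return planCopy(numBytes).halfwords + planCopy(numBytes).bytes;
}

static_assert(tailPairs(16) == 0, "multiple of 4: no tail");
static_assert(tailPairs(17) == 1, "remainder 1: one ldrb/strb");
static_assert(tailPairs(18) == 1, "remainder 2: one ldrh/strh");
static_assert(tailPairs(19) == 2, "remainder 3: ldrh/strh plus ldrb/strb");
```

In this model a 3-byte remainder costs two load/store pairs, whereas padding to the next word would cost at most one ldr/str.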

This patch tries to implement this simple optimization by padding the destination and source and increasing the number of bytes copied by the memcpy to a multiple of 4 bytes (a whole number of words).
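
The size adjustment amounts to rounding the byte count up to the next multiple of 4. A minimal sketch, assuming a 4-byte word; paddedSize and kWordBytes are hypothetical names and are not taken from the patch:

```cpp
#include <cassert>
#include <cstddef>

// Assumed word size for a 32-bit ARM core.
constexpr std::size_t kWordBytes = 4;

// Round numBytes up to the next multiple of kWordBytes.
// Sizes that are already a multiple of 4 are returned unchanged.
constexpr std::size_t paddedSize(std::size_t numBytes) {
  return (numBytes + kWordBytes - 1) / kWordBytes * kWordBytes;
}

static_assert(paddedSize(16) == 16, "already word-sized");
static_assert(paddedSize(17) == 20, "1 trailing byte becomes a full word");
static_assert(paddedSize(19) == 20, "3 trailing bytes become a full word");
```

The destination and source objects would have to grow by the same amount, so that the extra bytes read and written stay inside both objects.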

The patch implements a pass that looks for the memcpy intrinsic and uses the simple heuristic below to decide whether to pad the destination/source:

1. Is the destination a stack-allocated constant array?
2. Is the source a constant?
3. Is the number of bytes to copy a constant?
4. Is destination array size == constant source size == number of bytes?

If the answer to all of these questions is yes, the pass pads the destination/source and increases the number of bytes to copy in the memcpy operation.
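
The four questions can be read as a single predicate. The sketch below is an illustration only: MemcpyFacts and shouldPad are hypothetical names, and the struct stands in for whatever the pass actually extracts from the IR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one memcpy call site; these are not LLVM types.
struct MemcpyFacts {
  bool destIsStackAllocatedArray; // question 1
  bool srcIsConstant;             // question 2
  bool lengthIsConstant;          // question 3
  std::size_t destBytes;          // size of the destination array
  std::size_t srcBytes;           // size of the constant source
  std::size_t lengthBytes;        // number of bytes passed to memcpy
};

// Pad only when every question is answered "yes", including
// question 4: all three sizes must be equal.
constexpr bool shouldPad(const MemcpyFacts &f) {
  return f.destIsStackAllocatedArray && f.srcIsConstant &&
         f.lengthIsConstant && f.destBytes == f.srcBytes &&
         f.srcBytes == f.lengthBytes;
}
```

For example, a 7-byte constant string copied in full into a 7-byte local array would qualify under this model, while a copy whose length differs from the destination size would be left alone.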

The pass is implemented as a midend IR-level pass but is only added when the target is a 32-bit ARM core. It's implemented as a midend pass, rather than as a backend pass or in ARMTargetLowering::getOptimalMemOpType or ARMSelectionDAGInfo::EmitTargetCodeForMemcpy, because in previous attempts I found that it wasn't possible to pad the source/destination, as the IR objects at that level were immutable. It might be possible for me to pad the SelectionDAG nodes, but I would have to implement some analysis to make sure it's safe to do so. Maybe adding a midend analysis pass that propagates information to the backend about which memcpys are safe to pad might work? If you want to see my previous attempt at doing it in ARMSelectionDAGInfo::EmitTargetCodeForMemcpy, let me know. Ideally I would have liked to do this optimization close to where the memcpy is going to be inlined.

https://reviews.llvm.org/D26191

Files:
  lib/Target/ARM/ARM.h
  lib/Target/ARM/ARMPadMemcpyPass.cpp
  lib/Target/ARM/ARMTargetMachine.cpp
  lib/Target/ARM/CMakeLists.txt
  test/CodeGen/ARM/arm-pad-memcpy-lengths-dont-match.ll
  test/CodeGen/ARM/arm-pad-memcpy-more-than-64-bytes.ll
  test/CodeGen/ARM/arm-pad-memcpy-strings-test1.ll
  test/CodeGen/ARM/arm-pad-memcpy-strings-test2.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D26191.76552.patch
Type: text/x-patch
Size: 19006 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20161101/87cd0bc8/attachment-0001.bin>