[libc-commits] [PATCH] D92236: [LIBC] Add optimized memcpy routine for AArch64

Wed Jan 20 02:14:24 PST 2021

avieira updated this revision to Diff 317812.
avieira added a comment.

Hi,

Here is an updated version of the optimized memcpy routine for AArch64. It uses the same structure as the default memcpy, but picks a different block size and alignment for copies of 128 bytes or more.
I also disabled tail merging, as I found it was leading to worse code. This new memcpy seems to show improvements across the board for both sweep and distribution benchmarks.
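
As a rough illustration of the large-copy path (not the code in this patch), the sketch below shows one way a "first block, aligned loop, last block" copy can be written. The names `copy_block_sketch` and `copy_aligned_blocks_sketch` are hypothetical, and reading `CopyAlignedBlocks<64,16>` as "64-byte blocks, destination aligned to 16 bytes" is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy exactly N bytes. A fixed-size std::memcpy is
// typically lowered by the compiler to a handful of loads and stores.
template <std::size_t N>
inline void copy_block_sketch(char *dst, const char *src) {
  std::memcpy(dst, src, N);
}

// Sketch only: copies `count` bytes as
//   1. one unaligned block of Align bytes at the start,
//   2. a loop of BlockSize-byte blocks whose destination is Align-aligned,
//   3. one BlockSize-byte block ending exactly at dst + count.
// Blocks may overlap; the buffers themselves must not overlap.
// Assumes count >= BlockSize and Align <= BlockSize.
template <std::size_t BlockSize, std::size_t Align>
void copy_aligned_blocks_sketch(char *dst, const char *src, std::size_t count) {
  static_assert(Align <= BlockSize, "alignment must not exceed block size");
  assert(count >= BlockSize);

  copy_block_sketch<Align>(dst, src); // first, possibly unaligned, block

  // Distance from the previous Align-aligned address to dst.
  const std::size_t misalign = reinterpret_cast<std::uintptr_t>(dst) % Align;
  const std::size_t limit = count + misalign - BlockSize;
  for (std::size_t offset = Align; offset < limit; offset += BlockSize)
    copy_block_sketch<BlockSize>(dst - misalign + offset,
                                 src - misalign + offset);

  // Last block, overlapping whatever the loop already wrote.
  copy_block_sketch<BlockSize>(dst + count - BlockSize,
                               src + count - BlockSize);
}
```

Calling `copy_aligned_blocks_sketch<64, 16>(dst, src, count)` with `count >= 128` would correspond to the parameters chosen here, under the assumption above.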

I am continuing to investigate a better organization of the copies smaller than 128 bytes, as I had before, using the new benchmarks. Using the same code I had before, I see an improvement in Uniform1024 (a new uniform distribution I added for sizes 0-1024). I also see an improvement in Memcpy Distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D, the optimized version beats the older version but shows a regression compared to the version in this patch.
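
For context on the sub-128-byte path, the patch dispatches on size ranges and copies each range with two blocks that may overlap. The sketch below illustrates that idea with hypothetical names (`copy_overlap_sketch`, `copy_small_sketch`); it is not the libc helper itself. For example, a 13-byte copy with 8-byte blocks writes bytes 0-7 and then bytes 5-12.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch only: copy `count` bytes, N <= count <= 2 * N, using two N-byte
// blocks. The second block ends exactly at dst + count and may overlap the
// first, so no per-byte tail loop is needed.
template <std::size_t N>
inline void copy_overlap_sketch(char *dst, const char *src, std::size_t count) {
  assert(count >= N && count <= 2 * N);
  std::memcpy(dst, src, N);
  std::memcpy(dst + count - N, src + count - N, N);
}

// Hypothetical dispatcher for 4 <= count < 128, with the same size
// thresholds as the patch. Smaller sizes are tested first.
inline void copy_small_sketch(char *dst, const char *src, std::size_t count) {
  assert(count >= 4 && count < 128);
  if (count < 8)
    return copy_overlap_sketch<4>(dst, src, count);
  if (count < 16)
    return copy_overlap_sketch<8>(dst, src, count);
  if (count < 32)
    return copy_overlap_sketch<16>(dst, src, count);
  if (count < 64)
    return copy_overlap_sketch<32>(dst, src, count);
  return copy_overlap_sketch<64>(dst, src, count);
}
```

Checking the smallest ranges first keeps the number of branches taken lowest for small copies, which the design rationale in the patch identifies as the most common case.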

I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Is this OK for main?

Also I have two patches downstream for:

1. Uniform1024 distribution, a uniform distribution for sizes 0-1024
2. Options to define a Sweep 'min size' and 'step'.
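
To make the two items concrete, here is a minimal sketch of how such size generators could look. This is not the downstream code: the function names are hypothetical, and treating "0-1024" and the sweep's upper bound as inclusive is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical: draw `n` copy sizes uniformly from [0, 1024] (inclusive
// bounds are an assumption). A fixed seed keeps runs reproducible.
std::vector<std::size_t> uniform_sizes_sketch(std::size_t n, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(0, 1024);
  std::vector<std::size_t> sizes;
  sizes.reserve(n);
  for (std::size_t i = 0; i < n; ++i)
    sizes.push_back(dist(rng));
  return sizes;
}

// Hypothetical: sizes min_size, min_size + step, ... not exceeding max_size.
std::vector<std::size_t> sweep_sizes_sketch(std::size_t min_size,
                                            std::size_t max_size,
                                            std::size_t step) {
  assert(step > 0 && min_size <= max_size);
  std::vector<std::size_t> sizes;
  for (std::size_t s = min_size;; s += step) {
    sizes.push_back(s);
    if (max_size - s < step) // next size would pass max_size (or overflow)
      break;
  }
  return sizes;
}
```

A benchmark driver could then iterate over either vector and time one copy per size.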

Let me know if you are interested in either of these.

Kind regards,
Andre


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D92236/new/

https://reviews.llvm.org/D92236

Files:
  libc/src/string/CMakeLists.txt
  libc/src/string/aarch64/CMakeLists.txt
  libc/src/string/aarch64/memcpy.cpp


Index: libc/src/string/aarch64/memcpy.cpp
===================================================================

--- /dev/null
+++ libc/src/string/aarch64/memcpy.cpp
@@ -0,0 +1,67 @@
+//===-- Implementation of memcpy ------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "src/string/memcpy.h"
+#include "src/__support/common.h"
+#include "src/string/memory_utils/memcpy_utils.h"
+
+namespace __llvm_libc {
+
+// Design rationale
+// ================
+//
+// Using a profiler to observe size distributions for calls into libc
+// functions, it was found most operations act on a small number of bytes.
+// This makes it important to favor small sizes.
+//
+// We have used __builtin_expect to tell the compiler to favour lower sizes as
+// that will reduce the branching overhead where that would hurt most
+// proportional to total cost of copying.
+//
+// The function is written in C++ for several reasons:
+// - The compiler can __see__ the code, this is useful when performing Profile
+//   Guided Optimization as the optimized code can take advantage of branching
+//   probabilities.
+// - It also allows for easier customization and favors testing multiple
+//   implementation parameters.
+// - As compilers and processors get better, the generated code is improved
+//   with little change on the code side.
+static void memcpy_aarch64(char *__restrict dst, const char *__restrict src,
+                           size_t count) {
+  if (count == 0)
+    return;
+  if (count == 1)
+    return CopyBlock<1>(dst, src);
+  if (count == 2)
+    return CopyBlock<2>(dst, src);
+  if (count == 3)
+    return CopyBlock<3>(dst, src);
+  if (count == 4)
+    return CopyBlock<4>(dst, src);
+  if (count < 8)
+    return CopyBlockOverlap<4>(dst, src, count);
+  if (count < 16)
+    return CopyBlockOverlap<8>(dst, src, count);
+  if (count < 32)
+    return CopyBlockOverlap<16>(dst, src, count);
+  if (count < 64)
+    return CopyBlockOverlap<32>(dst, src, count);
+  if (count < 128)
+    return CopyBlockOverlap<64>(dst, src, count);
+  return CopyAlignedBlocks<64,16>(dst, src, count);
+}
+
+LLVM_LIBC_FUNCTION(void *, memcpy,
+                   (void *__restrict dst, const void *__restrict src,
+                    size_t size)) {
+  memcpy_aarch64(reinterpret_cast<char *>(dst),
+                 reinterpret_cast<const char *>(src), size);
+  return dst;
+}
+
+} // namespace __llvm_libc
Index: libc/src/string/aarch64/CMakeLists.txt
===================================================================
--- /dev/null
+++ libc/src/string/aarch64/CMakeLists.txt
@@ -0,0 +1 @@
+add_memcpy("memcpy_${LIBC_TARGET_MACHINE}")
Index: libc/src/string/CMakeLists.txt
===================================================================
--- libc/src/string/CMakeLists.txt
+++ libc/src/string/CMakeLists.txt
@@ -215,6 +215,11 @@
 if(${LIBC_TARGET_MACHINE} STREQUAL "x86_64")
   set(LIBC_STRING_TARGET_ARCH "x86")
   set(MEMCPY_SRC ${LIBC_SOURCE_DIR}/src/string/x86/memcpy.cpp)
+elseif(${LIBC_TARGET_MACHINE} STREQUAL "aarch64")
+  set(LIBC_STRING_TARGET_ARCH "aarch64")
+  set(MEMCPY_SRC ${LIBC_SOURCE_DIR}/src/string/aarch64/memcpy.cpp)
+#Disable tail merging as it leads to lower performance
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mllvm --tail-merge-threshold=0")
 else()
   set(LIBC_STRING_TARGET_ARCH ${LIBC_TARGET_MACHINE})
   set(MEMCPY_SRC ${LIBC_SOURCE_DIR}/src/string/memcpy.cpp)

