[libc-commits] [libc] [libc][x86] copy one cache line at a time to prevent the use of `rep; movsb` (PR #113161)

Guillaume Chatelet via libc-commits libc-commits at lists.llvm.org
Mon Oct 21 06:04:14 PDT 2024


https://github.com/gchatelet created https://github.com/llvm/llvm-project/pull/113161

When using `-mprefer-vector-width=128` with `-march=sandybridge`, copying 3 cache lines in one go (192B) gets converted into `rep;movsb`, which translates into a 60% performance hit.

Consecutive calls to `__builtin_memcpy_inline` (the implementation behind `builtin::Memcpy::block_offset`) are not coalesced by the compiler, so calling it three times in a row generates the desired assembly. The resulting assembly differs from the original only in the interleaving of the loads and stores, which does not affect performance.
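
For illustration, here is a minimal standalone sketch of the two patterns (the `kCacheline` constant and the function names are hypothetical; the actual change goes through `builtin::Memcpy<N>::block_offset`):

```cpp
#include <stddef.h>

constexpr size_t kCacheline = 64; // hypothetical constant for this sketch

// Before: one 192B copy. On non-AVX targets the compiler may lower this
// single large `__builtin_memcpy_inline` into `rep;movsb`.
void copy_three_lines_at_once(char *__restrict dst,
                              const char *__restrict src, size_t offset) {
  __builtin_memcpy_inline(dst + offset, src + offset, 3 * kCacheline);
}

// After: three consecutive 64B copies. The calls are not coalesced, so each
// one stays a short run of vector loads and stores. Note that
// `__builtin_memcpy_inline` requires a constant size, which `kCacheline`
// satisfies even inside the loop.
void copy_three_lines_one_at_a_time(char *__restrict dst,
                                    const char *__restrict src,
                                    size_t offset) {
  for (size_t i = 0; i < 3; ++i, offset += kCacheline)
    __builtin_memcpy_inline(dst + offset, src + offset, kCacheline);
}
```

Compiling both functions with `clang++ -O2 -march=sandybridge -mprefer-vector-width=128 -S` and scanning the assembly for `movsb` should make the difference visible.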

This is needed to reland https://github.com/llvm/llvm-project/pull/108939.


From 498dedeec762b32aa8c65205f6df207d7e8657ec Mon Sep 17 00:00:00 2001
From: Guillaume Chatelet <gchatelet at google.com>
Date: Mon, 21 Oct 2024 13:03:38 +0000
Subject: [PATCH] [libc][x86] copy one cache line at a time to prevent the use
 of `rep;movsb`

When using `-mprefer-vector-width=128` with `-march=sandybridge`, copying 3 cache lines in one go (192B) gets converted into `rep;movsb`, which translates into a 60% performance hit.

Consecutive calls to `__builtin_memcpy_inline` (the implementation behind `builtin::Memcpy::block_offset`) are not coalesced by the compiler, so calling it three times in a row generates the desired assembly. The resulting assembly differs from the original only in the interleaving of the loads and stores, which does not affect performance.

This is needed to reland https://github.com/llvm/llvm-project/pull/108939.
---
 .../string/memory_utils/x86_64/inline_memcpy.h  | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/libc/src/string/memory_utils/x86_64/inline_memcpy.h b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
index 2b2c6e6fbc5466..68f64fb1a5023b 100644
--- a/libc/src/string/memory_utils/x86_64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
@@ -98,8 +98,9 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
     while (offset + K_TWO_CACHELINES + 32 <= count) {
       inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
       inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
-      builtin::Memcpy<K_TWO_CACHELINES>::block_offset(dst, src, offset);
-      offset += K_TWO_CACHELINES;
+      // Copy one cache line at a time to prevent the use of `rep;movsb`.
+      for (size_t i = 0; i < 2; ++i, offset += K_ONE_CACHELINE)
+        builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
     }
   } else {
     // Three cache lines at a time.
@@ -107,10 +108,9 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
       inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
       inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
       inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
-      // It is likely that this copy will be turned into a 'rep;movsb' on
-      // non-AVX machines.
-      builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
-      offset += K_THREE_CACHELINES;
+      // Copy one cache line at a time to prevent the use of `rep;movsb`.
+      for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+        builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
     }
   }
   // We don't use 'loop_and_tail_offset' because it assumes at least one
@@ -148,8 +148,9 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
     inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
     inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
     inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
-    builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
-    offset += K_THREE_CACHELINES;
+    // Copy one cache line at a time to prevent the use of `rep;movsb`.
+    for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+      builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
   }
   // We don't use 'loop_and_tail_offset' because it assumes at least one
   // iteration of the loop.


