[libc-commits] [libc] 2f58ac4 - [libc][x86] copy one cache line at a time to prevent the use of `rep; movsb` (#113161)
via libc-commits
libc-commits at lists.llvm.org
Tue Oct 22 01:48:48 PDT 2024
Author: Guillaume Chatelet
Date: 2024-10-22T10:48:43+02:00
New Revision: 2f58ac4a22baa27c1e9aad1b3c6d5c687ef03721
URL: https://github.com/llvm/llvm-project/commit/2f58ac4a22baa27c1e9aad1b3c6d5c687ef03721
DIFF: https://github.com/llvm/llvm-project/commit/2f58ac4a22baa27c1e9aad1b3c6d5c687ef03721.diff
LOG: [libc][x86] copy one cache line at a time to prevent the use of `rep;movsb` (#113161)
When using `-mprefer-vector-width=128` with `-march=sandybridge`, copying
three cache lines in one go (192B) gets converted into `rep;movsb`, which
translates into a 60% performance hit.
Consecutive calls to `__builtin_memcpy_inline` (the implementation behind
`builtin::Memcpy::block_offset`) are not coalesced by the compiler, so
calling it three times in a row generates the desired assembly. The result
differs only in the interleaving of the loads and stores and does not
affect performance.
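For illustration only, a minimal standalone sketch of the one-cache-line-at-a-time
pattern, assuming Clang (for `__builtin_memcpy_inline`) and a 64-byte cache line;
the `copy_three_cachelines` helper is hypothetical and not part of the patch:

  #include <stddef.h>

  // Assumed 64-byte cache line for this sketch.
  constexpr size_t K_ONE_CACHELINE = 64;

  // Copy three cache lines one 64-byte block at a time. Each call to
  // __builtin_memcpy_inline (a Clang builtin requiring a compile-time
  // constant size) is emitted as discrete loads/stores, and consecutive
  // calls are not merged by the compiler, so the copy is never lowered
  // to `rep;movsb`.
  static void copy_three_cachelines(char *__restrict dst,
                                    const char *__restrict src,
                                    size_t offset) {
    for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
      __builtin_memcpy_inline(dst + offset, src + offset, K_ONE_CACHELINE);
  }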
This is needed to reland
https://github.com/llvm/llvm-project/pull/108939.
Added:
Modified:
libc/src/string/memory_utils/x86_64/inline_memcpy.h
Removed:
################################################################################
diff --git a/libc/src/string/memory_utils/x86_64/inline_memcpy.h b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
index 2b2c6e6fbc5466..68f64fb1a5023b 100644
--- a/libc/src/string/memory_utils/x86_64/inline_memcpy.h
+++ b/libc/src/string/memory_utils/x86_64/inline_memcpy.h
@@ -98,8 +98,9 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
while (offset + K_TWO_CACHELINES + 32 <= count) {
inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
- builtin::Memcpy<K_TWO_CACHELINES>::block_offset(dst, src, offset);
- offset += K_TWO_CACHELINES;
+ // Copy one cache line at a time to prevent the use of `rep;movsb`.
+ for (size_t i = 0; i < 2; ++i, offset += K_ONE_CACHELINE)
+ builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
}
} else {
// Three cache lines at a time.
@@ -107,10 +108,9 @@ inline_memcpy_x86_sse2_ge64_sw_prefetching(Ptr __restrict dst,
inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
- // It is likely that this copy will be turned into a 'rep;movsb' on
- // non-AVX machines.
- builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
- offset += K_THREE_CACHELINES;
+ // Copy one cache line at a time to prevent the use of `rep;movsb`.
+ for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+ builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
}
}
// We don't use 'loop_and_tail_offset' because it assumes at least one
@@ -148,8 +148,9 @@ inline_memcpy_x86_avx_ge64_sw_prefetching(Ptr __restrict dst,
inline_memcpy_prefetch(dst, src, offset + K_ONE_CACHELINE);
inline_memcpy_prefetch(dst, src, offset + K_TWO_CACHELINES);
inline_memcpy_prefetch(dst, src, offset + K_THREE_CACHELINES);
- builtin::Memcpy<K_THREE_CACHELINES>::block_offset(dst, src, offset);
- offset += K_THREE_CACHELINES;
+ // Copy one cache line at a time to prevent the use of `rep;movsb`.
+ for (size_t i = 0; i < 3; ++i, offset += K_ONE_CACHELINE)
+ builtin::Memcpy<K_ONE_CACHELINE>::block_offset(dst, src, offset);
}
// We don't use 'loop_and_tail_offset' because it assumes at least one
// iteration of the loop.