[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Thu May 11 04:48:27 PDT 2023

gchatelet marked an inline comment as done.
gchatelet added inline comments.

================
Comment at: libc/src/string/memory_utils/bcmp_implementations.h:38
 #if defined(LIBC_TARGET_ARCH_IS_X86)
+#if defined(__SSE4_1__)
 [[maybe_unused]] LIBC_INLINE BcmpReturnType
----------------
lntue wrote:
> Currently we only check for `SSE4.2` in our CMake build https://github.com/llvm/llvm-project/blob/main/libc/cmake/modules/LLVMLibCCheckCpuFeatures.cmake#L9
> 
> Do you want to change or add `SSE4.1` to the list instead?
Technically this code only needs `SSE4.1` but I don't think it's worth discriminating between the two.
https://en.wikipedia.org/wiki/SSE4#SSE4_subsets

Looking at [Steam hardware survey](https://store.steampowered.com/hwsurvey/) (click the `Other Settings` line at the bottom of the page) the share of cpu having SSE4.1 but not SSE4.2 is about 0.23%.
 - SSE4.1 : 99.36%
 - SSE4.2 : 99.13%

So we can basically only discriminate between `SSE2` and `SSE4.2` and call it a day : )

================
Comment at: libc/src/string/memory_utils/op_generic.h:466-476
+    if constexpr (cmp_is_expensive<T>::value) {
+      if (!eq<T>(p1, p2, 0))
+        return cmp_neq<T>(p1, p2, 0);
+    } else {
+      if (auto value = cmp<T>(p1, p2, 0))
         return value;
+    }
----------------
nafi3000 wrote:
> I wonder if it is better to use `cmp<T>` only for the last comparison. Motivation is that for non-last compare blocks we need to check the comparison result anyway (e.g. line 470 above) to decide whether to load and compare the next block in the sequence. Isn't it better to compute this decision (0 or non-0) as early as possible instead of computing the full cmp result (0, <0 or >0)?
> 
> E.g.
> 
>     if constexpr (sizeof...(TS) == 0) {
>       if constexpr (cmp_is_expensive<T>::value) {
>         if (eq<T>(p1, p2, 0))
>           return MemcmpReturnType::ZERO();
>         return cmp_neq<T>(p1, p2, 0);
>       } else {
>         return cmp<T>(p1, p2, 0);
>       }
>     } else {
>       if (!eq<T>(p1, p2, 0))
>         return cmp_neq<T>(p1, p2, 0);
>       return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
>     }
> 
> And, for the last block, I wonder if we can invariably call `cmp<T>` instead. What is better would depend on data. E.g. for `__m512i`, `cmp<T>` is faster if there is at least 1 byte mismatch in the last 64 bytes.
The current benchmark works for all memory functions and is only able to assess the throughput of functions under a particular size distribution.
As-is, it is not a good tool to evaluate functions that may return early (`bcmp` and `memcmp`) and for which latency is important.

I'll work on adding a latency benchmark based on the one you provided me earlier. Once it's done I think we will be in a better position to decide which strategy is better.

SGTY?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717