[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Wed May 3 10:04:56 PDT 2023

nafi3000 added inline comments.

================
Comment at: libc/src/string/memory_utils/op_generic.h:466-476
+    if constexpr (cmp_is_expensive<T>::value) {
+      if (!eq<T>(p1, p2, 0))
+        return cmp_neq<T>(p1, p2, 0);
+    } else {
+      if (auto value = cmp<T>(p1, p2, 0))
         return value;
+    }
----------------
I wonder if it would be better to use `cmp<T>` only for the last comparison. The motivation is that for every non-last block we have to test the comparison result anyway (e.g. line 470 above) to decide whether to load and compare the next block in the sequence. Wouldn't it be better to compute that decision (zero or non-zero) as early as possible, rather than computing the full `cmp` result (0, <0 or >0) first?

E.g.

    if constexpr (sizeof...(TS) == 0) {
      if constexpr (cmp_is_expensive<T>::value) {
        if (eq<T>(p1, p2, 0))
          return MemcmpReturnType::ZERO();
        return cmp_neq<T>(p1, p2, 0);
      } else {
        return cmp<T>(p1, p2, 0);
      }
    } else {
      if (!eq<T>(p1, p2, 0))
        return cmp_neq<T>(p1, p2, 0);
      return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
    }

And for the last block, I wonder if we could call `cmp<T>` unconditionally instead of going through `eq<T>` and `cmp_neq<T>`. Which variant is faster depends on the data: e.g. for `__m512i`, `cmp<T>` is faster if at least one byte differs in the last 64 bytes.
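To make the combined suggestion concrete, here is a minimal standalone sketch (C++17). It is not the code from the patch: the names `eq`, `cmp`, `cmp_neq` and `MemcmpSequence` mirror the ones above, but their bodies, the `load` helper, the plain `int` return type (in place of `MemcmpReturnType`) and the use of scalar integer types as stand-ins for vector types are all assumptions made for illustration. The `cmp_is_expensive` distinction is left out on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: load a T from possibly unaligned memory.
template <typename T> T load(const unsigned char *p) {
  T v;
  std::memcpy(&v, p, sizeof(T));
  return v;
}

// Cheap test that only answers "equal or not".
template <typename T>
bool eq(const unsigned char *p1, const unsigned char *p2) {
  return load<T>(p1) == load<T>(p2);
}

// Full three-way comparison in memcmp (byte-wise) order.
// Assumption: a byte loop stands in for the real per-type implementation.
template <typename T>
int cmp(const unsigned char *p1, const unsigned char *p2) {
  for (std::size_t i = 0; i < sizeof(T); ++i)
    if (p1[i] != p2[i])
      return p1[i] < p2[i] ? -1 : 1;
  return 0;
}

// Three-way comparison with the precondition that the blocks differ.
template <typename T>
int cmp_neq(const unsigned char *p1, const unsigned char *p2) {
  assert(!eq<T>(p1, p2));
  return cmp<T>(p1, p2);
}

template <typename T, typename... TS> struct MemcmpSequence {
  static int block(const unsigned char *p1, const unsigned char *p2) {
    if constexpr (sizeof...(TS) == 0) {
      // Last block: call cmp<T> unconditionally.
      return cmp<T>(p1, p2);
    } else {
      // Non-last block: decide "continue or stop" with the cheap test first.
      if (!eq<T>(p1, p2))
        return cmp_neq<T>(p1, p2);
      return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
    }
  }
};
```

With this shape, `MemcmpSequence<std::uint32_t, std::uint16_t, std::uint8_t>::block(p1, p2)` compares 7 bytes and only pays for the three-way result in the block where the comparison actually stops.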

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717