[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Nafi Rouf via Phabricator via libc-commits <libc-commits at lists.llvm.org>
Fri Jun 2 19:33:28 PDT 2023


nafi3000 accepted this revision.
nafi3000 added inline comments.
This revision is now accepted and ready to land.


================
Comment at: libc/src/string/memory_utils/bcmp_implementations.h:88
+  if (count == 7)
+    return generic::BcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+  if (count == 8)
----------------
gchatelet wrote:
> nafi3000 wrote:
> > Out of curiosity, how about using `head_tail` (2 loads) instead of a `BcmpSequence` of 3 loads? E.g.
> > 
> > generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)
> > 
> > 
> Yeah, there are a bunch of options here. Usually I want to use `head_tail` to cover a range of sizes, as it clearly reduces the overall code size (all sizes from 5 to 8 with only two loads per pointer).
> Depending on how often those sizes appear in the size distribution, it might be useful to special-case the code.
> I'm tempted to keep the previous logic to prevent regressions and make this a separate change. WDYT?
Separate change SGTM.
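
For reference, the `head_tail` trick under discussion looks roughly like this (a minimal standalone sketch with the hypothetical names `load32` and `bcmp_head_tail32`, not the actual LLVM libc code). Comparing the first 4 and last 4 bytes covers every count in [5, 8] with only two loads per pointer, because the two loads overlap whenever count < 8:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Load 4 bytes without any alignment assumption.
    static inline uint32_t load32(const char *p) {
      uint32_t v;
      std::memcpy(&v, p, sizeof(v));
      return v;
    }

    // bcmp semantics: returns non-zero iff the buffers differ.
    // Valid for 4 <= count <= 8: the head load covers bytes [0, 4),
    // the tail load covers bytes [count - 4, count), and together
    // they cover [0, count).
    static inline int bcmp_head_tail32(const char *p1, const char *p2,
                                       size_t count) {
      uint32_t head = load32(p1) ^ load32(p2);
      uint32_t tail = load32(p1 + count - 4) ^ load32(p2 + count - 4);
      return (head | tail) != 0;
    }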


================
Comment at: libc/src/string/memory_utils/op_generic.h:466-476
+    if constexpr (cmp_is_expensive<T>::value) {
+      if (!eq<T>(p1, p2, 0))
+        return cmp_neq<T>(p1, p2, 0);
+    } else {
+      if (auto value = cmp<T>(p1, p2, 0))
         return value;
+    }
----------------
gchatelet wrote:
> nafi3000 wrote:
> > I wonder if it is better to use `cmp<T>` only for the last comparison. The motivation is that for the non-last compare blocks we need to check the comparison result anyway (e.g. line 470 above) to decide whether to load and compare the next block in the sequence. Isn't it better to compute this decision (0 or non-0) as early as possible, instead of computing the full cmp result (0, <0, or >0)?
> > 
> > E.g.
> > 
> >     if constexpr (sizeof...(TS) == 0) {
> >       if constexpr (cmp_is_expensive<T>::value) {
> >         if (eq<T>(p1, p2, 0))
> >           return MemcmpReturnType::ZERO();
> >         return cmp_neq<T>(p1, p2, 0);
> >       } else {
> >         return cmp<T>(p1, p2, 0);
> >       }
> >     } else {
> >       if (!eq<T>(p1, p2, 0))
> >         return cmp_neq<T>(p1, p2, 0);
> >       return MemcmpSequence<TS...>::block(p1 + sizeof(T), p2 + sizeof(T));
> >     }
> > 
> > And, for the last block, I wonder if we can invariably call `cmp<T>` instead. Which is better would depend on the data. E.g. for `__m512i`, `cmp<T>` is faster if there is at least one byte mismatch in the last 64 bytes.
> The current benchmark works for all memory functions but is only able to assess throughput under a particular size distribution.
> As-is, it is not a good tool for evaluating functions that may return early (`bcmp` and `memcmp`) and for which latency is important.
> 
> I'll work on adding a latency benchmark based on the one you provided me earlier. Once it's done I think we will be in a better position to decide which strategy is better.
> 
> SGTY?
Sounds good.
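
To illustrate the trade-off being discussed, here is a sketch (plain SSE2, with the hypothetical names `eq16` and `cmp_neq16`, not the actual `op_generic.h` code) of why `eq` is cheaper than a full `cmp` for wide types: equality is a single vector compare plus a movemask, while the ordered result must also locate the first differing byte, which is only worth doing once a mismatch is known to exist:

    #include <cstdint>
    #include <immintrin.h>

    // Equality test: one vector compare plus one movemask.
    static inline bool eq16(const char *p1, const char *p2) {
      __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p1));
      __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p2));
      return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
    }

    // Ordered result; precondition: the blocks are known to differ
    // (so `neq` is non-zero). Finds the first mismatching byte and
    // compares it as unsigned, as memcmp requires.
    static inline int cmp_neq16(const char *p1, const char *p2) {
      __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p1));
      __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p2));
      uint16_t neq = static_cast<uint16_t>(
          ~_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)));
      int i = __builtin_ctz(neq); // index of the first differing byte
      return static_cast<unsigned char>(p1[i]) -
             static_cast<unsigned char>(p2[i]);
    }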


================
Comment at: libc/src/string/memory_utils/op_x86.h:72-73
+LIBC_INLINE __m128i bytewise_reverse(__m128i value) {
+  return _mm_shuffle_epi8(value, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, //
+                                              8, 9, 10, 11, 12, 13, 14, 15));
 }
----------------
gchatelet wrote:
> nafi3000 wrote:
> > [optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.
> AFAICT we can't do it with the Intel intrinsics, as they are real functions expecting a certain number of arguments.
> It may be possible to generate them with the GCC and Clang vector extensions, though, and then convert them back to Intel types.
> 
> I gave it a try, but it's brittle on Clang and fails on GCC:
> https://godbolt.org/z/Ms7fW5nP3
> 
> Not sure it's worth it.
We can use `_mm_load_si128` instead of `_mm_set_epi8`. The Snappy code has an example:
https://github.com/google/snappy/blob/main/snappy.cc
Search for `pattern_generation_masks`.

Anyway, this can be addressed in a separate diff. And as you mention, it may not be worth it. In Snappy, we actually need an array of such shuffle masks, but in this case we just need one.
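
The loaded-mask variant suggested above might look like this (a sketch following the Snappy approach; `kReverseMask` is a hypothetical name, and the original uses `LIBC_INLINE` rather than `static inline`):

    #include <cstdint>
    #include <immintrin.h>

    // The same mask _mm_set_epi8 builds above, spelled as a plain array.
    // Byte i of the mask is 15 - i, so _mm_shuffle_epi8 reverses the bytes.
    alignas(16) static constexpr uint8_t kReverseMask[16] = {
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};

    static inline __m128i bytewise_reverse(__m128i value) {
      const __m128i mask =
          _mm_load_si128(reinterpret_cast<const __m128i *>(kReverseMask));
      return _mm_shuffle_epi8(value, mask);
    }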


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717


