[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Guillaume Chatelet via Phabricator via libc-commits libc-commits at lists.llvm.org
Wed May 10 07:03:45 PDT 2023


gchatelet added inline comments.


================
Comment at: libc/src/string/memory_utils/bcmp_implementations.h:88
+  if (count == 7)
+    return generic::BcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+  if (count == 8)
----------------
nafi3000 wrote:
> OOC, how about using head_tail (2 loads) instead of BcmpSequence of 3 loads? E.g.
> 
> generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)
> 
> 
Yeah, there are a bunch of options here. Usually I want to use `head_tail` to cover a range of sizes, as it clearly reduces the overall code size (it handles all sizes from 5 to 8 with only two loads per pointer).
Depending on how often those sizes appear in the size distribution, it might be useful to special-case the code.
I'm tempted to keep the previous logic to prevent regressions and make this a separate change. WDYT?
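For reference, the overlapping-loads trick behind `head_tail` can be sketched as a minimal stand-alone function (illustrative only, not the actual `generic::Bcmp` implementation; the names `load32` and `bcmp_head_tail` are made up for this sketch):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Unaligned-safe 4-byte load via memcpy.
static uint32_t load32(const char *p, size_t offset) {
  uint32_t v;
  std::memcpy(&v, p + offset, sizeof(v));
  return v;
}

// Compare any size in [4, 8] with just two 4-byte loads per pointer:
// one at the head and one at the tail, overlapping in the middle.
// Returns zero iff the two count-byte regions are equal.
uint32_t bcmp_head_tail(const char *p1, const char *p2, size_t count) {
  const uint32_t head = load32(p1, 0) ^ load32(p2, 0);
  const uint32_t tail = load32(p1, count - 4) ^ load32(p2, count - 4);
  return head | tail;
}
```

With `count == 7` the two loads cover bytes [0, 4) and [3, 7), so every byte is compared exactly once or twice, which is why one `head_tail` call can replace the three-load `BcmpSequence`.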


================
Comment at: libc/src/string/memory_utils/op_generic.h:617
+  const auto b = load_be<uint64_t>(p2, offset);
+  return a < b ? -1 : 1;
+}
----------------
nafi3000 wrote:
> OOC, have you ever tried other values? E.g. how about:
> ```
> return a > b ? 0x7fffffff : 0x80000000;
> ```
> 
> or if it does not compile due to the `MemcmpReturnType`, then:
> ```
> return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);
> ```
> 
> `-1` and `1` are 2 values apart.
> `0x7fffffff` and `0x80000000` are 1 value apart.
> 
> assembly diff: https://www.diffchecker.com/LMVfxJ1D/
> 
>   xor eax, eax	
>   cmp rcx, r8	
>   sbb eax, eax	
>   or eax, 1
> 
> vs
> 
>   cmp r8, rcx
>   mov eax, -2147483648
>   sbb eax, 0
> 
> In theory... the former should take 3 cycles (`or` waiting for `sbb`, `sbb` waiting for `cmp`) while the latter should take 2 cycles (`cmp` and `mov` should happen in parallel, `sbb` happening after the `cmp`), right?
Nice one. Picking other values was on my TODO list but I never thought it through.

I had a look at the codegen for armv8: it uses `cinv` instead of `cneg`, but it seems to be neutral in terms of performance.
https://godbolt.org/z/Y9aGq5sPd

For x86, in theory, this should indeed be better 👍.
https://godbolt.org/z/69Gefhqef

For RISC-V it seems to be worse, as it generates a branch.
https://godbolt.org/z/bvqMeMjMP
This seems to be in line with what I found on [stackoverflow](https://stackoverflow.com/a/72341794)

Now I tried it on the full implementation, and the additional branch seems to be outlined, leading to the same code size. I don't know the impact on speed, but since we don't yet have an optimized version of memcmp for RISC-V we can happily revisit the function later on.
I've created a separate function so we can keep track of the rationale.
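To make the trade-off concrete, here is a minimal sketch of the two return-value conventions discussed above (function names are illustrative, not the actual libc helpers; both are only called on unequal inputs, so only the sign of the result matters to memcmp's contract):

```cpp
#include <cstdint>

// Classic convention: -1 / 1 are two values apart, so on x86 the
// compiler needs a trailing `or` after the `sbb`, adding a cycle to
// the dependency chain.
int32_t cmp_neg_one_one(uint64_t a, uint64_t b) {
  return a < b ? -1 : 1;
}

// Alternative convention: INT32_MAX / INT32_MIN are adjacent bit
// patterns when viewed as unsigned, so x86 can materialize the result
// with `cmp` + `mov` + `sbb` and no trailing `or`.
int32_t cmp_extremes(uint64_t a, uint64_t b) {
  return a > b ? INT32_MAX : INT32_MIN;
}
```

Both functions agree on the sign for every unequal pair, which is all a memcmp caller is allowed to rely on.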


================
Comment at: libc/src/string/memory_utils/op_x86.h:72-73
+LIBC_INLINE __m128i bytewise_reverse(__m128i value) {
+  return _mm_shuffle_epi8(value, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, //
+                                              8, 9, 10, 11, 12, 13, 14, 15));
 }
----------------
nafi3000 wrote:
> [optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.
AFAICT we can't do it with the Intel intrinsics, as they are real functions expecting a fixed number of arguments.
It may be possible to generate them with GCC and clang vector extensions, though, and then convert them back to Intel types.

I gave it a try but it's brittle on clang and fails on GCC
https://godbolt.org/z/Ms7fW5nP3

Not sure it's worth it.
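For readers unfamiliar with the mask above: a portable scalar equivalent of what that `_mm_shuffle_epi8` call computes (illustration only; the real code keeps the intrinsic, and this helper name is made up):

```cpp
#include <algorithm>
#include <cstdint>

// Reverse the order of 16 bytes, mirroring the effect of
// _mm_shuffle_epi8 with the descending 15..0 index mask above.
void bytewise_reverse16(uint8_t bytes[16]) {
  std::reverse(bytes, bytes + 16);
}
```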


================
Comment at: libc/src/string/memory_utils/op_x86.h:145-146
+  static_assert(cpp::is_same_v<cpp::remove_cv_t<decltype(le)>, uint32_t>);
+  const int64_t diff = static_cast<int64_t>(ge) - static_cast<int64_t>(le);
+  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
+}
----------------
nafi3000 wrote:
> Would it make sense to factor out this part to another function?
> This is used here and for cmp<uint32_t>.
Yes, I've done so for the `uint64_t` version, so let's factor this out for `uint32_t` as well.
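As a sketch, the factored-out helper would look roughly like this (the name is illustrative): it narrows the 33-bit signed difference of two `uint32_t` values to an `int32_t` while preserving both the sign and non-zeroness:

```cpp
#include <cstdint>

// diff fits in 33 bits, so an arithmetic shift right by one brings it
// into int32_t range while keeping its sign; OR-ing back the low 16
// bits guarantees the result stays non-zero when diff is odd (i.e.
// when the shift alone could drop the only set bit).
int32_t cmp_uint32(uint32_t ge, uint32_t le) {
  const int64_t diff = static_cast<int64_t>(ge) - static_cast<int64_t>(le);
  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
}
```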


================
Comment at: libc/test/src/string/memory_utils/op_tests.cpp:226
+    generic::BcmpSequence<uint16_t, uint8_t>,         //
+    generic::BcmpSequence<uint32_t, uint16_t, uint8_t>>;
 
----------------
nafi3000 wrote:
> Do we need to add `generic::BcmpSequence<uint32_t, uint8_t>` and `generic::BcmpSequence<uint32_t, uint16_t>` here too? I am interpreting the above list as:
> 8, 1, 2, 4, 1+1, 1+1+1, 2+1, 4+2+1
Done. More coverage doesn't hurt : )
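For readers following along, the "4+2+1" composition in that list can be sketched like this (illustration only, not the actual `BcmpSequence` template; the names here are made up):

```cpp
#include <cstdint>
#include <cstring>

// XOR one T-sized load from each pointer; zero iff the blocks match.
template <typename T>
static uint32_t xor_block(const char *p1, const char *p2) {
  T a, b;
  std::memcpy(&a, p1, sizeof(T));
  std::memcpy(&b, p2, sizeof(T));
  return static_cast<uint32_t>(a ^ b);
}

// Compare 4 + 2 + 1 = 7 bytes as a sequence of three non-overlapping
// loads, OR-ing the partial results; zero iff all 7 bytes are equal.
uint32_t bcmp_4_2_1(const char *p1, const char *p2) {
  return xor_block<uint32_t>(p1, p2) |
         xor_block<uint16_t>(p1 + 4, p2 + 4) |
         xor_block<uint8_t>(p1 + 6, p2 + 6);
}
```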


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717


