[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Nafi Rouf via Phabricator via libc-commits libc-commits at lists.llvm.org
Tue May 2 11:32:06 PDT 2023


nafi3000 added inline comments.


================
Comment at: libc/src/string/memory_utils/bcmp_implementations.h:88
+  if (count == 7)
+    return generic::BcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+  if (count == 8)
----------------
OOC, how about using head_tail (2 loads) instead of a BcmpSequence of 3 loads? E.g.

```
generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)
```




================
Comment at: libc/src/string/memory_utils/op_generic.h:617
+  const auto b = load_be<uint64_t>(p2, offset);
+  return a < b ? -1 : 1;
+}
----------------
OOC, have you ever tried other values? E.g. how about:
```
return a > b ? 0x7fffffff : 0x80000000;
```

or if it does not compile due to the `MemcmpReturnType`, then:
```
return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);
```

`-1` and `1` are 2 values apart.
`0x7fffffff` and `0x80000000` are only 1 value apart (incrementing `0x7fffffff` as a bit pattern gives `0x80000000`), which lets the compiler fold the adjustment into the `sbb`.

assembly diff: https://www.diffchecker.com/LMVfxJ1D/

  xor eax, eax
  cmp rcx, r8
  sbb eax, eax
  or eax, 1

vs

  cmp r8, rcx
  mov eax, -2147483648
  sbb eax, 0

In theory, the former should take 3 cycles (`or` waits on `sbb`, `sbb` waits on `cmp`) while the latter should take 2 cycles (`cmp` and `mov` can issue in parallel, with `sbb` following the `cmp`), right?


================
Comment at: libc/src/string/memory_utils/op_x86.h:72-73
+LIBC_INLINE __m128i bytewise_reverse(__m128i value) {
+  return _mm_shuffle_epi8(value, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, //
+                                              8, 9, 10, 11, 12, 13, 14, 15));
 }
----------------
[optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.


================
Comment at: libc/src/string/memory_utils/op_x86.h:145-146
+  static_assert(cpp::is_same_v<cpp::remove_cv_t<decltype(le)>, uint32_t>);
+  const int64_t diff = static_cast<int64_t>(ge) - static_cast<int64_t>(le);
+  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
+}
----------------
Would it make sense to factor out this part into a separate function?
It is used both here and for `cmp<uint32_t>`.


================
Comment at: libc/src/string/memory_utils/op_x86.h:191
+  static_assert(cpp::is_same_v<cpp::remove_cv_t<decltype(le)>, uint64_t>);
+  return ge < le ? -1 : 1;
+}
----------------
Ditto: `-1 : 1` vs `0x80000000 : 0x7fffffff`.


================
Comment at: libc/src/string/memory_utils/x86_64/memcmp_implementations.h:75
+  if (count == 7)
+    return generic::MemcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+  if (count == 8)
----------------
Ditto. As with the bcmp comment above, how about using head_tail (2 loads) instead of a MemcmpSequence of 3 loads? E.g.

```
generic::Memcmp<uint32_t>::head_tail(p1, p2, 7)
```

x86 asm diff:
https://www.diffchecker.com/XQNu3lGN/


================
Comment at: libc/test/src/string/memory_utils/op_tests.cpp:226
+    generic::BcmpSequence<uint16_t, uint8_t>,         //
+    generic::BcmpSequence<uint32_t, uint16_t, uint8_t>>;
 
----------------
Do we need to add `generic::BcmpSequence<uint32_t, uint8_t>` and `generic::BcmpSequence<uint32_t, uint16_t>` here too? I am interpreting the above list as:
8, 1, 2, 4, 1+1, 1+1+1, 2+1, 4+2+1


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717


