[libc-commits] [libc] [libc] Fix buggy AVX2 `memcmp` (PR #77081)

Nafi Rouf via libc-commits libc-commits at lists.llvm.org
Sat Jan 6 03:11:09 PST 2024


================
@@ -181,11 +182,31 @@ LIBC_INLINE __m256i bytewise_max(__m256i a, __m256i b) {
   return _mm256_max_epu8(a, b);
 }
 LIBC_INLINE __m256i bytewise_reverse(__m256i value) {
-  return _mm256_shuffle_epi8(value,
-                             _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
-                                             8, 9, 10, 11, 12, 13, 14, 15,   //
-                                             16, 17, 18, 19, 20, 21, 22, 23, //
-                                             24, 25, 26, 27, 28, 29, 30, 31));
+  const __m256i indices = _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
+                                          8, 9, 10, 11, 12, 13, 14, 15,   //
+                                          16, 17, 18, 19, 20, 21, 22, 23, //
+                                          24, 25, 26, 27, 28, 29, 30, 31);
+#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
+  // AVX512 allows full __m256i byte permutation.
+  // ymm = ymm[31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,
+  //           15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
+  return _mm256_permutexvar_epi8(value, indices);
+#else
+  // We can't byte-reverse __m256i in a single instruction with AVX2.
+  // '_mm256_shuffle_epi8' can only shuffle within each xmm lane
+  // leading to:
+  // ymm = ymm[15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
+  //           31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16]
+  const __m256i tmp = _mm256_shuffle_epi8(value, indices);
+  // Then we shuffle across lanes using 64-bit values.
+  // ymm = ymm[2,3,0,1]
+  // Leading to a fully reversed vector
+  // ymm = ymm[31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,
+  //           15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
+  // The immediate encodes the 64 bit word indices  :    1, 0, 3, 2.
+  // Each index is encoded with 2 bits              : 0b01'00'11'10.
+  return _mm256_permute4x64_epi64(tmp, 0b01'00'11'10);
----------------
nafi3000 wrote:

Consider doing the byte reversal after calling `_mm256_movemask_epi8`; a single `rorx` (1 cycle) is then sufficient. E.g.
```cpp
#include <immintrin.h>
#include <stdint.h>

static inline uint32_t SwapWords(uint32_t x) {
  return (x << 16) | (x >> 16);
}

int cmp_neq(__m256i a, __m256i b) {
  __m256i vmax = _mm256_max_epu8(a, b);
  __m256i a_le_b = _mm256_cmpeq_epi8(vmax, b);
  __m256i a_ge_b = _mm256_cmpeq_epi8(vmax, a);
  const __m256i indices = _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,         //
                                          8, 9, 10, 11, 12, 13, 14, 15,   //
                                          0, 1, 2, 3, 4, 5, 6, 7,         //
                                          8, 9, 10, 11, 12, 13, 14, 15);
  uint32_t le = SwapWords(_mm256_movemask_epi8(_mm256_shuffle_epi8(a_le_b, indices)));
  uint32_t ge = SwapWords(_mm256_movemask_epi8(_mm256_shuffle_epi8(a_ge_b, indices)));
  return le < ge ? 5 : -5;
}
```
https://godbolt.org/z/6bhjE35j3

Version|llvm-mca|skylake latency|znver3 latency|skylake latency without mask loading|znver3 latency without mask loading
:-:|:-:|:-:|:-:|:-:|:-:
[With bug](https://godbolt.org/z/ean49e9Pn)|[link](https://godbolt.org/z/d5Ej6K5PP)|13|13|8|7
[#else version above](https://godbolt.org/z/nj8o9eK5M)|[link](https://godbolt.org/z/rPxcnKzhh)|17|19|12|13
[#if version above](https://godbolt.org/z/jesrK8568)|[link](https://godbolt.org/z/r77bdb4fq)|15|17|10|10
[shuffle-movemask-rorx version](https://godbolt.org/z/6bhjE35j3)|[link](https://godbolt.org/z/11rbWzbnv)|14|14|9|8

I have tested the shuffle-movemask-rorx version: https://godbolt.org/z/as5Kqjof5. The tests fail without the `rorx` fix: https://godbolt.org/z/ccndGYM6z.

What do you think?

https://github.com/llvm/llvm-project/pull/77081


More information about the libc-commits mailing list