[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen
Nafi Rouf via Phabricator via libc-commits
libc-commits at lists.llvm.org
Tue May 2 11:32:06 PDT 2023
nafi3000 added inline comments.
================
Comment at: libc/src/string/memory_utils/bcmp_implementations.h:88
+ if (count == 7)
+ return generic::BcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+ if (count == 8)
----------------
OOC, how about using head_tail (2 loads) instead of BcmpSequence of 3 loads? E.g.
```
generic::Bcmp<uint32_t>::head_tail(p1, p2, 7)
```
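For reference, a minimal standalone sketch of the two-overlapping-loads idea (not the in-tree `head_tail` implementation; `load32` is a hypothetical memcpy-based helper): for any `count` in [4, 8], a load at offset 0 and a load at offset `count - 4` together cover every byte, so bcmp only needs to OR the two XORed pairs.
```
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not from the patch: unaligned 4-byte load.
static inline uint32_t load32(const char *p, size_t offset) {
  uint32_t v;
  std::memcpy(&v, p + offset, sizeof(v));
  return v;
}

// Sketch of the Bcmp<uint32_t>::head_tail semantics for 4 <= count <= 8:
// the two loads overlap, so the result is nonzero iff any byte differs.
static inline uint32_t bcmp_head_tail_u32(const char *p1, const char *p2,
                                          size_t count) {
  const uint32_t head = load32(p1, 0) ^ load32(p2, 0);
  const uint32_t tail = load32(p1, count - 4) ^ load32(p2, count - 4);
  return head | tail;
}
```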
================
Comment at: libc/src/string/memory_utils/op_generic.h:617
+ const auto b = load_be<uint64_t>(p2, offset);
+ return a < b ? -1 : 1;
+}
----------------
OOC, have you ever tried other values? E.g. how about:
```
return a > b ? 0x7fffffff : 0x80000000;
```
or if it does not compile due to the `MemcmpReturnType`, then:
```
return a > b ? static_cast<int32_t>(0x7fffffff) : static_cast<int32_t>(0x80000000);
```
`-1` and `1` are 2 values apart, whereas `0x7fffffff` and `0x80000000` are only 1 apart (as unsigned), so the compiler can select between them with a single `sbb` off the borrow flag instead of a dependent `or`.
assembly diff: https://www.diffchecker.com/LMVfxJ1D/
```
xor eax, eax
cmp rcx, r8
sbb eax, eax
or eax, 1
```
vs
```
cmp r8, rcx
mov eax, -2147483648
sbb eax, 0
```
In theory... the former should take 3 cycles (`or` waiting for `sbb`, `sbb` waiting for `cmp`) while the latter should take 2 cycles (`cmp` and `mov` should happen in parallel, `sbb` happening after the `cmp`), right?
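For side-by-side inspection (e.g. on Compiler Explorer), a self-contained sketch of the two variants; the function names are made up and the `load_be` calls are replaced by plain parameters:
```
#include <cstdint>

// Variant from the patch: -1 / 1, which needs the extra dependent `or`.
int32_t cmp_neq_ones(uint64_t a, uint64_t b) { return a < b ? -1 : 1; }

// Suggested variant: the adjacent constants INT32_MAX / INT32_MIN, which the
// compiler can form with cmp + mov + sbb.
int32_t cmp_neq_extremes(uint64_t a, uint64_t b) {
  return a > b ? static_cast<int32_t>(0x7fffffff)
               : static_cast<int32_t>(0x80000000);
}
```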
================
Comment at: libc/src/string/memory_utils/op_x86.h:72-73
+LIBC_INLINE __m128i bytewise_reverse(__m128i value) {
+ return _mm_shuffle_epi8(value, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, //
+ 8, 9, 10, 11, 12, 13, 14, 15));
}
----------------
[optional nit] Maybe not in this diff, but eventually we can programmatically generate the sequences here and at lines 129 and 160 below.
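Something along these lines could work (just a sketch; `std::index_sequence` stands in for whatever the in-tree `cpp::` equivalent is, and the helper name is made up):
```
#include <cstddef>
#include <immintrin.h>
#include <utility>

// Sketch: build the byte-reversal control mask from an index sequence so the
// 0..15 (and later 0..31 / 0..63) literals need not be spelled out by hand.
template <std::size_t... Is>
__m128i bytewise_reverse_impl(__m128i value, std::index_sequence<Is...>) {
  // _mm_set_epi8 takes bytes from most- to least-significant, so passing
  // 0, 1, ..., 15 in order yields the same reversal mask as the literal form.
  return _mm_shuffle_epi8(value, _mm_set_epi8(static_cast<char>(Is)...));
}

__m128i bytewise_reverse(__m128i value) {
  return bytewise_reverse_impl(value, std::make_index_sequence<16>{});
}
```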
================
Comment at: libc/src/string/memory_utils/op_x86.h:145-146
+ static_assert(cpp::is_same_v<cpp::remove_cv_t<decltype(le)>, uint32_t>);
+ const int64_t diff = static_cast<int64_t>(ge) - static_cast<int64_t>(le);
+ return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
+}
----------------
Would it make sense to factor out this part to another function?
This is used here and for cmp<uint32_t>.
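E.g. something like (the helper name is just for illustration; this assumes the same `LIBC_INLINE`/`MemcmpReturnType` context as the quoted code):
```
// Shared "not equal" tail: both call sites could funnel their ge/le halves
// through this instead of repeating the arithmetic.
LIBC_INLINE MemcmpReturnType cmp_neq_from_ge_le(uint32_t ge, uint32_t le) {
  const int64_t diff = static_cast<int64_t>(ge) - static_cast<int64_t>(le);
  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
}
```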
================
Comment at: libc/src/string/memory_utils/op_x86.h:191
+ static_assert(cpp::is_same_v<cpp::remove_cv_t<decltype(le)>, uint64_t>);
+ return ge < le ? -1 : 1;
+}
----------------
Ditto: `-1 : 1` vs `0x80000000 : 0x7fffffff`.
================
Comment at: libc/src/string/memory_utils/x86_64/memcmp_implementations.h:75
+ if (count == 7)
+ return generic::MemcmpSequence<uint32_t, uint16_t, uint8_t>::block(p1, p2);
+ if (count == 8)
----------------
Ditto. Similar to the bcmp comment, how about using head_tail (2 loads) instead of MemcmpSequence of 3 loads? E.g.
```
generic::Memcmp<uint32_t>::head_tail(p1, p2, 7)
```
x86 asm diff:
https://www.diffchecker.com/XQNu3lGN/
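As in the bcmp sketch above, the two loads overlap; for ordered comparison a standalone version could look like this (the `load32_be` helper is hypothetical, not the in-tree `load_be`):
```
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: 4-byte load interpreted as big-endian so that integer
// comparison matches lexicographic byte order (assumes a little-endian host).
static inline uint32_t load32_be(const char *p, size_t offset) {
  uint32_t v;
  std::memcpy(&v, p + offset, sizeof(v));
  return __builtin_bswap32(v);
}

// Sketch of the Memcmp<uint32_t>::head_tail semantics for 4 <= count <= 8:
// if the first 4 bytes are equal, the verdict comes from the (overlapping)
// last 4 bytes.
static inline int memcmp_head_tail_u32(const char *p1, const char *p2,
                                       size_t count) {
  const uint32_t a = load32_be(p1, 0), b = load32_be(p2, 0);
  if (a != b)
    return a < b ? -1 : 1;
  const uint32_t c = load32_be(p1, count - 4), d = load32_be(p2, count - 4);
  if (c != d)
    return c < d ? -1 : 1;
  return 0;
}
```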
================
Comment at: libc/test/src/string/memory_utils/op_tests.cpp:226
+ generic::BcmpSequence<uint16_t, uint8_t>, //
+ generic::BcmpSequence<uint32_t, uint16_t, uint8_t>>;
----------------
Do we need to add `generic::BcmpSequence<uint32_t, uint8_t>` and `generic::BcmpSequence<uint32_t, uint16_t>` here too? I am interpreting the above list as:
8, 1, 2, 4, 1+1, 1+1+1, 2+1, 4+2+1
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D148717/new/
https://reviews.llvm.org/D148717