[llvm-dev] [RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.

Fri Jan 11 04:45:56 PST 2019

On Thu, Jan 10, 2019 at 4:47 PM Clement Courbet <courbet at google.com> wrote:

>
>
> On Wed, Jan 9, 2019 at 6:16 PM James Y Knight <jyknight at google.com> wrote:
>
>>
>>
>> On Tue, Jan 8, 2019 at 9:24 AM Clement Courbet <courbet at google.com>
>> wrote:
>>
>>>
>>>
>>> On Mon, Jan 7, 2019 at 10:26 PM James Y Knight <jyknight at google.com>
>>> wrote:
>>>
>> I'm afraid about the "almost" and "generally": what about users who don't
>>> ?
>>>
>>
>> Even so, it should be fine to enable it for those platforms which do
>> include it.
>>
>> I do note, sadly, that currently out of all these implementations, only
>>>> NetBSD and FreeBSD seem to actually define a separate more optimized bcmp
>>>> function. That does mean that this optimization would be effectively a
>>>> no-op, for the vast majority of people.
>>>>
>>>
>>> This might or might not be considered really an issue.
>>>
>>
>> Right, the issue is adding an effectively useless optimization in llvm.
>>
>>  - In my original proposal, people have to explicitly opt-in to the
>>> feature and link to their memcmp implementation, they do not get the
>>> improvement automatically.
>>>  - In this proposal, they have to patch their libc, which might be
>>> slightly more painful depending on the system.
>>>
>>
>> Users may also include a function named bcmp in their binary, which will
>> overrides the one from libc.
>>
>> Here's a patch with this proposal to see what this looks like:
>>> https://reviews.llvm.org/D56436
>>>
>>
>> It feels like this optimization would be better done in
>> llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp,
>>
>
>  I'll have a look at this approach.
>

You're right, that's a better place to do this indeed:
https://reviews.llvm.org/D56593

>
>
>>
>>  But if you can show a similar performance win in public code, it'd be
>>>> great to attempt to push a more optimized version upstream at least to
>>>> glibc. Some more precise numbers than "very large improvement" are probably
>>>> necessary to show it's actually worth it. :)
>>>>
>>>
>>> We were planning to contribute it to compiler-rt. Contributing a
>>> deprecated function to the libc sounds.... weird.
>>>
>>
>> Yes, contributing an optimization for a deprecated function is indeed
>> weird. Thus the importance of reliable performance numbers justifying the
>> importance -- I'd never have thought that the performance cost of returning
>> an ordering from memcmp would be important, and I suspect nobody else did.
>>
>
> Fair enough, let me give some numbers for this change.
> Before that, a caveat with any benchmarks for comparing strings is that
> the results depend a lot the distribution of sizes and content of these
> strings. So it makes more sense to benchmark an actual application, and we
> have our own custom benchmarks.
> That being said, one of the cases where we have found this optimization to
> be impactful is `operator==(const string&, const string&)`. libcxx has a family
> of benchmarks for `BM_StringRelational_Eq`)
> <https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp>,
> which I'm going to use here.
>
> BM_StringRelational_Eq benchmarks comparison of strings of size 7 (Small),
> 63 (Large) and 5k (Huge) characters, in four scenarios (scenarii ?):
>  - The equal case (Control), which is theoretically the worst case as you
> have to prove that all bytes are equal.
>  - The case when strings differ. In that case bcmp() only needs to prove
> that one byte differs to return nonzero. Typical cases where strings differ
> are at the start of the string (ChangeFirst), but also, interestingly, at
> the end (ChangeLast, when you are comparing strings with a common prefix,
> which happens frequently e.g. when comparing subsequent elements of a
> sorted list of strings). Another interesting case is the case when the
> change position is in the middle (ChangeMiddle).
>
> For this comparison, I'm using as base the call to `memcmp`, and as
> experiment the following crude bcmp() implementation (I'm assuming X86_64),
> that shows how we can take advantage of the observations above to optimize
> typical cases.
>
> ```
> #define UNALIGNED_LOAD64(_p) (*reinterpret_cast<const uint64_t *>(_p))
>
> extern "C" int bcmp(const void* p1, const void* p2, size_t n) throw() {
>   const char* a = reinterpret_cast<const char*>(p1);
>   const char* b = reinterpret_cast<const char*>(p2);
>   if (n >= 8) {
>     uint64_t u = UNALIGNED_LOAD64(a) ^ UNALIGNED_LOAD64(b);
>     uint64_t v = UNALIGNED_LOAD64(a + n - 8) ^ UNALIGNED_LOAD64(b + n - 8);
>     if ((u | v) != 0) {
>       return 1;
>     }
>   }
>   return memcmp(a, b, n);
> }
> ```
>
> Note that:
>  - there is a bit of noise in the results, but you'll see that this quite
> dumb bcmp() reasonably improves {Large,Huge}_{ChangeFirst,ChangeLast} (note
> the the  improvement to {Large,Huge}_ChangeLast cannot be achieved with
> the semantics of memcmp) without hurting the `ChangeMiddle` and `Control`
> cases.
>  - the small string case (size==7) is not modified by our change because
> there is a "fast path" for very small sizes on operator==.
>  - We are still experimenting with the final bcmp() implementation (in
> particular improving the `Control` and `ChangeMiddle` cases by improving
> parallelism). Our current version is better than memcmp() across the
> board on X86.
>
>
>                                                 base (ns)   exp (ns)
> BM_StringRelational_Eq_Empty_Empty_Control      1.65        1.7
> BM_StringRelational_Eq_Empty_Small_Control      1.37        1.4
> BM_StringRelational_Eq_Empty_Large_Control      1.37        1.44
> BM_StringRelational_Eq_Empty_Huge_Control       1.38        1.44
> BM_StringRelational_Eq_Small_Small_Control      6.53        6.51
> BM_StringRelational_Eq_Small_Small_ChangeFirst  1.96        1.94
> BM_StringRelational_Eq_Small_Small_ChangeMiddle 5.06        4.95
> BM_StringRelational_Eq_Small_Small_ChangeLast   6.77        6.84
> BM_StringRelational_Eq_Small_Large_Control      1.38        1.41
> BM_StringRelational_Eq_Small_Huge_Control       1.37        1.39
> BM_StringRelational_Eq_Large_Large_Control      5.54        5.8
> BM_StringRelational_Eq_Large_Large_ChangeFirst  6.25        3.06
> BM_StringRelational_Eq_Large_Large_ChangeMiddle 5.5         5.94
> BM_StringRelational_Eq_Large_Large_ChangeLast   6.04        3.42
> BM_StringRelational_Eq_Large_Huge_Control       1.1         1.1
> BM_StringRelational_Eq_Huge_Huge_Control        118         118
> BM_StringRelational_Eq_Huge_Huge_ChangeFirst    5.65        3.02
> BM_StringRelational_Eq_Huge_Huge_ChangeMiddle   69.5        66.9
> BM_StringRelational_Eq_Huge_Huge_ChangeLast     118         3.43
>
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190111/d3fc3a68/attachment.html>