[llvm-dev] [RFC] Adding a -memeq-lib-function flag to allow the user to specify a memeq function.

Mon Feb 4 23:28:44 PST 2019

I'd like to move forward with this. Since there are no objections to the
approach suggested by James, I've tentatively sent the corresponding v5
implementation (https://reviews.llvm.org/D56593) for review.

On Fri, Jan 11, 2019 at 1:45 PM Clement Courbet <courbet at google.com> wrote:

>
>
> On Thu, Jan 10, 2019 at 4:47 PM Clement Courbet <courbet at google.com>
> wrote:
>
>>
>>
>> On Wed, Jan 9, 2019 at 6:16 PM James Y Knight <jyknight at google.com>
>> wrote:
>>
>>>
>>>
>>> On Tue, Jan 8, 2019 at 9:24 AM Clement Courbet <courbet at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jan 7, 2019 at 10:26 PM James Y Knight <jyknight at google.com>
>>>> wrote:
>>>>
>>> I'm afraid about the "almost" and "generally": what about users who
>>>> don't ?
>>>>
>>>
>>> Even so, it should be fine to enable it for those platforms which do
>>> include it.
>>>
>>> I do note, sadly, that currently out of all these implementations, only
>>>>> NetBSD and FreeBSD seem to actually define a separate more optimized bcmp
>>>>> function. That does mean that this optimization would be effectively a
>>>>> no-op, for the vast majority of people.
>>>>>
>>>>
>>>> This might or might not be considered really an issue.
>>>>
>>>
>>> Right, the issue is adding an effectively useless optimization in llvm.
>>>
>>>  - In my original proposal, people have to explicitly opt-in to the
>>>> feature and link to their memcmp implementation, they do not get the
>>>> improvement automatically.
>>>>  - In this proposal, they have to patch their libc, which might be
>>>> slightly more painful depending on the system.
>>>>
>>>
>>> Users may also include a function named bcmp in their binary, which will
>>> overrides the one from libc.
>>>
>>> Here's a patch with this proposal to see what this looks like:
>>>> https://reviews.llvm.org/D56436
>>>>
>>>
>>> It feels like this optimization would be better done in
>>> llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp,
>>>
>>
>>  I'll have a look at this approach.
>>
>
> You're right, that's a better place to do this indeed:
> https://reviews.llvm.org/D56593
>
>
>
>>
>>
>>>
>>>  But if you can show a similar performance win in public code, it'd be
>>>>> great to attempt to push a more optimized version upstream at least to
>>>>> glibc. Some more precise numbers than "very large improvement" are probably
>>>>> necessary to show it's actually worth it. :)
>>>>>
>>>>
>>>> We were planning to contribute it to compiler-rt. Contributing a
>>>> deprecated function to the libc sounds.... weird.
>>>>
>>>
>>> Yes, contributing an optimization for a deprecated function is indeed
>>> weird. Thus the importance of reliable performance numbers justifying the
>>> importance -- I'd never have thought that the performance cost of returning
>>> an ordering from memcmp would be important, and I suspect nobody else did.
>>>
>>
>> Fair enough, let me give some numbers for this change.
>> Before that, a caveat with any benchmarks for comparing strings is that
>> the results depend a lot the distribution of sizes and content of these
>> strings. So it makes more sense to benchmark an actual application, and we
>> have our own custom benchmarks.
>> That being said, one of the cases where we have found this optimization
>> to be impactful is `operator==(const string&, const string&)`. libcxx has a family
>> of benchmarks for `BM_StringRelational_Eq`)
>> <https://github.com/llvm-mirror/libcxx/blob/master/benchmarks/algorithms.bench.cpp>,
>> which I'm going to use here.
>>
>> BM_StringRelational_Eq benchmarks comparison of strings of size 7
>> (Small), 63 (Large) and 5k (Huge) characters, in four scenarios (scenarii
>> ?):
>>  - The equal case (Control), which is theoretically the worst case as you
>> have to prove that all bytes are equal.
>>  - The case when strings differ. In that case bcmp() only needs to prove
>> that one byte differs to return nonzero. Typical cases where strings differ
>> are at the start of the string (ChangeFirst), but also, interestingly, at
>> the end (ChangeLast, when you are comparing strings with a common prefix,
>> which happens frequently e.g. when comparing subsequent elements of a
>> sorted list of strings). Another interesting case is the case when the
>> change position is in the middle (ChangeMiddle).
>>
>> For this comparison, I'm using as base the call to `memcmp`, and as
>> experiment the following crude bcmp() implementation (I'm assuming X86_64),
>> that shows how we can take advantage of the observations above to optimize
>> typical cases.
>>
>> ```
>> #define UNALIGNED_LOAD64(_p) (*reinterpret_cast<const uint64_t *>(_p))
>>
>> extern "C" int bcmp(const void* p1, const void* p2, size_t n) throw() {
>>   const char* a = reinterpret_cast<const char*>(p1);
>>   const char* b = reinterpret_cast<const char*>(p2);
>>   if (n >= 8) {
>>     uint64_t u = UNALIGNED_LOAD64(a) ^ UNALIGNED_LOAD64(b);
>>     uint64_t v = UNALIGNED_LOAD64(a + n - 8) ^ UNALIGNED_LOAD64(b + n -
>> 8);
>>     if ((u | v) != 0) {
>>       return 1;
>>     }
>>   }
>>   return memcmp(a, b, n);
>> }
>> ```
>>
>> Note that:
>>  - there is a bit of noise in the results, but you'll see that this quite
>> dumb bcmp() reasonably improves {Large,Huge}_{ChangeFirst,ChangeLast} (note
>> the the  improvement to {Large,Huge}_ChangeLast cannot be achieved with
>> the semantics of memcmp) without hurting the `ChangeMiddle` and
>> `Control` cases.
>>  - the small string case (size==7) is not modified by our change because
>> there is a "fast path" for very small sizes on operator==.
>>  - We are still experimenting with the final bcmp() implementation (in
>> particular improving the `Control` and `ChangeMiddle` cases by improving
>> parallelism). Our current version is better than memcmp() across the
>> board on X86.
>>
>>
>>                                                 base (ns)   exp (ns)
>> BM_StringRelational_Eq_Empty_Empty_Control      1.65        1.7
>> BM_StringRelational_Eq_Empty_Small_Control      1.37        1.4
>> BM_StringRelational_Eq_Empty_Large_Control      1.37        1.44
>> BM_StringRelational_Eq_Empty_Huge_Control       1.38        1.44
>> BM_StringRelational_Eq_Small_Small_Control      6.53        6.51
>> BM_StringRelational_Eq_Small_Small_ChangeFirst  1.96        1.94
>> BM_StringRelational_Eq_Small_Small_ChangeMiddle 5.06        4.95
>> BM_StringRelational_Eq_Small_Small_ChangeLast   6.77        6.84
>> BM_StringRelational_Eq_Small_Large_Control      1.38        1.41
>> BM_StringRelational_Eq_Small_Huge_Control       1.37        1.39
>> BM_StringRelational_Eq_Large_Large_Control      5.54        5.8
>> BM_StringRelational_Eq_Large_Large_ChangeFirst  6.25        3.06
>> BM_StringRelational_Eq_Large_Large_ChangeMiddle 5.5         5.94
>> BM_StringRelational_Eq_Large_Large_ChangeLast   6.04        3.42
>> BM_StringRelational_Eq_Large_Huge_Control       1.1         1.1
>> BM_StringRelational_Eq_Huge_Huge_Control        118         118
>> BM_StringRelational_Eq_Huge_Huge_ChangeFirst    5.65        3.02
>> BM_StringRelational_Eq_Huge_Huge_ChangeMiddle   69.5        66.9
>> BM_StringRelational_Eq_Huge_Huge_ChangeLast     118         3.43
>>
>>
>>
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190205/a5db76d0/attachment.html>