[lldb-dev] Improve performance of crc32 calculation

Tue Apr 18 11:06:47 PDT 2017

Thank you for that clarification.  Sounds like we can't change the crc code
then.

I realized I had been using GNU's gold linker.  I switched to linking with
lld(-4.0) and now linking uses less than 1/3rd the cpu.  It seems that the
default hashing (fast == xxHash) is faster than whatever gold was using.
I'll just switch to that and call it a day.

On Tue, Apr 18, 2017 at 5:46 AM, Pavel Labath <labath at google.com> wrote:

> What we need is the ability to connect a stripped version of an SO to one
> with debug symbols present. Currently there are (at least) two ways to
> achieve that:
>
> - build-id: both SOs have a build-id section with the same value.
> Normally, that's added by a linker in the final link, and subsequent strip
> steps do not remove it. Normally the build-id is some sort of a hash of the
> *initial* file contents, which is why you feel like you are trading
> debugger startup time for link time. However, that is not a requirement, as
> the exact checksumming algorithm does not matter here. A random byte
> sequence would do just fine, which is what "--build-id=uuid" does and it
> should have no impact on your link time. Be sure **not** to use this if you
> care about deterministic builds though.
>
> - gnu_debuglink: here, the stripped SO contains a checksum of the original
> SO, which is added at strip time. This is done using a fixed algorithm, and
> this is important as the debugger needs to arrive at the same checksum as
> the strip tool. Also worth noting is that this mechanism embeds the path of
> the original SO into the stripped one, whereas the first one leaves the
> search task up to the debugger. This may be a plus or a minus, depending on
> your use case.
>
> Hope that makes things a bit clearer. Cheers,
> pl
>
>
> On 13 April 2017 at 18:31, Scott Smith <scott.smith at purestorage.com>
> wrote:
>
>> Interesting.  That saves lldb startup time (after crc
>> improvements/parallelization) by about 1.25 seconds wall clock / 10 seconds
>> cpu time, but increases linking by about 2 seconds of cpu time (and an
>> inconsistent amount of wall clock time).  That's only a good tradeoff if
>> you run the debugger a lot.
>>
>> If all you need is a unique id, there are cheaper ways of going about
>> it.  The SSE crc instruction would be cheaper, or using CityHash/MurmurHash
>> for other cpus.  I thought it was specifically tied to that crc algorithm.
>> In that case it doesn't make sense to fold this into JamCRC, since that's
>> tied to a difficult-to-optimize algorithm.
>>
>> On Thu, Apr 13, 2017 at 4:28 AM, Pavel Labath <labath at google.com> wrote:
>>
>>> Improving the checksumming speed is definitely a worthwhile
>>> contribution, but be aware that there is a pretty simple way to avoid
>>> computing the crc altogether, and that is to make sure your binaries have a
>>> build ID. This is generally as simple as adding -Wl,--build-id to your
>>> compiler flags.
>>>
>>> +1 to moving the checksumming code to llvm
>>>
>>> pl
>>>
>>> On 13 April 2017 at 07:20, Zachary Turner via lldb-dev <
>>> lldb-dev at lists.llvm.org> wrote:
>>>
>>>> I know this is outside of your initial goal, but it would be really
>>>> great if JamCRC be updated in llvm to be parallel. I see that you're making
>>>> use of TaskRunner for the parallelism, but that looks pretty generic, so
>>>> perhaps that could be raised into llvm as well if it helps.
>>>>
>>>> Not trying to throw extra work on you, but it seems like a really good
>>>> general purpose improvement and it would be a shame if only lldb can
>>>> benefit from it.
>>>> On Wed, Apr 12, 2017 at 8:35 PM Scott Smith via lldb-dev <
>>>> lldb-dev at lists.llvm.org> wrote:
>>>>
>>>>> Ok I stripped out the zlib crc algorithm and just left the parallelism
>>>>> + calls to zlib's crc32_combine, but only if we are actually linking with
>>>>> zlib.  I left those calls here (rather than folding them info JamCRC)
>>>>> because I'm taking advantage of TaskRunner to parallelize the work.
>>>>>
>>>>> I moved the system include block after the llvm includes, both because
>>>>> I had to (to use the config #defines), and because it fit the published
>>>>> coding convention.
>>>>>
>>>>> By itself, it reduces my test time from 55 to 47 seconds. (The
>>>>> original time is slower than before because I pulled the latest code, guess
>>>>> there's another slowdown to fix).
>>>>>
>>>>> On Wed, Apr 12, 2017 at 12:15 PM, Scott Smith <
>>>>> scott.smith at purestorage.com> wrote:
>>>>>
>>>>>> The algorithm included in ObjectFileELF.cpp performs a byte at a time
>>>>>> computation, which causes long pipeline stalls in modern processors.
>>>>>> Unfortunately, the polynomial used is not the same one used by the SSE 4.2
>>>>>> instruction set, but there are two ways to make it faster:
>>>>>>
>>>>>> 1. Work on multiple bytes at a time, using multiple lookup tables.
>>>>>> (see http://create.stephan-brumme.com/crc32/#slicing-by-8-overview)
>>>>>> 2. Compute crcs over separate regions in parallel, then combine the
>>>>>> results.  (see http://stackoverflow.com/quest
>>>>>> ions/23122312/crc-calculation-of-a-mostly-static-data-stream)
>>>>>>
>>>>>> As it happens, zlib provides functions for both:
>>>>>> 1. The zlib crc32 function uses the same polynomial as
>>>>>> ObjectFileELF.cpp, and uses slicing-by-4 along with loop unrolling.
>>>>>> 2. The zlib library provides crc32_combine.
>>>>>>
>>>>>> I decided to just call out to the zlib library, since I see my
>>>>>> version of lldb already links with zlib; however, the llvm CMakeLists.txt
>>>>>> declares it optional.
>>>>>>
>>>>>> I'm including my patch that assumes zlib is always linked in.  Let me
>>>>>> know if you prefer:
>>>>>> 1. I make the change conditional on having zlib (i.e. fall back to
>>>>>> the old code if zlib is not present)
>>>>>> 2. I copy all the code from zlib and put it in ObjectFileELF.cpp.
>>>>>> However, I'm going to guess that requires updating some documentation to
>>>>>> include zlib's copyright notice.
>>>>>>
>>>>>> This brings startup time on my machine / my binary from 50 seconds
>>>>>> down to 32.
>>>>>> (time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM)
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> lldb-dev mailing list
>>>>> lldb-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>>>>>
>>>>
>>>> _______________________________________________
>>>> lldb-dev mailing list
>>>> lldb-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-dev/attachments/20170418/52ff5b8b/attachment.html>