[PATCH] D32971: [scudo] CRC32 optimizations

Mon May 8 13:50:22 PDT 2017

dvyukov added inline comments.

================
Comment at: lib/scudo/scudo_allocator.cpp:39
-
-INLINE u32 computeCRC32(u32 Crc, uptr Data, u8 HashType) {
-  // If SSE4.2 is defined here, it was enabled everywhere, as opposed to only
----------------
cryptoad wrote:
> dvyukov wrote:
> > cryptoad wrote:
> > > cryptoad wrote:
> > > > dvyukov wrote:
> > > > > cryptoad wrote:
> > > > > > dvyukov wrote:
> > > > > > > It looked nice to have a separate function that wraps all of this logic. Why do you inline it?
> > > > > > > I would just move the load of HashAlgorithm into this function so that it does not happen when not needed.
> > > > > > Indeed, I considered this as well.
> > > > > > The other advantage of doing it the new way was to have the `crc32` instructions inlined as opposed to be calls (which is another part of the performance gain), which required me to use the intrinsic within the loop. After duplicating the loop with intrinsic, I didn't see a need to keep that function.
> > > > > Do you can have intrinsics in the inlinable computeCRC32 function. It should lead to the same assembly.
> > > > > With intrinsics it becomes more complex, so there is even more reason to have a separate function for it. Consider that we want to calculate crc in a another place.
> > > > So something like:
> > > > ```
> > > > INLINE u32 computeCRC32(u32 Crc, uptr Data, u8 HashType) {
> > > > #if defined(__SSE4_2__) || defined(__ARM_FEATURE_CRC32)
> > > >   return CRC32_INTRINSIC(Crc, Data);
> > > > #else
> > > >   if (atomic_load_relaxed(&HashAlgorithm) == CRC32Hardware)
> > > >     return computeHardwareCRC32(Crc, Data);
> > > >   return computeSoftwareCRC32(Crc, Data);
> > > > #endif
> > > > }
> > > > ```
> > > (without the 3rd param)
> > yup
> Just gave it a try, for the non-full-SSE the loop is unrolled but with a load & check of the algorithm at each iteration (2 for 64-bit, 3 for 32-bit):
> 
> ```
>                 mov     rax, cs:_ZN7__scudoL6CookieE ; __scudo::Cookie
>                 mov     dl, cs:_ZN7__scudoL13HashAlgorithmE ; unsigned __int64
>                 cmp     dl, 1
>                 jnz     short loc_884B
>                 mov     edi, eax        ; this
>                 mov     rsi, rcx        ; unsigned int
>                 call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
>                 jmp     loc_88DB
> loc_884B:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+1Aj
> 
> ... software crc32 ...
> 
> loc_88DB:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+26j
>                 mov     cl, cs:_ZN7__scudoL13HashAlgorithmE ; __scudo::HashAlgorithm
>                 cmp     cl, 1
>                 jnz     short loc_88FC
>                 and     r14, 0FFFFFFFFFFFF0000h
>                 mov     edi, eax        ; this
>                 mov     rsi, r14        ; unsigned int
>                 call    _ZN7__scudo20computeHardwareCRC32Ejm ; __scudo::computeHardwareCRC32(uint,ulong)
>                 jmp     loc_8982
> loc_88FC:                               ; CODE XREF: __scudo::ScudoChunk::computeChecksum(__scudo::UnpackedHeader *)+C4j
> 
> ... software crc32 ...
> ```
> 
> It seems somewhat messier. What do you think?
I missed that this function consumes a single word rather than an array.
I basically mean a function which you give initial crc value and an array of data and it gives you new crc value using the most efficient method available whatever it is.

https://reviews.llvm.org/D32971