[libc-commits] [PATCH] D80010: [libc] Add memset and bzero implementations
Siva Chandra via Phabricator via libc-commits
libc-commits at lists.llvm.org
Mon May 18 09:40:06 PDT 2020
sivachandra accepted this revision.
sivachandra added a comment.
This revision is now accepted and ready to land.
LGTM overall.
It may be a tad incomplete in the sense that `bzero` will not show up in the public `string.h`, but I don't think this patch should block on that. `bzero` is essentially an LLVM libc extension at this point. It probably makes sense to add a new standard spec called `llvm-libc-ext`, add all llvm-libc-specific functions there, and then expose them via public headers.
================
Comment at: libc/src/string/memory_utils/memset_utils.h:20
+template <size_t kBlockSize> static void SetBlock(char *dst, unsigned value) {
+ __builtin_memset(dst, value, kBlockSize);
+}
----------------
gchatelet wrote:
> abrachet wrote:
> > gchatelet wrote:
> > > Theoretically the compiler is allowed to call `memset` here and end up with a recursive call; in practice it doesn't happen. However, this should be replaced with `__builtin_memset_inline` once it's available in Clang.
> > Of course I can't find it anymore, but I think we had a CMake rule for `memcpy` that checked whether the file contained any references to `memcpy`. Is that being done here?
> Yes, I removed it because it was brittle; also, we now have tests that cover all implementations, so the stack overflow would be caught.
Can you add your note as a comment here?
================
Comment at: libc/src/string/memory_utils/memset_utils.h:95
+ return SetBlockOverlap<64>(dst, value, count);
+ return SetAlignedBlocks<32>(dst, value, count);
+}
----------------
gchatelet wrote:
> abrachet wrote:
> > Why choose 32 and not 64? If 32 is the best choice here then why does the `count <= 128` condition exist?
> This may look surprising indeed. We want to balance two things here:
> - The number of redundant writes,
> - The number of conditionals for sizes `<=128`: 91% of `memset` calls are for such sizes.
>
> For the range 64-128:
> - `SetBlockOverlap<64>` uses no conditionals but always writes 128 bytes; this is wasteful for 65 but efficient for 128.
> - `SetAlignedBlocks<32>` would consume between 3 and 4 conditionals and write 96 or 128 bytes.
> - Another option would be a hybrid approach: Copy<64>+Overlap<32> for 65-96 and Copy<96>+Overlap<32> for 97-128.
>
> Benchmarks showed that redundant writes were cheap (on Intel x86) but conditionals were expensive, even on processors that do not support writing 64 B at a time (pre-AVX512F). We also want to favor short functions that allow more hot code to fit in the L1 instruction cache.
>
> Above 128 we have to use conditionals since we don't know the upper bound in advance. `SetAlignedBlocks<64>` may waste up to 63 bytes, `SetAlignedBlocks<32>` up to 31 bytes. Benchmarks showed that `SetAlignedBlocks<64>` was not superior for the sizes that mattered.
>
> To be honest, this part of the implementation is subject to change as we benchmark more processors. We may also want to specialize it for processors with dedicated instructions that perform better (`rep stosb`).
Can you add your explanation as a comment here?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D80010/new/
https://reviews.llvm.org/D80010