[libc-commits] [PATCH] D80010: [libc] Add memset and bzero implementations
Siva Chandra via Phabricator via libc-commits
libc-commits at lists.llvm.org
Mon May 18 09:40:06 PDT 2020
sivachandra accepted this revision.
sivachandra added a comment.
This revision is now accepted and ready to land.
LGTM overall.
It may be a tad incomplete in the sense that `bzero` will not show up in the public `string.h`, but I don't think this patch should block on that. `bzero` is essentially an LLVM libc extension at this point. It probably makes sense to add a new standard spec called `llvm-libc-ext`, add all llvm-libc-specific functions there, and then expose them via public headers.
================
Comment at: libc/src/string/memory_utils/memset_utils.h:20
+template <size_t kBlockSize> static void SetBlock(char *dst, unsigned value) {
+ __builtin_memset(dst, value, kBlockSize);
+}
----------------
gchatelet wrote:
> abrachet wrote:
> > gchatelet wrote:
> > > Theoretically the compiler is allowed to call `memset` here and end up with a recursive call; in practice it doesn't happen. However, this should be replaced with `__builtin_memset_inline` once it's available in Clang.
> > Of course I can't find it anymore, but I think we had a CMake rule for `memcpy` that checked whether the file contained any references to `memcpy`. Is that being done here?
> Yes, I removed it because it was brittle; also, we now have tests that cover all implementations, so the stack overflow would be caught.
Can you add your note as a comment here?
================
Comment at: libc/src/string/memory_utils/memset_utils.h:95
+ return SetBlockOverlap<64>(dst, value, count);
+ return SetAlignedBlocks<32>(dst, value, count);
+}
----------------
gchatelet wrote:
> abrachet wrote:
> > Why choose 32 and not 64? If 32 is the best choice here then why does the `count <= 128` condition exist?
> This may look surprising indeed. We want to balance two things here:
> - The number of redundant writes,
> - The number of conditionals for sizes `<=128`: 91% of `memset` calls are for such sizes.
>
> For the range 64-128:
> - `SetBlockOverlap<64>` uses no conditionals but always writes 128 bytes; this is wasteful for 65 but efficient for 128.
> - `SetAlignedBlocks<32>` would consume between 3 and 4 conditionals and write 96 or 128 bytes.
> - Another option would be a hybrid approach: Copy<64>+Overlap<32> for 65-96 and Copy<96>+Overlap<32> for 97-128.
>
> Benchmarks showed that redundant writes were cheap (on Intel x86) but conditionals were expensive, even on processors that do not support writing 64 B at a time (pre-AVX512F). We also want to favor short functions that allow more hot code to fit in the L1 instruction cache.
>
> Above 128 we have to use conditionals since we don't know the upper bound in advance. `SetAlignedBlocks<64>` may waste up to 63 bytes, `SetAlignedBlocks<32>` up to 31 bytes. Benchmarks showed that `SetAlignedBlocks<64>` was not superior for the sizes that mattered.
>
> To be honest, this part of the implementation is subject to change as we benchmark more processors. We may also want to specialize it for processors with dedicated instructions that perform better (`rep stosb`).
Can you add your explanation as a comment here?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D80010/new/
https://reviews.llvm.org/D80010