[libc-commits] [PATCH] D80010: [libc] Add memset and bzero implementations

Guillaume Chatelet via Phabricator via libc-commits libc-commits at lists.llvm.org
Mon May 18 06:57:14 PDT 2020


gchatelet marked 11 inline comments as done.
gchatelet added inline comments.


================
Comment at: libc/src/string/memory_utils/memset_utils.h:20
+template <size_t kBlockSize> static void SetBlock(char *dst, unsigned value) {
+  __builtin_memset(dst, value, kBlockSize);
+}
----------------
abrachet wrote:
> gchatelet wrote:
> > Theoretically the compiler is allowed to call `memset` here and end up with a recursive call; in practice it doesn't happen, but this should be replaced with `__builtin_memset_inline` once it's available in clang.
> Of course I can't find it anymore, but I think we had a CMake rule for `memcpy` to check whether the file contained any references to `memcpy`. Is that being done here?
Yes, I removed it because it was brittle; also, we now have tests that cover all implementations, so the stack overflow would be caught.


================
Comment at: libc/src/string/memory_utils/memset_utils.h:95
+    return SetBlockOverlap<64>(dst, value, count);
+  return SetAlignedBlocks<32>(dst, value, count);
+}
----------------
abrachet wrote:
> Why choose 32 and not 64? If 32 is the best choice here, then why does the `count <= 128` condition exist?
This may look surprising indeed. We want to balance two things here:
 - The number of redundant writes,
 - The number of conditionals for sizes `<= 128`: 91% of `memset` calls are for such sizes.

For the range 64-128:
 - `SetBlockOverlap<64>` uses no conditionals but always writes 128 bytes; this is wasteful for 65 but efficient for 128.
 - `SetAlignedBlocks<32>` would consume between 3 and 4 conditionals and write 96 or 128 bytes.
 - Another option would be a hybrid approach: `Copy<64>` + `Overlap<32>` for 65-96 and `Copy<96>` + `Overlap<32>` for 97-128. (A sketch of the overlap strategy follows this list.)
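
To make the overlap strategy concrete, here is a minimal sketch building on the `SetBlock` helper quoted above; it illustrates the technique, not the exact code from the patch:

  #include <stddef.h> // size_t

  template <size_t kBlockSize> static void SetBlock(char *dst, unsigned value) {
    __builtin_memset(dst, value, kBlockSize);
  }

  // Valid for kBlockSize <= count <= 2 * kBlockSize: the two fixed-size
  // writes overlap in the middle and together cover exactly [0, count),
  // trading redundant stores for the absence of conditionals.
  template <size_t kBlockSize>
  static void SetBlockOverlap(char *dst, unsigned value, size_t count) {
    SetBlock<kBlockSize>(dst, value);
    SetBlock<kBlockSize>(dst + count - kBlockSize, value);
  }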

Benchmarks showed that redundant writes were cheap (on Intel x86) but conditionals were expensive, even on processors that do not support writing 64 bytes at a time (pre-AVX512F). We also want to favor short functions that allow more hot code to fit in the L1 instruction cache.

Above 128 we have to use conditionals since we don't know the upper bound in advance. `SetAlignedBlocks<64>` may waste up to 63 bytes, `SetAlignedBlocks<32>` may waste up to 31 bytes. Benchmarks showed that `SetAlignedBlocks<64>` was not superior for the sizes that mattered. (A sketch of the aligned-blocks strategy follows.)
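
Here is a rough sketch of the aligned-blocks strategy, reusing the `SetBlock` helper from the sketch above; again this is illustrative rather than the patch's exact code:

  #include <stddef.h> // size_t
  #include <stdint.h> // uintptr_t

  // Assumes count >= kBlockSize; SetBlock as in the earlier sketch.
  template <size_t kBlockSize>
  static void SetAlignedBlocks(char *dst, unsigned value, size_t count) {
    SetBlock<kBlockSize>(dst, value); // Head block, possibly unaligned.
    // Offset of the first kBlockSize-aligned address past the head block.
    size_t offset = kBlockSize - reinterpret_cast<uintptr_t>(dst) % kBlockSize;
    // Aligned middle blocks; the partial overlaps with the head and tail
    // blocks are where the "up to kBlockSize - 1 wasted bytes" come from.
    for (; offset + kBlockSize < count; offset += kBlockSize)
      SetBlock<kBlockSize>(dst + offset, value);
    SetBlock<kBlockSize>(dst + count - kBlockSize, value); // Tail block.
  }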

To be honest, this part of the implementation is subject to change as we benchmark more processors. We may also want to specialize it for processors whose dedicated instructions perform better (e.g. `rep stosb`).
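
For the record, a hypothetical x86-64 specialization could wrap the string instruction in GNU-style inline assembly along these lines (not part of this patch; the function name is made up):

  #include <stddef.h> // size_t

  // "rep stosb" stores AL into [RDI] RCX times; on recent Intel cores
  // (ERMSB) this is competitive with vector loops for large sizes.
  static void SetRepStosb(char *dst, unsigned char value, size_t count) {
    asm volatile("rep stosb"
                 : "+D"(dst), "+c"(count) // RDI = dst, RCX = count (clobbered).
                 : "a"(value)             // AL = byte pattern.
                 : "memory");
  }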


================
Comment at: libc/test/src/string/bzero_test.cpp:39
       for (size_t i = 0; i < count; ++i)
-        ASSERT_EQ(buffer[align + i], groundtruth[i]);
+        ASSERT_EQ(buffer[align + i], char(0));
       // Everything after copy is untouched.
----------------
abrachet wrote:
> There was recently a patch which changed all uses of functional-style casts. I don't know whether the style guide forbids them or not, but maybe it's better to use `(char) 0` or `'\0'`?
This is not a functional cast but a constructor of type `char` with value `0`.
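
For reference, the three spellings denote the same value, so the question is purely one of style; a tiny illustration:

  char a = char(0); // explicit conversion, functional notation
  char b = (char)0; // C-style cast
  char c = '\0';    // character literal
  static_assert(char(0) == '\0', "all three denote the same value");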


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80010/new/

https://reviews.llvm.org/D80010




