[libc-commits] [PATCH] D74397: [libc] Adding memcpy implementation for x86_64

Wed Feb 12 02:42:45 PST 2020

gchatelet added a comment.

In D74397#1870989 <https://reviews.llvm.org/D74397#1870989>, @abrachet wrote:

> I've gone ahead and copy and pasted some of this into godbolt to save others some time if they wish to play around https://godbolt.org/z/z4dCmj :)

Thx :)
One should also add `-fno-builtin-memcpy`.
It's also worth playing with `-mno-avx` or `-mavx512f` to see the difference in codegen.

>> How do we build? We may want to test in debug but build the libc with -march=native for instance,
> 
> This is logical. To my knowledge we don't currently do anything special when CMAKE_BUILD_TYPE=Release but this makes sense to turn on for release and benchmarking.

Yes, I think we want to get the most out of the architecture we target (see the difference in codegen depending on the available features, avx / avx512f).

>> With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),
> 
> I think we can do this in cmake with `add_custom_command` and `POST_BUILD` specified then just `nm $<TARGET_FILE:target> | grep "U memcpy"`

Thx, that's useful.

In D74397#1871338 <https://reviews.llvm.org/D74397#1871338>, @sivachandra wrote:

> > How do we customize the implementation? (i.e. how to define `kRepMovsBSize`),
>
> What kind of customizations?

The performance of the `rep movsb` instruction is highly dependent on the targeted microarchitecture.
For one, there is the ERMS cpuid flag <https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features> that tells whether it should be used at all.
Then, depending on the microarchitecture, the crossover point between aligned copy and `rep movsb` varies from 512 bytes to a few kilobytes.
Ideally we need to adapt this threshold somehow to provide the best implementation.

Eventually, when `rep movsb` becomes excellent we can replace the function entirely with this single instruction.

I understand that `llvm-libc` is meant to be a "pick and choose what you need" libc, which implies some sort of customization anyway. Am I right?

> 
> 
>> - How do we specify custom compilation flags? (We'd need `-fno-builtin-memcpy` to be passed in),
> 
> Do we need to pass this flag for building llvm-libc, or to user code (and llvm-libc tests?) Reading the code, it seems to me that it is the latter, correct?

It is solely for this specific compilation unit.
When using `clang` we can use `__attribute__((no_builtin("memcpy")))` <https://clang.llvm.org/docs/AttributeReference.html#no-builtin> but it won't work with gcc.

> 
> 
>> How do we build? We may want to test in debug but build the libc with `-march=native` for instance,
> 
> Not sure I understand this fully. What are the use cases/goals of what you are describing here?

See above: the generated code is improved (smaller and faster) when targeting a specific architecture.
glibc uses IFUNC <https://sourceware.org/glibc/wiki/GNU_IFUNC> for that purpose and lets the runtime pick the best implementation.
Although this is feasible in llvm-libc, we noticed that the required extra level of indirection hurts branch prediction and latency for small sizes (which are the most frequent <https://github.com/llvm/llvm-project/tree/master/libc/utils/benchmarks#benchmarking-regimes>).

>> Clang has a brand new builtin `__builtin_memcpy_inline` which makes the implementation easy and efficient, but:
>> 
>> - If we compile with `gcc` or `msvc` we can't use it, resorting on less efficient code generation,
> 
> Less efficient wrt clang compiled code? Can we rephrase what you are saying as, "we make use of a better optimization when compiled with clang?"

Ok

> 
> 
>> - With gcc we can use `__builtin_memcpy` but then we'd need a postprocess step to check that the final assembly do not contain call to `memcpy` (unlikely but allowed),
> 
> The concern is that it will become a recursive call?

Yes, this is technically possible but I've never seen it in practice.
It is highly unlikely that a compiler would generate a call to `memcpy` for sizes `<=64B`.

> What action is to be taken if we do find a call to `memcpy` in the final assembly?

I believe we should just refuse to compile on such a compiler if it happens.

>> - For msvc we'd need to resort on the compiler optimization passes.
> 
> Does "we" mean llvm-libc developers?

Yes, sorry for not being clear here.

> A related question: considering we are using `__builtin_memcpy` and inline assembly, does the code work as is with MSVC?

No it doesn't. I can add a fallback case if you want, but the generated code is not good right now.
This means we'd need to specialize the `CopyRepMovsb` function so it uses the correct syntax for msvc (`__asm`) instead of the provided `LIBC_INLINE_ASM`.

The patch is meant to get the conversation started and is not a full implementation just yet (although it is functional).

Quick question @sivachandra , how do we currently build `llvm-libc`?
`utils/build_scripts` <https://github.com/llvm/llvm-project/blob/master/libc/docs/source_layout.rst#the-utilsbuild_scripts-directory> seems to be missing, so I could only build the entrypoints via transitive dependency through tests.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D74397