[libc-commits] [PATCH] D74397: [libc] Adding memcpy implementation for x86_64
Guillaume Chatelet via Phabricator via libc-commits
libc-commits at lists.llvm.org
Wed Feb 12 02:42:45 PST 2020
gchatelet added a comment.
In D74397#1870989 <https://reviews.llvm.org/D74397#1870989>, @abrachet wrote:
> I've gone ahead and copy and pasted some of this into godbolt to save others some time if they wish to play around https://godbolt.org/z/z4dCmj :)
Thx : )
One should add `-fno-builtin-memcpy`
Also it's worth playing with `-mno-avx` or `-mavx512f` to see the difference in codegen.
>> How do we build? We may want to test in debug but build the libc with -march=native for instance,
>
> This is logical. To my knowledge we don't currently do anything special when CMAKE_BUILD_TYPE=Release but this makes sense to turn on for release and benchmarking.
Yes I think we want to get the most out of the architecture we target (see the difference in codegen depending on available features avx / avx512f)
>> With gcc we can use __builtin_memcpy but then we'd need a postprocess step to check that the final assembly do not contain call to memcpy (unlikely but allowed),
>
> I think we can do this in cmake with `add_custom_command` and `POST_BUILD` specified then just `nm $<TARGET_FILE:target> | grep "U memcpy"`
Thx that's useful
In D74397#1871338 <https://reviews.llvm.org/D74397#1871338>, @sivachandra wrote:
> > How do we customize the implementation? (i.e. how to define `kRepMovsBSize`),
>
> What kind of customizations?
The `rep movsb` instruction performance is highly tied to the targeted microarchitecture.
For one there is the ERMS cpuid flag <https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_Features> that helps to know if it should be used at all.
Then depending on the microarchitecture the crosspoint between aligned copy and `rep movsb` varies between 512 to a few kilobytes.
Ideally we need to adapt this threshold somehow to provide the best implementation.
Eventually, when `rep movsb` becomes excellent we can replace the function entirely with this single instruction.
I understand that `llvm-libc` is to be a pick and choose what you need libc which implies some sort of customization anyways. Am I right?
>
>
>> - How do we specify custom compilation flags? (We'd need `-fno-builtin-memcpy` to be passed in),
>
> Do we need to pass this flag for building llvm-libc, or to user code (and llvm-libc tests?) Reading the code, it seems to me that it is the latter, correct?
It is solely for this specific compilation unit
When using `clang` we can use `__attribute__((no_builtin("memcpy")))` <https://clang.llvm.org/docs/AttributeReference.html#no-builtin> but it won't work with gcc.
>
>
>> How do we build? We may want to test in debug but build the libc with `-march=native` for instance,
>
> Not sure I understand this fully. What are the use cases/goals of what you are describing here?
See above, the generated code is improved (smaller and faster) when targeting specific architecture.
glibc is using IFUNC <https://sourceware.org/glibc/wiki/GNU_IFUNC> to that matter and lets the runtime pick the best implementation.
Although feasible in llvm-libc we noticed that the required extra level of indirection is hurting branch prediction and latency for small sizes (which are the most frequent <https://github.com/llvm/llvm-project/tree/master/libc/utils/benchmarks#benchmarking-regimes>)
>> Clang has a brand new builtin `__builtin_memcpy_inline` which makes the implementation easy and efficient, but:
>>
>> - If we compile with `gcc` or `msvc` we can't use it, resorting on less efficient code generation,
>
> Less efficient wrt clang compiled code? Can we rephrase what you are saying as, "we make use of a better optimization when compiled with clang?"
Ok
>
>
>> - With gcc we can use `__builtin_memcpy` but then we'd need a postprocess step to check that the final assembly do not contain call to `memcpy` (unlikely but allowed),
>
> The concern is that it will become a recursive call?
Yes this is technically possible but I've never seen it in practice,
It is highly unlikely that a compiler would generate a call to `memcpy` for sizes `<=64B`.
> What action is to be taken if we do find a call to `memcpy` in the final assembly?
I believe we should just refuse to compile on such a compiler if it happens.
>> - For msvc we'd need to resort on the compiler optimization passes.
>
> Does "we" mean llvm-libc developers?
Yes, sorry for not being clear here.
> A related question: considering we are using `__builtin_memcpy` and inline assembly, does the code work as is with MSVC?
No it doesn't, I can add a fallback case if you want but the generated code is not good right now.
This means we'd need to specialize the `CopyRepMovsb` function so it uses the correct syntax for msvc `__asm` instead of the provided `LIBC_INLINE_ASM`
The patch is to get the conversation started and is not a full just implementation yet (although it is functional).
Quick question @sivachandra , how do we currently build `llvm-libc`?
`utils/build_scripts` <https://github.com/llvm/llvm-project/blob/master/libc/docs/source_layout.rst#the-utilsbuild_scripts-directory> seems to be missing so I could only build the entrypoints via transitive dependency through tests.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D74397/new/
https://reviews.llvm.org/D74397
More information about the libc-commits
mailing list