[libc-commits] [PATCH] D74397: [libc] Adding memcpy implementation for x86_64

Wed Mar 11 09:04:40 PDT 2020

gchatelet added a comment.

I think we misunderstood each other.

My original goal was to also support shared library with runtime dispatch (postponed but we'll do it eventually)
In that case we need a single library with all the implementations, they need to have different names to prevent ODR, and `memcpy` would be the dispatching function.
That's why I provided the custom definitions `MEMCPY_NAME=${memcpy_name}`.

Now I can indeed remove all of this, have different targets generating different object files all with the same `memcpy` function in it. This means that I need to create X unit tests and X benchmarks all linking with a different targets but all referring to the same `memcpy` function (just a different implementation). This works, and I've updated the patch accordingly.

However it will not be straightforward to go from this to the "shared library with runtime dispatch" state. We would need extra machinery to copy the `memcpy` functions of each implementations into the final object and rename them to prevent ODR.

The current implementation is simpler though. Let me know what you think.

================
Comment at: libc/src/string/CMakeLists.txt:78
+  compile_memcpy_with_flags(avx512 FLAGS -mavx512f)
+endif()
+
----------------
MaskRay wrote:
> sivachandra wrote:
> > gchatelet wrote:
> > > sivachandra wrote:
> > > > gchatelet wrote:
> > > > > gchatelet wrote:
> > > > > > sivachandra wrote:
> > > > > > > abrachet wrote:
> > > > > > > > sivachandra wrote:
> > > > > > > > > gchatelet wrote:
> > > > > > > > > > This now creates the following memcpy implementations
> > > > > > > > > >  - `__llvm_libc::memcpy_x86_64_avx512`
> > > > > > > > > >  - `__llvm_libc::memcpy_x86_64_avx`
> > > > > > > > > >  - `__llvm_libc::memcpy_x86_64_sse2`
> > > > > > > > > >  - `__llvm_libc::memcpy_x86_64_sse`
> > > > > > > > > >  - `__llvm_libc::memcpy_x86_64_unopt`
> > > > > > > > > > 
> > > > > > > > > > For shared libc, we need an ifunc like trampoline to select the correct version.
> > > > > > > > > > For static libc, we need to select an implementation
> > > > > > > > > > 
> > > > > > > > > > @sivachandra how do you see this kind if code generation integrate into the more general cmake functions from `libc/cmake/modules/LLVMLibCRules.cmake`?
> > > > > > > > > > I expect other memory functions to follow the same scheme.
> > > > > > > > > Instead of building all the possible implementations, could we use the CMake [[ https://cmake.org/cmake/help/v3.14/command/try_compile.html | `try_compile` ]] and/or [[ https://cmake.org/cmake/help/v3.14/command/try_run.html | `try_run` ]] command to sniff out the best flags to use? I think `try_run` is more appropriate as I expect that we need to run the `cpuid` instruction? Also, compilers have a convenience macro `__cpuid` to run this instruction on x86/x86_64?
> > > > > > > > > 
> > > > > > > > > BTW, one can have ifuncs in static libraries as well. But, I do understand we want to avoid the overhead of the added indirection, so sniffing out at configure time is the best. If we can setup something for configure time sniffing, I believe we should be able use it (may be with straightforward extension/modification iff required) to use as the ifunc selector as well.
> > > > > > > > I think this is going to get quickly outside of the scope of this patch. It would probably be better to work on this in a separate patch.
> > > > > > > If there are any infrastructure pieces required to build a full solution, then we can do them separately as a prerequisite. However, if correct compile options are critical to memcpy implementation, then any code added to deduce them should belong to this patch. I agree that the patch will start to become too big. But, I think the additions to memcpy_utils and their tests can be split out to another prerequisite patch.
> > > > > > @sivachandra `try_compile` and `try_run` will only work when the host is also the target of the build. For instance it won't work when cross compiling.
> > > > > > 
> > > > > > There are a few dimensions here:
> > > > > > 1. static library
> > > > > >     - autodetect which implementation to use with `try_compile` / `try_run`
> > > > > >     - manually choose the implementation for cross compilation
> > > > > > 2. dynamic library
> > > > > >     - select a set of implementations to include in the library and the associated dispatch code
> > > > > > 3. test
> > > > > >     - test all the implementations that can run on the host
> > > > > > 
> > > > > > I agree with both of you that this is getting too big for this patch.
> > > > > > Let me do my homework and come back with the prerequisite patches.
> > > > > Let me think about it, I'll get back with the required building blocks.
> > > > We don't need to worry about #2 for now.
> > > > 
> > > > For cross-compilation in case of #1, if we set up the build vars appropriately, I believe it should not be a problem.
> > > > 
> > > > #3 is currently a problem because of the way the `add_entrypoint_object` rule works: it expects the entrypoint name and the target name to be the same.  We can add an optional argument to the rule which explicitly provides the entrypoint name. This will allow us to add multiple targets for the same entrypoint. We can setup our targets as follows:
> > > > 
> > > > ```
> > > > src/string/
> > > >     - x86_64 # x86_64 specific directory.
> > > >           - CMakeLists.txt # Lists the targets for the various
> > > >                            # x86_64 flavors which all use the
> > > >                            # single memcpy.cpp source file
> > > >     - CMakeLists.txt # Lists the target for the release version
> > > >                      # of memcpy built using the best compile
> > > >                      # for the target machine.
> > > >     - memcpy.cpp # The actual platform independent memcpy
> > > >                  # implementation.
> > > > ```
> > > > 
> > > > The same structure can be followed for tests as well. WDYT?
> > > @sivachandra SGTM
> > > Can you try to amend `add_entrypoint_object` in a separate patch?
> > > 
> > > Separately, I came up with a CMake function to test cpu flags on linux/mac this will help for `3` and `2`.
> > > 
> > > `x86` and `x86_64`will have the same implementation any suggestion on how to name the directory? `x86_32_64` ?
> > Done now: https://reviews.llvm.org/D74948
> > 
> > Conventionally, pieces common to x86_64 and other x86 flavors are all put together in a directory named `x86`. I do not have an easy to way to test on non-x86_64 machines. So, may be just start with `x86_64` for now and leave a comment that it works for other x86 flavors also? If you have a better way, feel free to propose.
> Does `__attribute__((target("avx")))` meet the needs?
> 
> As to ifunc, it needs non-trivial work in the rtld. Even in a -static (but not -static-pie) context, there will be R_X86_64_IRELATIVE relocations and crt should resolve them.
> Does `__attribute__((target("avx")))` meet the needs?

Kind of, but
 - It's brittle, I've been bitten a few times passing features with typo and the compiler will happily compile without a warning.
 - It won't work with MSVC

For these reasons I'd rather not use it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74397/new/

https://reviews.llvm.org/D74397