[libc-commits] [PATCH] D114637: [libc] Optimized version of memmove

Mon Jan 10 07:53:27 PST 2022

gchatelet added a comment.

In D114637#3199471 <https://reviews.llvm.org/D114637#3199471>, @avieira wrote:

> Hi @gchatelet,
>
> I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it was stacking the loads and stores in opposite order, like:
> Load (src)
> Load (src + 8)
> Load (src + 16)
> Load (src + 32)
> Store (src + 32)
> Store (src + 16)
> ...

Do you have an idea of why this is yielding suboptimal results?
In the code I generated for x86-64, using this pyramid-shaped offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
I'm not sure this translated into better performance, though; it only made the functions slightly smaller.

> I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.

Could you share the resulting asm?
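
To make the two orderings concrete, here is a minimal sketch, not the code from the patch. `Chunk16`, `move_pyramid` and `move_flat` are hypothetical names, the 16-byte element size is an assumption, and it needs C++17 for `if constexpr`. The recursive version loads in ascending order and stores in descending order (the pyramid shape); the flat version does all loads and then all stores, both from 0 to `Count`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for one fixed-size element (for example a 16-byte
// vector register). Not the element type used in the patch.
struct Chunk16 {
  static constexpr std::size_t kSize = 16;
  unsigned char bytes[kSize];

  static Chunk16 load(const unsigned char *src) {
    Chunk16 c;
    std::memcpy(c.bytes, src, kSize);
    return c;
  }
  void store(unsigned char *dst) const { std::memcpy(dst, bytes, kSize); }
};

// Pyramid ordering: each level loads its element, recurses, then stores.
// For Count == 4 this gives loads at 0, 16, 32, 48 followed by stores at
// 48, 32, 16, 0.
template <std::size_t Count>
void move_pyramid(unsigned char *dst, const unsigned char *src) {
  if constexpr (Count > 0) {
    const Chunk16 value = Chunk16::load(src);
    move_pyramid<Count - 1>(dst + Chunk16::kSize, src + Chunk16::kSize);
    value.store(dst);
  }
}

// Flat ordering: a loop of loads followed by a loop of stores, both ascending.
template <std::size_t Count>
void move_flat(unsigned char *dst, const unsigned char *src) {
  static_assert(Count > 0, "need at least one element");
  assert(dst != nullptr && src != nullptr);
  Chunk16 values[Count];
  for (std::size_t i = 0; i < Count; ++i)
    values[i] = Chunk16::load(src + i * Chunk16::kSize);
  for (std::size_t i = 0; i < Count; ++i)
    values[i].store(dst + i * Chunk16::kSize);
}
```

Both variants finish every load before the first store, so either is safe for overlapping buffers; they differ only in the order the stores are issued, which is what the compiler sees when scheduling the loop body.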

================
Comment at: libc/test/src/string/memmove_test.cpp:53
+      const size_t size = expected.size();
+      Display(
+          size,
----------------
sivachandra wrote:
> For things like this, you should use a matcher. Look at this for example: https://github.com/llvm/llvm-project/blob/main/libc/utils/UnitTest/FPMatcher.h#L26.
> 
> You can probably implement a matcher which can be used like this: `EXPECT_MEM_EQ(mem1, mem2, size)`
I've been lazy :D Thx for pointing this out.
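
A rough, standalone sketch of what such a matcher could look like. It does not use the libc UnitTest `Matcher` base class; `MemEqMatcher`, `expect_mem_eq` and this definition of `EXPECT_MEM_EQ` are hypothetical and only illustrate the match-then-explain shape, assuming the failure report goes to stderr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical byte-wise matcher: remembers the first differing offset so
// the failure message can point at it.
class MemEqMatcher {
  const unsigned char *expected_;
  const unsigned char *actual_ = nullptr;
  std::size_t size_;
  std::size_t mismatch_ = 0; // meaningful only after match() returned false

public:
  MemEqMatcher(const void *expected, std::size_t size)
      : expected_(static_cast<const unsigned char *>(expected)), size_(size) {
    assert(expected != nullptr || size == 0);
  }

  // Returns true when the first size_ bytes of actual equal the expectation.
  bool match(const void *actual) {
    actual_ = static_cast<const unsigned char *>(actual);
    for (std::size_t i = 0; i < size_; ++i) {
      if (actual_[i] != expected_[i]) {
        mismatch_ = i;
        return false;
      }
    }
    return true;
  }

  std::size_t mismatch_offset() const { return mismatch_; }

  // Prints the first mismatch; call only after match() returned false.
  void explain_error(std::FILE *out) const {
    std::fprintf(out, "memory differs at offset %zu: expected 0x%02x, got 0x%02x\n",
                 mismatch_, static_cast<unsigned>(expected_[mismatch_]),
                 static_cast<unsigned>(actual_[mismatch_]));
  }
};

// Hypothetical helper behind the macro: reports on failure, returns the result.
inline bool expect_mem_eq(const void *mem1, const void *mem2, std::size_t size) {
  MemEqMatcher matcher(mem1, size);
  if (matcher.match(mem2))
    return true;
  matcher.explain_error(stderr);
  return false;
}

#define EXPECT_MEM_EQ(mem1, mem2, size) expect_mem_eq((mem1), (mem2), (size))
```

In the real test framework the macro would presumably hand the matcher to the existing expectation machinery instead of printing directly.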

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114637/new/

https://reviews.llvm.org/D114637