[libc-commits] [PATCH] D114637: [libc] Optimized version of memmove
Andre Vieira via Phabricator via libc-commits
libc-commits at lists.llvm.org
Fri Dec 17 02:26:35 PST 2021
I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop, it would load a _64 element in reverse (last 32-bytes first), I believe this was a side-effect of how it was stacking the loads and stores in opposite order like:
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (dst + 32)
Store (dst + 16)
Store (dst + 8)
I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores, each running from 0 to ElementCount, solved this and gave me a speed-up on larger memmoves.
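A minimal sketch of that shape (hypothetical names, not the actual libc template): issue all loads into temporaries in ascending order first, then all stores in ascending order, so the compiler is steered towards forward sequential accesses instead of stacking the stores in reverse.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical illustration of a Repeated-style Move: kCount elements of
// kElementSize bytes each, copied as one loop of loads followed by one loop
// of stores, both walking from 0 to kCount - 1.
template <size_t kElementSize, size_t kCount>
void move_loads_then_stores(char *dst, const char *src) {
  char tmp[kElementSize * kCount];
  // First loop: all loads, in ascending order.
  for (size_t i = 0; i < kCount; ++i)
    std::memcpy(tmp + i * kElementSize, src + i * kElementSize, kElementSize);
  // Second loop: all stores, also in ascending order.
  for (size_t i = 0; i < kCount; ++i)
    std::memcpy(dst + i * kElementSize, tmp + i * kElementSize, kElementSize);
}
```

Because every source byte is read into `tmp` before any store happens, the element copy stays correct even for overlapping buffers.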
One of the things I am looking at now is weaving the loads and stores, as we've seen improvements on some of our cores when using interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backwards copies I need to do the end ones first, so I'd probably only do it for the loop and use an aarch64-only element with a forwards-weaving _64::Move and a backwards one.