[libc-commits] [PATCH] D114637: [libc] Optimized version of memmove

Andre Vieira via Phabricator via libc-commits libc-commits at lists.llvm.org
Fri Dec 17 02:26:35 PST 2021

avieira added a comment.

Hi @gchatelet,

I'm working on an aarch64-optimised version and came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop: it would load a _64 element in reverse (last 32 bytes first). I believe this was a side effect of how it was stacking the loads and stores in opposite order, like:
Load (src)
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (src + 32)
Store (src + 16)

I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores, both running from 0 to ElementCount, solved it and gave me a speed-up on larger memmoves.
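
A minimal sketch of that loads-then-stores shape, for illustration only. The names Chunk8 and RepeatedSketch are hypothetical and are not the actual elements framework in the patch; the sketch only shows the ordering (all loads ascending, then all stores ascending). Whether a given compiler keeps that order in the generated code is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical 8-byte element, standing in for a real element type.
struct Chunk8 {
  static constexpr std::size_t kSize = 8;
  struct Value {
    unsigned char bytes[kSize];
  };
  static Value Load(const char *src) {
    Value v;
    std::memcpy(v.bytes, src, kSize);
    return v;
  }
  static void Store(char *dst, const Value &v) {
    std::memcpy(dst, v.bytes, kSize);
  }
};

// Hypothetical Repeated-style wrapper: ElementCount copies of Element.
template <typename Element, std::size_t ElementCount>
struct RepeatedSketch {
  static constexpr std::size_t kSize = Element::kSize * ElementCount;

  static void Move(char *dst, const char *src) {
    typename Element::Value values[ElementCount];
    // First loop: every load, in ascending address order.
    for (std::size_t i = 0; i < ElementCount; ++i)
      values[i] = Element::Load(src + i * Element::kSize);
    // Second loop: every store, also in ascending address order.
    // All loads are done by now, so overlapping ranges are safe.
    for (std::size_t i = 0; i < ElementCount; ++i)
      Element::Store(dst + i * Element::kSize, values[i]);
  }
};

// Compares the sketch against std::memmove on an overlapping 64-byte move.
inline bool OverlappingMoveMatchesMemmove() {
  char a[96], b[96];
  for (int i = 0; i < 96; ++i)
    a[i] = b[i] = static_cast<char>(i);
  RepeatedSketch<Chunk8, 8>::Move(a + 8, a);
  std::memmove(b + 8, b, RepeatedSketch<Chunk8, 8>::kSize);
  return std::memcmp(a, b, sizeof(a)) == 0;
}
```

Because every load completes before the first store, the same Move works for overlapping source and destination in either direction.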

One of the things I am looking at now is weaving the loads and stores, as we've seen some improvements on some of our cores using interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backwards copies I need to do the end ones first, so I'd probably only do it for the loop and use aarch64-only elements: one with a forwards-weaving _64::Move and one with a backwards-weaving one.
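
A rough sketch of what the weaved loop could look like, under assumptions: Block64, WeavedMoveForward, WeavedMoveBackward and WeavedMove are hypothetical names, and the 64-byte memcpy calls merely stand in for the LDP/STP sequences (the compiler may or may not emit those). Each block is loaded and then stored immediately, so the direction has to follow the overlap: ascending when dst is below src, descending otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>

// Hypothetical 64-byte block, standing in for a _64 element.
struct Block64 {
  unsigned char bytes[64];
};

inline Block64 LoadBlock(const char *src) {
  Block64 v;
  std::memcpy(v.bytes, src, sizeof(v.bytes));
  return v;
}

inline void StoreBlock(char *dst, const Block64 &v) {
  std::memcpy(dst, v.bytes, sizeof(v.bytes));
}

// Load block i, store block i, first block first. Safe when dst < src.
inline void WeavedMoveForward(char *dst, const char *src, std::size_t blocks) {
  for (std::size_t i = 0; i < blocks; ++i) {
    const Block64 v = LoadBlock(src + i * 64);
    StoreBlock(dst + i * 64, v);
  }
}

// Same weaving, last block first. Safe when dst > src.
inline void WeavedMoveBackward(char *dst, const char *src, std::size_t blocks) {
  for (std::size_t i = blocks; i-- > 0;) {
    const Block64 v = LoadBlock(src + i * 64);
    StoreBlock(dst + i * 64, v);
  }
}

// Picks the direction from the relative position of dst and src.
inline void WeavedMove(char *dst, const char *src, std::size_t blocks) {
  if (std::less<const char *>{}(dst, src))
    WeavedMoveForward(dst, src, blocks);
  else
    WeavedMoveBackward(dst, src, blocks);
}

// Compares the sketch against std::memmove for a 128-byte overlapping move
// shifted by `shift` bytes (positive or negative, magnitude at most 64).
inline bool WeavedMoveMatchesMemmove(std::ptrdiff_t shift) {
  char a[256], b[256];
  for (int i = 0; i < 256; ++i)
    a[i] = b[i] = static_cast<char>(i);
  WeavedMove(a + 64 + shift, a + 64, 2);
  std::memmove(b + 64 + shift, b + 64, 128);
  return std::memcmp(a, b, sizeof(a)) == 0;
}
```

The sketch covers only whole 64-byte blocks in the loop; head and tail handling is left out.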

  rG LLVM Github Monorepo


