[libc-commits] [PATCH] D114637: [libc] Optimized version of memmove
Andre Vieira via Phabricator via libc-commits
libc-commits at lists.llvm.org
Fri Dec 17 02:26:35 PST 2021
I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop, it would load a _64 element in reverse (last 32-bytes first), I believe this was a side-effect of how it was stacking the loads and stores in opposite order like:
Load (src + 8)
Load (src + 16)
Load (src + 32)
Store (dst + 32)
Store (dst + 16)
Store (dst + 8)
I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores, each running from 0 to ElementCount, solved this and gave me a speed-up on larger memmoves.
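A minimal sketch of that shape (hypothetical names, not the actual libc template): issue all loads into temporaries in ascending order first, then all stores in ascending order, so the compiler is steered towards forward sequential accesses instead of stacking the stores in reverse.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical illustration of a Repeated-style Move: kCount elements of
// kElementSize bytes each, copied as one loop of loads followed by one loop
// of stores, both walking from 0 to kCount - 1.
template <size_t kElementSize, size_t kCount>
void move_loads_then_stores(char *dst, const char *src) {
  char tmp[kElementSize * kCount];
  // First loop: all loads, in ascending order.
  for (size_t i = 0; i < kCount; ++i)
    std::memcpy(tmp + i * kElementSize, src + i * kElementSize, kElementSize);
  // Second loop: all stores, also in ascending order.
  for (size_t i = 0; i < kCount; ++i)
    std::memcpy(dst + i * kElementSize, tmp + i * kElementSize, kElementSize);
}
```

Because every source byte is read into `tmp` before any store happens, the element copy stays correct even for overlapping buffers.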
One of the things I am looking at now is weaving the loads and stores, as we've seen improvements on some of our cores when using interleaved series of LDP-STPs (64-byte loads and stores). Obviously that means that for backwards copies I need to do the end ones first, so I'd probably only do it for the loop and use an aarch64-only element with a forwards-weaving _64::Move and a backwards one.