[libc-commits] [PATCH] D114637: [libc] Optimized version of memmove
Andre Vieira via Phabricator via libc-commits
libc-commits at lists.llvm.org
Mon Feb 7 08:58:51 PST 2022
avieira added a comment.
In D114637#3231652 <https://reviews.llvm.org/D114637#3231652>, @gchatelet wrote:
> In D114637#3199471 <https://reviews.llvm.org/D114637#3199471>, @avieira wrote:
>> Hi @gchatelet,
>> I'm working on an aarch64 optimised version and I came across something that might be of use to you too. I found that the Repeated implementation of Move was yielding sub-optimal code in the large loop, it would load a _64 element in reverse (last 32-bytes first), I believe this was a side-effect of how it was stacking the loads and stores in opposite order like:
>> Load (src)
>> Load (src + 8)
>> Load (src + 16)
>> Load (src + 32)
>> Store (src + 32)
>> Store (src + 16)
> Do you have an idea of why this is yielding suboptimal results?
> In the code I generated for x86-64, using this pyramid shape offset pattern reduced the number of instructions (the compiler could outline the last store across different functions).
> I'm not sure this translated into better performance though, only slightly smaller function size.
>> I found that changing the implementation of the Repeated Move to a for-loop of loads followed by a for-loop of stores from 0 to ElementCount solved it and gave me a speed up on larger memmoves.
> Could you share the resulting asm?
Sorry, I hadn't seen this earlier; the notification must have fallen through the cracks. I'll share here what I already shared with you directly. I won't share the full memmove function, as that is a lot of code, but basically the difference in codegen before and after your change is that in the forward and backward loops the stores go from:
40fb14: ad011da6 stp q6, q7, [x13, #32]
40fb18: ad0015a4 stp q4, q5, [x13]
to:
40fb14: ad0015a4 stp q4, q5, [x13]
40fb18: ad011da6 stp q6, q7, [x13, #32]
And the latter is preferred on AArch64.
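For illustration, here is a minimal sketch of the "for-loop of loads followed by a for-loop of stores from 0 to ElementCount" ordering discussed above. The names `Chunk` and `move_repeated` and the 16-byte chunk size are assumptions made for this example; this is not the actual LLVM libc implementation, and whether a compiler emits the ascending `stp` sequence for it depends on the target and optimization level.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a fixed-size element (16 bytes is assumed here);
// not the type used in LLVM libc.
struct Chunk {
  static constexpr std::size_t kSize = 16;
  unsigned char bytes[kSize];
};

// Hypothetical helper: moves Count consecutive chunks from src to dst.
// Every load completes before the first store is issued, so overlapping
// source and destination ranges are handled correctly.
template <std::size_t Count>
void move_repeated(unsigned char *dst, const unsigned char *src) {
  Chunk tmp[Count];
  // Loads, in ascending offset order.
  for (std::size_t i = 0; i < Count; ++i)
    std::memcpy(&tmp[i], src + i * Chunk::kSize, Chunk::kSize);
  // Stores, also in ascending offset order (0 to Count), rather than
  // unwinding in reverse as a nested "pyramid" of loads and stores would.
  for (std::size_t i = 0; i < Count; ++i)
    std::memcpy(dst + i * Chunk::kSize, &tmp[i], Chunk::kSize);
}
```

With `Count = 4` this corresponds to one 64-byte block, matching the two `stp` stores of 32 bytes each shown above.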