[libc-commits] [PATCH] D92236: [LIBC] Add optimized memcpy routine for AArch64

Thu Dec 17 08:42:04 PST 2020

gchatelet added inline comments.

================
Comment at: libc/src/string/aarch64/memcpy.cpp:71-103
+  // Large copy
+  // Copy 16 bytes and then align src to 16-byte alignment.
+  CopyBlock<16>(dst, src);
+
+  // Align to either source or destination depending on target.
+  // Default aligns to source, define 'ALIGN_DST' to align to destination.
+#if ALIGN_DST
----------------
avieira wrote:
> gchatelet wrote:
> > avieira wrote:
> > > gchatelet wrote:
> > > > I think the code here could be replaced with a call to `CopyAlignedBlocks<64>` from `src/string/memory_utils/memcpy_utils.h` which essentially aligns the destination pointers, copies blocks of 64 bytes and handles the trailing bytes with an overlapping copy.
> > > > 
> > > > If it's important that the alignment is 16B instead of 64B I can change the implementation but since the code is heavily tested I'd be in favor of reusing it as much as possible.
> > > > 
> > > > Note: I believe that the compiler will generate the same code whether using two `CopyBlock<32>` or a single `CopyBlock<64>`.
> > > I am looking at this one, the reason I wrote out this code was because we want to align for source, not destination. I will have a look at the implementation of CopyAlignedBlocks, maybe the answer is to further template this to allow a switch between src and dst alignment?
> > > 
> > > I am also looking at a potential codegen difference between 2* CopyBlock<32> and CopyBlock<64>, will get back to you next week on this.
> > > I am looking at this one, the reason I wrote out this code was because we want to align for source, not destination. I will have a look at the implementation of CopyAlignedBlocks, maybe the answer is to further template this to allow a switch between src and dst alignment?
> > 
> > Rereading this comment, it seems there's a mistake in our code, it should be source aligned as you suggest and not destination aligned.
> > I've sent D93457 to fix it.
> > 
> I see, I was under the impression some targets might prefer destination aligning, but I have no problems with changing it to source aligning.
> 
> I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 vs 1x64 but it seems a later clang generates the same for both.
> 
> I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that pops out is that I was using 64-byte copies in the loop and after loop, but aligning to 16-bytes as we found that to be sufficient. This means we only need to do one 16-byte unaligned load. Whereas CopyAlignedBlocks requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?
> 
> Another potential point of improvement is to allow for better interleaving of loads and stores, AOR (in assembly) does the loads outside the loop, then stores, increments and loads inside the loop, with a post-loop store. I'll consider the changes above and continue to benchmark to see if there is much difference between the two.
> I see, I was under the impression some targets might prefer destination aligning, but I have no problems with changing it to source aligning.

That might happen and we can add a template parameter to the function if it comes up.

> I have been playing around with replacing the 2x CopyBlock<32>'s in this part with CopyBlock<64>. I found that earlier compilers generated better code for 2x32 vs 1x64 but it seems a later clang generates the same for both.

👍 

> I'll benchmark the new CopyAlignedBlocks on Neoverse-N1 and come back to you. A first difference that pops out is that I was using 64-byte copies in the loop and after loop, but aligning to 16-bytes as we found that to be sufficient. This means we only need to do one 16-byte unaligned load. Whereas CopyAlignedBlocks requires the alignment and kBlockSize to be the same. Do you think there might be room to change this?

Definitely, nothing is carved in stone, we can adapt the framework as needed.

> Another potential point of improvement is to allow for better interleaving of loads and stores, AOR (in assembly) does the loads outside the loop, then stores, increments and loads inside the loop, with a post-loop store. I'll consider the changes above and continue to benchmark to see if there is much difference between the two.

Interesting! I wonder if this translates to gains on x86 as well.
I'll make tests as well on my side.

BTW the benchmarking framework has been updated in D93210. Let me know if you need help to use it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D92236/new/

https://reviews.llvm.org/D92236