[libc-commits] [PATCH] D92236: [LIBC] Add optimized memcpy routine for AArch64

Fri Jan 29 08:17:32 PST 2021

gchatelet accepted this revision.
gchatelet added a comment.
This revision is now accepted and ready to land.

Hey Andre, my apologies for the lag. Answers inlined

In D92236#2509136 <https://reviews.llvm.org/D92236#2509136>, @avieira wrote:

> So here is an updated version for an optimized memcpy routine for AArch64. This one basically uses the same as the default memcpy, but picks a different block size and alignment for copies > 128.

👍

> I also disable tail merging as I found it was leading to worse code. This new memcpy seems to show improvements accross the board for both sweep and distribution benchmarks.

Interesting! I never played with it but it might be interesting for all implementations as well. Thx for the feedback, really appreciate it.

> I am continuing to investigate a better organization of the copies smaller than 128bytes, as I had before, using the new benchmarks. Using the same code I had before I am seeing an improvement in Uniform1024 (new uniform distribution I added for sizes 0-1024), I also see an improvement in Memcpy Distributions A, M, Q and U, but a regression in B, L, S and W. For distribution D the optimized version beats the older version but shows a regression compared to the version in this patch.

Yeah the letters distributions are likely to swing since they refer to individual applications that prefer certain sizes. I would like to have an even bigger bucket (1K to 4K?) but then some processors may run out of `L1` and hitting `L2` will bias the benchmark...
Which processors did you benchmark it on? How likely it is to be representative of all aarch64 cpus? Can you leave this information in the implementation file for the record?

> I'll spend a few extra cycles trying to see if I can find a sweet spot, but I might leave it like this.

Fine by me.

> Is this OK for main?

Definitely looks good to me!

> Also I have two patches downstream for:
>
> 1. Uniform1024 distribution, an uniform distribution for sizes 0-1024
> 2. Options to define a Sweep 'min size' and 'step'.
>
> Let me know if you are interested in either of these.

Yes why not.
For 2) I believe you had to store more information in the json file and adapt the python script as well right?

================
Comment at: libc/src/string/aarch64/memcpy.cpp:34
+//   with little change on the code side.
+static void memcpy_aarch64(char *__restrict dst, const char *__restrict src,
+                           size_t count) {
----------------
Can you add a quick note on which processor(s) the implementation has been tuned and how it is representative of other aarch64 cpus? (If you happen to know)

================
Comment at: libc/src/string/aarch64/memcpy.cpp:56
+    return CopyBlockOverlap<64>(dst, src, count);
+  return CopyAlignedBlocks<64,16>(dst, src, count);
+}
----------------
Can you format the file? There's a space missing right after the coma in the template instantiation.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D92236/new/

https://reviews.llvm.org/D92236