[libc-commits] [PATCH] D124958: [libc] Align functions to 64B to lower benchmarking variance
Guillaume Chatelet via Phabricator via libc-commits
libc-commits at lists.llvm.org
Thu May 5 01:59:47 PDT 2022
gchatelet added a comment.
In D124958#3492446 <https://reviews.llvm.org/D124958#3492446>, @sivachandra wrote:
> I have a few questions about this patch:
> 1. Do you want all libc functions to be aligned to 64 bytes, or all public functions to be aligned to 64 bytes? Or, do you just want the memory functions to be aligned to 64 bytes?
This was my original answer:
> We've measured a significant (+30%) swing in performance in microbenchmarks for several x86 microarchitectures (Intel Haswell, Intel Skylake, AMD Rome) and for a variety of memory functions: read-only functions (e.g. `memcmp`, `bcmp`), write-only functions (e.g. `memset`, `bzero`) and read/write functions (e.g. `memmove`, `memcpy`).
> So I'd be inclined to think that at least all functions touching memory are subject to this swing for x86.
> Now, I don't see a specific link between alignment of the code and performance of read/write operations so I'd be tempted to conclude that this behaviour is generalizable to all functions.
> Do you want me to gather evidence of this behaviour for other functions before moving forward with this patch?
> Also, maybe we can lower the alignment requirement to 32B; the default is 16B.
But taking a step back, the swing seems to happen mostly for distributions that exercise large sizes (namely `uniform 384 to 4096`).
This corresponds to code running in a loop and using vector instructions.
On x86 these instructions take many bytes to encode and the CPU's frontend can only decode up to 16B per cycle (32B for Rome AFAIR).
Usually the decoded instructions are cached to prevent tight loops from being frontend bound, but we know that under certain circumstances the cache is evicted, which forces the instructions to be decoded again.
We also know that the caching is based on instruction addresses, so aligning the function may just, by chance, help with this.
I'd need to do more tests to check this assumption though.
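To make the address dependence concrete, here is a minimal sketch that counts how many aligned blocks a region of code touches depending on where it starts. The helper name `windows_spanned` and the window sizes are assumptions for illustration only; the sketch does not model any real decoder or decoded-instruction cache, and it is not a measurement.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of `window`-byte aligned blocks touched by a
// code region of `size` bytes starting at address `start`.
// `window` must be non-zero; an empty region touches no block.
constexpr std::size_t windows_spanned(std::uintptr_t start, std::size_t size,
                                      std::size_t window) {
  return size == 0 ? 0
                   : static_cast<std::size_t>((start + size - 1) / window -
                                              start / window + 1);
}

// A 64-byte loop body that starts on a 32-byte boundary touches 2 blocks.
static_assert(windows_spanned(0, 64, 32) == 2, "aligned start");
// The same body shifted by 16 bytes touches 3 blocks.
static_assert(windows_spanned(16, 64, 32) == 3, "misaligned start");
```

If the cache really is indexed by aligned address blocks, a misaligned start means the same loop occupies more entries, which is one plausible way alignment could change the results.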
> 2. What effect should I see with and without this patch? For example, with `memcpy`, I notice that it gets an address of 0 with or without this patch.
You cannot see the effect of this patch by looking at the asm in the generated `.o` or `.a`; the effect is only visible once the code is linked into the final binary.
You can witness it by using objdump though. The `-h` option dumps the section headers, where you can see the required alignment `2**6` (64). Without this patch the alignment is `2**4` (16).
  % objdump -h ~/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o
  /redacted/llvm-project_rel_compiled-with-clang/projects/libc/src/string/CMakeFiles/libc.src.string.memcpy_opt_host.__internal__.dir/memcpy.cpp.o:     file format elf64-x86-64
  Idx Name          Size      VMA               LMA               File off  Algn
    0 .text         00000000  0000000000000000  0000000000000000  00000040  2**2
                    CONTENTS, ALLOC, LOAD, READONLY, CODE
    1 .text._ZN11__llvm_libc6memcpyEPvPKvm 00000183  0000000000000000  0000000000000000  00000040  2**6
                    CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
    2 .rodata._ZN11__llvm_libc6memcpyEPvPKvm 00000014  0000000000000000  0000000000000000  000001c4  2**2
                    CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
    3 .comment      00000067  0000000000000000  0000000000000000  000001d8  2**0
    4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  0000023f  2**0
    5 .llvm_addrsig 00000000  0000000000000000  0000000000000000  00000348  2**0
                    CONTENTS, READONLY, EXCLUDE
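The alignment can also be checked from inside a linked program by looking at the function's address. The sketch below assumes GCC or Clang on x86 and uses the `aligned` function attribute. It is only an illustration: `copy_bytes` and `entry_is_aligned` are hypothetical names, and this is not necessarily how the patch requests the alignment.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a memory function. The attribute asks GCC/Clang
// to place the entry point on a 64-byte boundary.
__attribute__((aligned(64))) void *copy_bytes(void *dst, const void *src,
                                              std::size_t n) {
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  for (std::size_t i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;
}

// Returns true if the entry point of `fn` is a multiple of `alignment`.
// Casting a function pointer to an integer is only conditionally supported
// by the C++ standard, but GCC and Clang accept it on x86.
bool entry_is_aligned(void *(*fn)(void *, const void *, std::size_t),
                      std::uintptr_t alignment) {
  return reinterpret_cast<std::uintptr_t>(fn) % alignment == 0;
}
```

Calling `entry_is_aligned(&copy_bytes, 64)` in the final binary should then return true, which matches the `2**6` shown in the `Algn` column above.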