<div dir="ltr"><div dir="ltr">> Rather than tune for individual microarchitecture variations, we would</div>> prefer to leverage on fast string operations provided by the ISA (for<br>> example “rep movsb” on x86). This allows us to leverage wider data<br>> paths over time, without having to build custom dispatch logic which<br>> carries its own overheads.<div dir="ltr"><br></div><div>So you are going to write memcpy in C++ loops and assume LLVM will compile that to "rep movsb" unconditionally for all x86 architectures? Is this what LLVM does today with such loops, or is LLVM expected to change its codegen to support this?</div><div><br></div><div>What performance loss do you consider acceptable for large data copies by virtue of disallowing custom dispatch logic?</div><div><br></div><div>Thanks,</div><div><br></div><div>Jeff</div><div dir="ltr"><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 10, 2020 at 9:57 AM Chris Kennelly via libc-dev <<a href="mailto:libc-dev@lists.llvm.org">libc-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">We are planning on landing implementations for commonly used mem<br>
functions (memcpy, memset, memcmp, and bcmp).<br>
<a href="https://reviews.llvm.org/D72516" rel="noreferrer" target="_blank">https://reviews.llvm.org/D72516</a> by Guillaume adds several initial<br>
benchmarks for these functions.<br>
<br>
Design Philosophy<br>
<br>
These functions account for a significant share of compute in typical<br>
programs. In “Profiling a warehouse-scale computer” [S. Kanev et<br>
al., ISCA ’15], the authors observed ~5% of processor cycles being<br>
spent in these functions, making their performance critical. This has<br>
influenced our design in several ways:<br>
<br>
* C++ implementation: We developed our implementations in C++. This<br>
makes understanding and reasoning about the correctness of the<br>
implementation far easier than with handcrafted assembly. By leaving<br>
codegen to the compiler, we can leverage existing compiler features<br>
(scheduling, FDO, etc.).<br>
<br>
* Small Size Distributions: Using a profiler to observe size<br>
distributions for calls into libc functions, we found most operations<br>
are small. This has made it critical to optimize short operation<br>
latency, while preserving large operation throughput.<br>
<br>
For `memcpy`, 96% of sizes are <= 128 bytes and 99% are <= 1024 bytes.<br>
For `memset`, 91% of sizes are <= 128 bytes and 99.9% are <= 1024 bytes.<br>
For `memcmp`, 99.5% of sizes are <= 128 bytes and ~100% are <= 1024 bytes.<br>
<br>
In the rare cases where a workload deviated from these<br>
distributions, the cause turned out to be an efficiency bug, such as<br>
a large data structure being copied unnecessarily.<br>
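<br>
To illustrate why a small-size-first design fits these distributions,<br>
here is a minimal sketch, not the code under review. It assumes GCC or<br>
Clang (for `__builtin_memcpy`) and non-overlapping buffers; the names<br>
`CopyBlock`, `CopyOverlap` and `SketchMemcpy` are hypothetical. Sizes up<br>
to 128 bytes are handled with at most two fixed-size block copies that<br>
may overlap in the middle, so the common case needs no loop:<br>

```cpp
#include <cassert>
#include <cstddef>

// Copy exactly N bytes. With a compile-time constant size, compilers
// typically lower this builtin to a few plain loads and stores.
template <std::size_t N>
inline void CopyBlock(char* dst, const char* src) {
  __builtin_memcpy(dst, src, N);
}

// Copy `count` bytes, where N <= count <= 2 * N, as two N-byte blocks:
// one anchored at the start and one anchored at the end. The blocks
// overlap whenever count < 2 * N, which is harmless for a copy.
template <std::size_t N>
inline void CopyOverlap(char* dst, const char* src, std::size_t count) {
  assert(count >= N && count <= 2 * N);
  CopyBlock<N>(dst, src);
  CopyBlock<N>(dst + count - N, src + count - N);
}

// Hypothetical memcpy sketch: small sizes are checked first.
inline void* SketchMemcpy(void* dst_v, const void* src_v, std::size_t count) {
  char* dst = static_cast<char*>(dst_v);
  const char* src = static_cast<const char*>(src_v);
  if (count == 0) return dst_v;
  if (count == 1) { CopyBlock<1>(dst, src); return dst_v; }
  if (count <= 4) { CopyOverlap<2>(dst, src, count); return dst_v; }
  if (count <= 8) { CopyOverlap<4>(dst, src, count); return dst_v; }
  if (count <= 16) { CopyOverlap<8>(dst, src, count); return dst_v; }
  if (count <= 32) { CopyOverlap<16>(dst, src, count); return dst_v; }
  if (count <= 64) { CopyOverlap<32>(dst, src, count); return dst_v; }
  if (count <= 128) { CopyOverlap<64>(dst, src, count); return dst_v; }
  // Large sizes: whole 64-byte blocks, then one final block anchored at
  // the end that may overlap the last block written by the loop.
  std::size_t offset = 0;
  for (; offset + 64 < count; offset += 64) {
    CopyBlock<64>(dst + offset, src + offset);
  }
  CopyBlock<64>(dst + count - 64, src + count - 64);
  return dst_v;
}
```

The thresholds above are illustrative only; real cut-over points would<br>
have to come from measurement.<br>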
<br>
* Small Code Footprint: Our implementations consist of concise<br>
patterns for working with chunks of data. While further<br>
specialization can produce better results on microbenchmarks, we did<br>
not see these wins materialize on macrobenchmarks measuring<br>
application productivity.<br>
<br>
* Avoiding Runtime Dispatch: Our implementation leverages the<br>
compiler to choose appropriately wide instructions available<br>
statically at compile time, but does not use runtime dispatch to<br>
switch between implementations (say 128-/256-/512-bit vector widths).<br>
<br>
In our experience, the overhead of dispatching between implementations<br>
(via branches or the PLT) outweighed the performance benefits, for a<br>
net loss of ~0.5-1% on macrobenchmarks.<br>
<br>
Rather than tune for individual microarchitecture variations, we would<br>
prefer to rely on fast string operations provided by the ISA (for<br>
example “rep movsb” on x86). This lets us benefit from wider data<br>
paths over time, without having to build custom dispatch logic, which<br>
carries its own overheads.<br>
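<br>
The “rep movsb” approach mentioned above could look roughly like the<br>
following on x86-64 with GCC or Clang inline assembly. This is a minimal<br>
sketch, not the implementation under review: `RepMovsbCopy` is a<br>
hypothetical name, the byte-loop fallback exists only so the example<br>
builds on other targets, and whether `rep movsb` is actually fast<br>
depends on the processor.<br>

```cpp
#include <cstddef>

// Copy `count` bytes from `src` to `dst` (non-overlapping buffers).
inline void* RepMovsbCopy(void* dst, const void* src, std::size_t count) {
  void* const result = dst;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
  // `rep movsb` copies RCX bytes from [RSI] to [RDI] and updates all
  // three registers, so they are declared as read-write operands.
  // The "memory" clobber tells the compiler that memory was modified.
  asm volatile("rep movsb"
               : "+D"(dst), "+S"(src), "+c"(count)
               :
               : "memory");
#else
  // Portable fallback for targets without this instruction.
  unsigned char* d = static_cast<unsigned char*>(dst);
  const unsigned char* s = static_cast<const unsigned char*>(src);
  for (std::size_t i = 0; i < count; ++i) d[i] = s[i];
#endif
  return result;
}
```

The point of the sketch is that one fixed instruction sequence serves<br>
every processor generation, with no per-microarchitecture branch.<br>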
<br>
* Compiler/Library Codesign Opportunities: LLVM can lower calls of the<br>
form `memcmp() == 0` to a call to `bcmp` (<a href="https://reviews.llvm.org/D56593" rel="noreferrer" target="_blank">https://reviews.llvm.org/D56593</a>).<br>
An equality-only comparison can be optimized more aggressively, and its<br>
implementation can specialize for early mismatch detection.<br>
<br>
For hot callsites, we can inline calls, leveraging locally available<br>
information for size and alignment.<br>
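<br>
To make the `bcmp` point concrete, here is a hedged sketch contrasting<br>
the two contracts. `SketchBcmp` and `SketchMemcmp` are hypothetical<br>
names, `__builtin_memcpy` assumes GCC or Clang, and neither function is<br>
the code under review. An equality-only result lets the comparison<br>
return as soon as any 8-byte word differs, while an ordered result<br>
must locate the first differing byte to determine the sign:<br>

```cpp
#include <cstddef>
#include <cstdint>

// Equality only: returns 0 if the buffers match, nonzero otherwise.
inline int SketchBcmp(const void* lhs_v, const void* rhs_v, std::size_t count) {
  const unsigned char* lhs = static_cast<const unsigned char*>(lhs_v);
  const unsigned char* rhs = static_cast<const unsigned char*>(rhs_v);
  std::size_t offset = 0;
  for (; offset + 8 <= count; offset += 8) {
    std::uint64_t a;
    std::uint64_t b;
    __builtin_memcpy(&a, lhs + offset, 8);  // unaligned-safe 8-byte load
    __builtin_memcpy(&b, rhs + offset, 8);
    if (a != b) return 1;  // early exit: which byte differs is irrelevant
  }
  for (; offset < count; ++offset) {
    if (lhs[offset] != rhs[offset]) return 1;
  }
  return 0;
}

// Ordered: the sign of the result depends on the first differing byte.
inline int SketchMemcmp(const void* lhs_v, const void* rhs_v, std::size_t count) {
  const unsigned char* lhs = static_cast<const unsigned char*>(lhs_v);
  const unsigned char* rhs = static_cast<const unsigned char*>(rhs_v);
  for (std::size_t i = 0; i < count; ++i) {
    if (lhs[i] != rhs[i]) return lhs[i] < rhs[i] ? -1 : 1;
  }
  return 0;
}
```

A caller that only tests `== 0` gets the same answer from either<br>
function, which is what makes the lowering legal.<br>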
<br>
We see hardware support for these operations as the future. The<br>
hardware implementation of such instructions can use processor features<br>
that aren’t available through the ISA and that can vary across<br>
processor generations. Where such support is available, we plan to<br>
fully inline the implementations of these functions with a short, fixed<br>
instruction sequence that provides the maximum performance available<br>
from the hardware.<br>
<br>
Thanks,<br>
Chris Kennelly<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Jeff Hammond<br><a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a><br><a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a></div></div>