<div dir="ltr">According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual (<a href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf">https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf</a>), section 11.16.3.2:<div><br></div><div>On Intel microarchitecture code name Ivy Bridge, a REP MOVSB implementation of memcpy can achieve throughput at slightly better than the 128-bit SIMD implementation when copying thousands of bytes. <br></div><div><br></div><div>So this implementation should be profitable, at least for copies of thousands of bytes.</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature">Thanks,<br>--Serge<br></div></div>
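<div>For readers unfamiliar with the pattern, a REP MOVSB copy of the kind the manual describes can be sketched as below. This is only an illustration, not the code emitted by the patch under review: it assumes GCC/Clang extended inline assembly on x86-64, the function name copy_rep_movsb is hypothetical, and other targets fall back to a plain byte loop so that the sketch still builds.</div>
<pre>
```c
/* Hypothetical helper: copies n bytes from src to dst; regions must not overlap.
   The count is a 64-bit integer so that it fills RCX on x86-64. */
void *copy_rep_movsb(void *dst, const void *src, unsigned long long n) {
    void *ret = dst;
#if defined(__x86_64__) && defined(__GNUC__)
    /* RDI = destination, RSI = source, RCX = byte count.
       All three registers are updated by the instruction, hence "+". */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback (assumption: only used where REP MOVSB is unavailable). */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
#endif
    return ret;
}
```
</pre>
<div>Whether this beats a libcall depends on the microarchitecture and the copy size, which is what the benchmarks quoted below are probing.</div>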

<br><div class="gmail_quote">2017-07-25 2:59 GMT+07:00 Rafael Avila de Espindola via llvm-commits <span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">A crazy idea: can we make the calls cheaper?<br>

<br>

GCC has a -fno-plt option. Could we have that and default to it for these<br>

functions?<br>

<br>

Cheers,<br>

Rafael<br>

<div class="HOEnZb"><div class="h5"><br>

Chandler Carruth via llvm-commits <<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a>> writes:<br>

<br>

> Benchmarks so far aren't looking good.<br>

><br>

> It's frustrating because as you say w/o profile information calling the<br>

> library functions is a pretty good bet. Unfortunately, because we<br>

> aggressively canonicalize loops into memcpy (and memset) we have a problem<br>

> when the loop was actually the optimal emitted code.<br>

><br>

> I'm going to play with some options and see if there is a code sequence<br>

> which is small enough to inline and has better performance tradeoffs than<br>

> either the libcall or rep+movs pattern.<br>

><br>

><br>

> FWIW, some basic facts that my benchmark has already uncovered:<br>

> - Even as late as sandybridge, the performance of rep+movsb (emphasis on<br>

> *b* there) is pretty terrible. rep+movsq is more plausible<br>

> - Haswell sees rep+movsb slightly faster than rep+movsq, but not by much<br>

> - In what I've tried so far, all of these are slower than a libcall on both<br>

> sandybridge and haswell, on everything but long (over 4k) sequences.<br>

> - rep+movsq tends to be the fastest over 4k on both sandybridge and haswell<br>

> - the only thing I've tried so far that speeds up the benchmarks that are<br>

> particularly impacted by this is an 8-byte loop... I didn't expect this to<br>

> be faster than rep+movsq, but there you go....<br>

>   - it's worth noting that for at least some benchmarks, this is<br>

> significant. the one I'm working on takes a perf hit of between 5% and 50%,<br>

> depending on dataset, with the memset libcall compared to an 8-byte loop.<br>

><br>

> Still lots more measurements to do before any definite conclusions. I<br>

> remain somewhat concerned about injecting PLT-based libcalls into so many<br>

> places. LLVM is generating a *lot* of these.<br>

><br>

> On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <<br>

> <a href="mailto:reviews@reviews.llvm.org">reviews@reviews.llvm.org</a>> wrote:<br>

><br>

>> davidxl added a comment.<br>

>><br>

>> Do you have more benchmark numbers? For reference, here is what GCC does (for<br>

>> sandybridge and above) for memcpy when size profile data is available:<br>

>><br>

>> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code.<br>

>> 2. when the size is between 24 and 128, use rep movsq<br>

>> 3. when the size is above that, use a libcall<br>

>><br>

>> It is an interesting idea to consider PLT overhead here, but is there a<br>

>> better way to model the cost?<br>

>><br>

>> I worry that without profile data, blindly using rep movsb may be<br>

>> problematic. Teresa has a pending patch to make use of value profile<br>

>> information.  Without profile, if size matters, perhaps we can guard the<br>

>> expansion sequence with size checks.<br>

>><br>

>> Also if the root cause<br>

>><br>

>><br>

>> <a href="https://reviews.llvm.org/D35750" rel="noreferrer" target="_blank">https://reviews.llvm.org/<wbr>D35750</a><br>

>><br>

>><br>

>><br>

>><br>

</div></div><div class="HOEnZb"><div class="h5">> ______________________________<wbr>_________________<br>

> llvm-commits mailing list<br>

> <a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-commits</a><br>


</div></div></blockquote></div><br></div>
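<div>The size-based strategy that davidxl describes for GCC in the quoted review comment can be read as a three-way dispatch on the copy size. A minimal sketch follows. It assumes that a size of exactly 128 bytes still falls in the rep movsq bucket, which the thread does not state, and the enum and function names are hypothetical rather than taken from GCC or LLVM.</div>
<pre>
```c
/* Strategy buckets as listed in the thread (sandybridge and above,
   size profile data available). Names are illustrative only. */
enum copy_strategy {
    COPY_INLINE_LOOP, /* 8-byte copy loop or straight-line code */
    COPY_REP_MOVSQ,   /* rep movsq */
    COPY_LIBCALL      /* call the library memcpy */
};

/* Maps a copy size in bytes to one of the three buckets.
   Assumption: the upper bound of the middle bucket (128) is inclusive. */
enum copy_strategy pick_copy_strategy(unsigned long size) {
    if (size <= 24)
        return COPY_INLINE_LOOP;
    if (size <= 128)
        return COPY_REP_MOVSQ;
    return COPY_LIBCALL;
}
```
</pre>
<div>Without profile data, an expansion guarded by size checks, as suggested in the same comment, could branch on comparable thresholds at run time instead of picking a bucket at compile time.</div>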