[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.
Serge Pavlov via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 25 06:40:14 PDT 2017
According to the Intel® 64 and IA-32 Architectures Optimization Reference
Manual:
On Intel microarchitecture code name Ivy Bridge, a REP MOVSB implementation
of memcpy can achieve throughput slightly better than the 128-bit SIMD
implementation when copying thousands of bytes.
So this implementation should be profitable.
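For concreteness, a REP MOVSB based memcpy of the kind the manual describes might look roughly like this (a hypothetical sketch using GCC/Clang inline asm, x86-64 only; the function name is illustrative, not from the patch):

```c
#include <stddef.h>

/* Hypothetical sketch: memcpy built on REP MOVSB (x86-64, GCC/Clang
 * inline asm). RDI = destination, RSI = source, RCX = byte count. */
static void *repmovsb_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

Whether this beats a SIMD loop or a libcall depends heavily on the microarchitecture and on the size distribution, which is exactly what the benchmarks below are probing.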
2017-07-25 2:59 GMT+07:00 Rafael Avila de Espindola via llvm-commits <
llvm-commits at lists.llvm.org>:
> A crazy idea: can we make the calls cheaper?
> GCC has a -fno-plt. Could we have that and default to it for these
> Chandler Carruth via llvm-commits <llvm-commits at lists.llvm.org> writes:
> > Benchmarks so far aren't looking good.
> > It's frustrating because as you say w/o profile information calling the
> > library functions is a pretty good bet. Unfortunately, because we
> > aggressively canonicalize loops into memcpy (and memset), we have a problem
> > when the loop was actually the optimal emitted code.
> > I'm going to play with some options and see if there is a code sequence
> > which is small enough to inline and has better performance tradeoffs than
> > either the libcall or rep+movs pattern.
> > FWIW, some basic facts that my benchmark has already uncovered:
> > - Even as late as sandybridge, the performance of rep+movsb (emphasis on
> > *b* there) is pretty terrible. rep+movsq is more plausible
> > - Haswell sees rep+movsb slightly faster than rep+movsq, but not by much
> > - All of these are slower than a libcall on both sandybridge and haswell in
> > everything I've tried so far except long (over 4k) sequences.
> > - rep+movsq tends to be the fastest over 4k on both sandybridge and haswell
> > - the only thing I've tried so far that makes my particular collection of
> > benchmarks that are particularly impacted by this faster is an 8-byte
> > loop... I didn't expect this to be faster than rep+movsq, but there you
> > go.
> > - It's worth noting that for at least some benchmarks, this is
> > significant. The one I'm working on has a perf hit between 5% and 50%,
> > depending on dataset, for an 8-byte loop vs. the memset libcall.
> > Still lots more measurements to do before any definite conclusions. I
> > remain somewhat concerned about injecting PLT-based libcalls into so many
> > places. LLVM is generating a *lot* of these.
> > On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <
> > reviews at reviews.llvm.org> wrote:
> >> davidxl added a comment.
> >> Do you have more benchmark numbers? For reference, here is what GCC does
> >> (for sandybridge and above) for memcpy when size profile data is available:
> >> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
> >> 2. when the size is between 24 and 128, use rep movsq.
> >> 3. when the size is above that, use a libcall.
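The three-way size dispatch described above can be sketched roughly as follows (a hypothetical illustration of the GCC-style strategy, using the 24 and 128 thresholds from the message; the function name and the inline-asm rep movsq path are illustrative, x86-64 only):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a size-dispatched memcpy:
 * small -> simple byte loop, medium -> rep movsq plus a byte tail,
 * large -> the real libcall. */
static void *sized_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n <= 24) {                       /* 1: small: copy loop */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (n <= 128) {               /* 2: medium: rep movsq */
        size_t words = n / 8, tail = n % 8;
        void *dq = d;
        const void *sq = s;
        __asm__ volatile("rep movsq"
                         : "+D"(dq), "+S"(sq), "+c"(words)
                         :
                         : "memory");
        /* copy the remaining 0-7 bytes */
        memcpy(d + n - tail, s + n - tail, tail);
    } else {                             /* 3: large: libcall */
        memcpy(d, s, n);
    }
    return dst;
}
```

With size profile data, the compiler can specialize toward whichever branch dominates; without it, this kind of guarded expansion is exactly the "size checks" idea raised below.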
> >> It is an interesting idea to consider PLT overhead here, but is there a
> >> better way to model the cost?
> >> I worry that without profile data, blindly using rep movsb may be
> >> problematic. Teresa has a pending patch to make use of value profile
> >> information. Without profile data, if size matters, perhaps we can guard the
> >> expansion sequence with size checks.
> >> Also if the root cause
> >> https://reviews.llvm.org/D35750
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits