[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.
Serge Pavlov via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 25 06:40:14 PDT 2017
According to the Intel® 64 and IA-32 Architectures Optimization Reference
Manual:
On Intel microarchitecture code name Ivy Bridge, a REP MOVSB implementation
of memcpy can achieve throughput slightly better than the 128-bit SIMD
implementation when copying thousands of bytes.
So this implementation should be profitable.
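For concreteness, a REP MOVSB based memcpy of the kind the manual describes might look roughly like this (a hypothetical sketch using GCC/Clang inline asm, x86-64 only; the function name is illustrative, not from the patch):

```c
#include <stddef.h>

/* Hypothetical sketch: memcpy built on REP MOVSB (x86-64, GCC/Clang
 * inline asm). RDI = destination, RSI = source, RCX = byte count. */
static void *repmovsb_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}
```

Whether this beats a SIMD loop or a libcall depends heavily on the microarchitecture and on the size distribution, which is exactly what the benchmarks below are probing.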
2017-07-25 2:59 GMT+07:00 Rafael Avila de Espindola via llvm-commits <
llvm-commits at lists.llvm.org>:
> A crazy idea: can we make the calls cheaper?
> GCC has a -fno-plt. Could we have that and default to it for these
> Chandler Carruth via llvm-commits <llvm-commits at lists.llvm.org> writes:
> > Benchmarks so far aren't looking good.
> > It's frustrating because as you say w/o profile information calling the
> > library functions is a pretty good bet. Unfortunately, because we
> > aggressively canonicalize loops into memcpy (and memset), we have a problem
> > when the loop was actually the optimal emitted code.
> > I'm going to play with some options and see if there is a code sequence
> > which is small enough to inline and has better performance tradeoffs than
> > either the libcall or rep+movs pattern.
> > FWIW, some basic facts that my benchmark has already uncovered:
> > - Even as late as sandybridge, the performance of rep+movsb (emphasis on
> > *b* there) is pretty terrible. rep+movsq is more plausible
> > - Haswell sees rep+movsb slightly faster than rep+movsq, but not by much
> > - All of these are slower than a libcall on both sandybridge and haswell in
> > everything I've tried so far except long (over 4k) sequences.
> > - rep+movsq tends to be the fastest over 4k on both sandybridge and haswell
> > - the only thing I've tried so far that makes my particular collection of
> > benchmarks that are particularly impacted by this faster is an 8-byte
> > loop... I didn't expect this to be faster than rep+movsq, but there you
> > go.
> > - It's worth noting that for at least some benchmarks, this is
> > significant. The one I'm working on has a perf hit between 5% and 50%,
> > depending on dataset, for an 8-byte loop vs. the memset libcall.
> > Still lots more measurements to do before any definite conclusions. I
> > remain somewhat concerned about injecting PLT-based libcalls into so many
> > places. LLVM is generating a *lot* of these.
> > On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <
> > reviews at reviews.llvm.org> wrote:
> >> davidxl added a comment.
> >> Do you have more benchmark numbers? For reference, here is what GCC does
> >> (for sandybridge and above) for memcpy when size profile data is available:
> >> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
> >> 2. when the size is between 24 and 128, use rep movsq.
> >> 3. when the size is above that, use a libcall.
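The three-way size dispatch described above can be sketched roughly as follows (a hypothetical illustration of the GCC-style strategy, using the 24 and 128 thresholds from the message; the function name and the inline-asm rep movsq path are illustrative, x86-64 only):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a size-dispatched memcpy:
 * small -> simple byte loop, medium -> rep movsq plus a byte tail,
 * large -> the real libcall. */
static void *sized_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n <= 24) {                       /* 1: small: copy loop */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (n <= 128) {               /* 2: medium: rep movsq */
        size_t words = n / 8, tail = n % 8;
        void *dq = d;
        const void *sq = s;
        __asm__ volatile("rep movsq"
                         : "+D"(dq), "+S"(sq), "+c"(words)
                         :
                         : "memory");
        /* copy the remaining 0-7 bytes */
        memcpy(d + n - tail, s + n - tail, tail);
    } else {                             /* 3: large: libcall */
        memcpy(d, s, n);
    }
    return dst;
}
```

With size profile data, the compiler can specialize toward whichever branch dominates; without it, this kind of guarded expansion is exactly the "size checks" idea raised below.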
> >> It is an interesting idea to consider PLT overhead here, but is there a
> >> better way to model the cost?
> >> I worry that without profile data, blindly using rep movsb may be
> >> problematic. Teresa has a pending patch to make use of value profile
> >> information. Without profile data, if size matters, perhaps we can guard the
> >> expansion sequence with size checks.
> >> Also if the root cause
> >> https://reviews.llvm.org/D35750
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits