[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.

Chandler Carruth via llvm-commits llvm-commits at lists.llvm.org
Sat Jul 22 00:43:55 PDT 2017


Benchmarks so far aren't looking good.

It's frustrating because, as you say, without profile information calling the
library functions is a pretty good bet. Unfortunately, because we
aggressively canonicalize loops into memcpy (and memset), we have a problem
when the original loop was actually the optimal code to emit.

I'm going to play with some options and see if there is a code sequence
which is small enough to inline and has better performance tradeoffs than
either the libcall or the rep+movs pattern.
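
For concreteness, the rep+movs pattern I'm talking about is essentially the
following. This is just an illustrative C sketch using inline asm: it assumes
the size is a multiple of 8 and glosses over the remainder-byte handling the
real lowering would need.

#include <stddef.h>

/* Copy n bytes (assumed to be a multiple of 8) with rep movsq.
 * Relies on the ABI guarantee that the direction flag is clear. */
static void copy_rep_movsq(void *dst, const void *src, size_t n) {
  size_t qwords = n / 8;
  __asm__ volatile("rep movsq"
                   : "+D"(dst), "+S"(src), "+c"(qwords)
                   :
                   : "memory");
}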


FWIW, some basic facts that my benchmark has already uncovered:
- Even as late as sandybridge, the performance of rep+movsb (emphasis on
*b* there) is pretty terrible. rep+movsq is more plausible.
- Haswell sees rep+movsb slightly faster than rep+movsq, but not by much.
- All of these are slower than a libcall on both the sandybridge and haswell
parts I've tried so far, for everything but long (over 4k) sequences.
- rep+movsq tends to be the fastest over 4k on both sandybridge and haswell
- The only thing I've tried so far that makes my particular collection of
benchmarks (the ones particularly impacted by this) faster is an 8-byte
loop (roughly the shape sketched below)... I didn't expect this to be
faster than rep+movsq, but there you go.
  - It's worth noting that for at least some benchmarks this is
significant: the one I'm working on has a perf hit of between 5% and 50%,
depending on the dataset, for an 8-byte loop vs. the memset libcall.
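
The 8-byte loop I mean is roughly the following shape (again just an
illustrative sketch; it ignores alignment and the sub-8-byte tail that a real
expansion would have to handle):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store `value` into the first n bytes of dst, 8 bytes at a time. */
static void set_qword_loop(void *dst, unsigned char value, size_t n) {
  uint64_t pattern = 0x0101010101010101ULL * value; /* splat the byte */
  unsigned char *p = (unsigned char *)dst;
  for (size_t i = 0; i + 8 <= n; i += 8)
    memcpy(p + i, &pattern, 8); /* the compiler turns this into one 8-byte store */
}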

Still lots more measurements to do before any definite conclusions. I
remain somewhat concerned about injecting PLT-based libcalls into so many
places. LLVM is generating a *lot* of these.

On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <
reviews at reviews.llvm.org> wrote:

> davidxl added a comment.
>
> Do you have more benchmark numbers? For reference, here is what GCC does
> (for sandybridge and above) for memcpy when size profile data is available:
>
> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
> 2. when the size is between 24 and 128, use rep movsq.
> 3. when the size is above that, use a libcall.
>
> It is an interesting idea to consider PLT overhead here, but is there a
> better way to model the cost?
>
> I worry that without profile data, blindly using rep movsb may be
> problematic. Teresa has a pending patch to make use of value profile
> information. Without a profile, if size matters, perhaps we can guard the
> expansion sequence with size checks.
>
> Also if the root cause
>
>
> https://reviews.llvm.org/D35750
>
>
>
>