[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.

Rafael Avila de Espindola via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 24 12:59:40 PDT 2017


A crazy idea: can we make the calls cheaper?

GCC has a -fno-plt. Could we have that and default to it for these
functions?
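[Editor's note: a minimal sketch of the kind of call site this affects. The function name is hypothetical; in position-independent code, building with `gcc -O2 -fPIC -fno-plt` makes the compiler load the callee's address directly from the GOT instead of jumping through a lazily-bound PLT stub, removing one indirect jump per call.]

```c
#include <string.h>

/* An out-of-line memcpy libcall of the kind LLVM emits after loop
 * idiom recognition. With -fno-plt, the call goes through the GOT
 * entry directly rather than a PLT stub. */
void copy_greeting(char *dst) {
    memcpy(dst, "hello", 6); /* includes the NUL terminator */
}
```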

Cheers,
Rafael

Chandler Carruth via llvm-commits <llvm-commits at lists.llvm.org> writes:

> Benchmarks so far aren't looking good.
>
> It's frustrating because, as you say, without profile information calling the
> library functions is a pretty good bet. Unfortunately, because we
> aggressively canonicalize loops into memcpy (and memset), we have a problem
> when the loop was actually the optimal emitted code.
>
> I'm going to play with some options and see if there is a code sequence
> which is small enough to inline and has better performance tradeoffs than
> either the libcall or rep+movs pattern.
>
>
> FWIW, some basic facts that my benchmark has already uncovered:
> - Even as late as sandybridge, the performance of rep+movsb (emphasis on
> *b* there) is pretty terrible. rep+movsq is more plausible
> - Haswell sees rep+movsb slightly faster than rep+movsq, but not by much
> - All of these are slower than a libcall on both the sandybridge and haswell
> parts I've tried so far, for everything but long (over 4k) sequences.
> - rep+movsq tends to be the fastest over 4k on both sandybridge and haswell
> - the only thing I've tried so far that makes my particular collection of
> benchmarks that are particularly impacted by this faster is an 8-byte
> loop... I didn't expect this to be faster than rep+movsq, but there you
> go....
>   - it's worth noting that for at least some benchmarks, this is
> significant. the one I'm working on has perf hit between 5% and 50%
> depending on dataset for an 8-byte loop vs. memset libcall.
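[Editor's note: to make the "8-byte loop" concrete, here is a hedged C sketch of the kind of word-at-a-time copy loop being benchmarked above. It is an illustration only, not the code LLVM actually emits; the `memcpy` of a `uint64_t` avoids alignment UB and compiles down to a plain 8-byte load/store.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes using one 64-bit word per iteration, with a
 * bytewise tail for sizes that are not a multiple of 8. */
static void copy8loop(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t tmp;
    while (n >= 8) {
        memcpy(&tmp, s, 8);  /* compiles to a single 8-byte load */
        memcpy(d, &tmp, 8);  /* ...and a single 8-byte store */
        d += 8; s += 8; n -= 8;
    }
    while (n--)              /* bytewise tail */
        *d++ = *s++;
}
```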
>
> Still lots more measurements to do before any definite conclusions. I
> remain somewhat concerned about injecting PLT-based libcalls into so many
> places. LLVM is generating a *lot* of these.
>
> On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <
> reviews at reviews.llvm.org> wrote:
>
>> davidxl added a comment.
>>
>> Do you have more benchmark numbers? For reference, here is what GCC does
>> (for sandybridge and above) for memcpy when size profile data is available:
>>
>> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
>> 2. when the size is between 24 and 128, use rep movsq.
>> 3. when the size is above that, use a libcall.
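[Editor's note: a portable sketch of the size-based dispatch described above, using the 24/128 thresholds from this message. `rep movsq` is x86-64-specific inline assembly, so a plain byte loop stands in for it here; the point is the three-way structure, not the per-tier codegen.]

```c
#include <stddef.h>
#include <string.h>

/* Three-tier memcpy expansion in the style described for GCC:
 * small sizes inline, medium sizes would use rep movsq, large
 * sizes fall back to the library call. */
void dispatch_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n <= 24) {
        /* 1. small: straight-line / short copy loop */
        while (n--) *d++ = *s++;
    } else if (n <= 128) {
        /* 2. medium: GCC would emit rep movsq here; a byte loop
         * stands in to keep this sketch portable. */
        while (n--) *d++ = *s++;
    } else {
        /* 3. large: the libcall wins at this size */
        memcpy(dst, src, n);
    }
}
```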
>>
>> It is an interesting idea to consider PLT overhead here, but is there a
>> better way to model the cost?
>>
>> I worry that without profile data, blindly using rep movsb may be
>> problematic. Teresa has a pending patch to make use of value profile
>> information. Without profile data, if size matters, perhaps we can guard
>> the expansion sequence with size checks.
>>
>> Also if the root cause
>>
>>
>> https://reviews.llvm.org/D35750
>>
>>
>>
>>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits

