[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.

Xinliang David Li via llvm-commits llvm-commits at lists.llvm.org
Sat Jul 22 11:24:57 PDT 2017


On Sat, Jul 22, 2017 at 12:43 AM, Chandler Carruth <chandlerc at gmail.com>
wrote:

> Benchmarks so far aren't looking good.
>
> It's frustrating because, as you say, w/o profile information calling the
> library functions is a pretty good bet. Unfortunately, because we
> aggressively canonicalize loops into memcpy (and memset), we have a problem
> when the loop was actually the optimal code to emit.
>
> I'm going to play with some options and see if there is a code sequence
> which is small enough to inline and has better performance tradeoffs than
> either the libcall or rep+movs pattern.
>
>
> FWIW, some basic facts that my benchmark has already uncovered:
> - Even as late as sandybridge, the performance of rep+movsb (emphasis on
> *b* there) is pretty terrible. rep+movsq is more plausible
> - Haswell sees rep+movsb slightly faster than rep+movsq, but not by much
> - All of these are slower than a libcall, on both sandybridge and haswell,
> in everything I've tried so far except long (over 4k) sequences.
> - rep+movsq tends to be the fastest over 4k on both sandybridge and haswell
> - the only thing I've tried so far that speeds up my particular collection
> of benchmarks (the ones most impacted by this) is an 8-byte loop (sketched
> below)... I didn't expect this to be faster than rep+movsq, but there you
> go....
>   - it's worth noting that for at least some benchmarks, this is
> significant: the one I'm working on has a perf hit between 5% and 50%,
> depending on dataset, for an 8-byte loop vs. memset libcall.
>
> Still lots more measurements to do before any definite conclusions. I
> remain somewhat concerned about injecting PLT-based libcalls into so many
> places. LLVM is generating a *lot* of these.
>

Tuning string-op lowering without a profile can be tricky. For now, I suggest
we tame the optimizations that synthesize memcpy/memset:
1) use the average trip count from profile data when it is available;
2) without profile data, guard the transformation with a trip-count check
(sketched below);
3) accounting for the PLT overhead there is also reasonable.
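A minimal sketch of the trip-count guard in (2), with an illustrative
threshold that would need per-target tuning:

  #include <cstddef>
  #include <cstring>

  void clear(char *p, size_t n) {
    const size_t kLibcallThreshold = 128; // hypothetical cutoff
    if (n >= kLibcallThreshold) {
      std::memset(p, 0, n); // long trips amortize the PLT/call overhead
    } else {
      for (size_t i = 0; i < n; ++i) // short trips: keep the original loop
        p[i] = 0;
    }
  }

That is, the transformation keeps both forms and branches on the runtime
trip count instead of unconditionally replacing the loop with a libcall.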

David



>
> On Sat, Jul 22, 2017 at 12:04 AM David Li via Phabricator <
> reviews at reviews.llvm.org> wrote:
>
>> davidxl added a comment.
>>
>> Do you have more benchmark numbers? For reference, here is what GCC does
>> (for sandybridge and above) for memcpy when size profile data is
>> available (sketched below):
>>
>> 1. when the size is <= 24, use an 8-byte copy loop or straight-line code;
>> 2. when the size is between 24 and 128, use rep movsq;
>> 3. when the size is above that, use a libcall.
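>>
>> A rough sketch of that dispatch (thresholds as above; illustrative code,
>> not GCC's actual expansion):
>>
>>   #include <cstddef>
>>   #include <cstring>
>>
>>   // Hypothetical stand-in for the small-size straight-line/loop code.
>>   static void copy_small_inline(char *dst, const char *src, size_t n) {
>>     for (size_t i = 0; i < n; ++i)
>>       dst[i] = src[i];
>>   }
>>
>>   void copy(char *dst, const char *src, size_t n) {
>>     if (n <= 24) {
>>       copy_small_inline(dst, src, n);
>>     } else if (n <= 128) {
>>       size_t qwords = n / 8;
>>       // x86-64 GCC/Clang inline asm: rep movsq copies RCX qwords from
>>       // [RSI] to [RDI], advancing all three registers.
>>       asm volatile("rep movsq"
>>                    : "+D"(dst), "+S"(src), "+c"(qwords)
>>                    :
>>                    : "memory");
>>       for (size_t i = 0; i < n % 8; ++i) // sub-qword tail
>>         dst[i] = src[i];
>>     } else {
>>       std::memcpy(dst, src, n); // libcall
>>     }
>>   }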
>>
>> It is an interesting idea to consider PLT overhead here, but is there a
>> better way to model the cost?
>>
>> I worry that without profile data, blindly using rep movsb may be
>> problematic. Teresa has a pending patch to make use of value profile
>> information. Without profile data, if size matters, perhaps we can guard
>> the expansion sequence with size checks.
>>
>> Also if the root cause
>>
>>
>> https://reviews.llvm.org/D35750
>>
>>
>>
>>