[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.
David Li via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Jul 22 00:04:09 PDT 2017
davidxl added a comment.
Do you have more benchmark numbers? For reference, here is GCC does (for sandybridge and above) for mempcy when size profile data is available:
1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
2. when the size is between 24 and 128, use rep movsq.
3. when the size is above that, use a libcall.
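The tiered strategy above can be sketched roughly as follows. This is an illustrative sketch only, not GCC's actual expansion: the thresholds (24 and 128) come from the comment, the function names are made up, and the rep movsq inline asm stands in for what the compiler would emit directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tier 1 (size <= 24): straight-line 8-byte copy loop. */
static void copy_small(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 8) {               /* 8-byte chunks */
        uint64_t w;
        memcpy(&w, s, 8);          /* memcpy avoids strict-aliasing issues */
        memcpy(d, &w, 8);
        d += 8; s += 8; n -= 8;
    }
    while (n--) *d++ = *s++;       /* byte tail */
}

/* Tier 2 (24 < size <= 128): rep movsq, falling back to memcpy
   where the inline asm is unavailable. */
static void copy_rep_movsq(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    size_t q = n / 8;
    /* SysV ABI guarantees DF=0, so no cld needed. RDI/RSI/RCX are
       advanced by the instruction, leaving dst/src past the copied qwords. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(q)
                     :
                     : "memory");
    memcpy(dst, src, n % 8);       /* remaining tail bytes */
#else
    memcpy(dst, src, n);
#endif
}

/* Dispatch on size, mirroring the three tiers described above. */
void tiered_memcpy(void *dst, const void *src, size_t n) {
    if (n <= 24)
        copy_small(dst, src, n);
    else if (n <= 128)
        copy_rep_movsq(dst, src, n);
    else
        memcpy(dst, src, n);       /* tier 3: libcall */
}
```

In the compiler itself the dispatch would of course be resolved at expansion time from the profiled size, not branched on at runtime; the runtime branch here is only to make the three tiers visible in one function.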
It is an interesting idea to account for PLT overhead here, but is there a better way to model that cost?
I worry that blindly using rep movsb without profile data may be problematic. Teresa has a pending patch to make use of value profile information. Without a profile, if size matters, perhaps we can guard the expansion sequence with size checks.
Also if the root cause
https://reviews.llvm.org/D35750