[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.

David Li via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Jul 22 00:04:09 PDT 2017


davidxl added a comment.

Do you have more benchmark numbers? For reference, here is what GCC does (for Sandy Bridge and above) for memcpy when size profile data is available:

1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
2. when the size is between 24 and 128, use rep movsq.
3. when the size is above that, use a libcall.
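The three size buckets above can be sketched as a size-guarded dispatch. This is only an illustrative C model of the GCC heuristic quoted in the comment, not actual GCC or LLVM expansion code; the function name and the structure of the inline asm are my own, and the 24/128 thresholds come from the list above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the size-bucketed memcpy expansion described
 * above. Thresholds (24, 128) are from the quoted GCC heuristic. */
static void *dispatch_memcpy(void *dst, const void *src, size_t n) {
    if (n <= 24) {
        /* Small: straight-line 8-byte copies plus a byte tail. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n >= 8) {
            uint64_t tmp;
            memcpy(&tmp, s, 8);   /* unaligned-safe 8-byte load  */
            memcpy(d, &tmp, 8);   /* unaligned-safe 8-byte store */
            d += 8; s += 8; n -= 8;
        }
        while (n--) *d++ = *s++;
        return dst;
    } else if (n <= 128) {
#if defined(__x86_64__)
        /* Medium: rep movsq moves the 8-byte chunks; RDI/RSI advance
         * past the copied region, so the tail copy uses the updated
         * pointers. */
        void *d = dst;
        const void *s = src;
        size_t qwords = n / 8;
        __asm__ volatile("rep movsq"
                         : "+D"(d), "+S"(s), "+c"(qwords)
                         : : "memory");
        memcpy(d, s, n % 8);
        return dst;
#else
        return memcpy(dst, src, n);
#endif
    } else {
        /* Large: defer to the library call (and its PLT cost). */
        return memcpy(dst, src, n);
    }
}
```

In a real backend the branch would be emitted by the SelectionDAG/MI expansion rather than written in C, and profile data (as discussed below) would pick the bucket without a runtime check when the size distribution is known.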

It is an interesting idea to consider PLT overhead here, but is there a better way to model the cost?

I worry that without profile data, blindly using rep movsb may be problematic. Teresa has a pending patch to make use of value profile information. Without profile data, if size matters, perhaps we can guard the expansion sequence with size checks.

Also if the root cause


https://reviews.llvm.org/D35750
