[PATCH] D35750: [x86] Teach the x86 backend about general fast rep+movs and rep+stos features of modern x86 CPUs, and use this feature to drastically reduce the number of places we actually emit memset and memcpy library calls.
David Li via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Jul 22 00:04:09 PDT 2017
davidxl added a comment.
Do you have more benchmark numbers? For reference, here is GCC does (for sandybridge and above) for mempcy when size profile data is available:
1. when the size is <= 24, use an 8-byte copy loop or straight-line code.
2. when the size is between 24 and 128, use rep movsq.
3. when the size is above that, use a libcall.
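The tiered strategy above can be sketched roughly as follows. This is an illustrative sketch only, not GCC's actual expansion: the thresholds (24 and 128) come from the comment, the function names are made up, and the rep movsq inline asm stands in for what the compiler would emit directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tier 1 (size <= 24): straight-line 8-byte copy loop. */
static void copy_small(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 8) {               /* 8-byte chunks */
        uint64_t w;
        memcpy(&w, s, 8);          /* memcpy avoids strict-aliasing issues */
        memcpy(d, &w, 8);
        d += 8; s += 8; n -= 8;
    }
    while (n--) *d++ = *s++;       /* byte tail */
}

/* Tier 2 (24 < size <= 128): rep movsq, falling back to memcpy
   where the inline asm is unavailable. */
static void copy_rep_movsq(void *dst, const void *src, size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    size_t q = n / 8;
    /* SysV ABI guarantees DF=0, so no cld needed. RDI/RSI/RCX are
       advanced by the instruction, leaving dst/src past the copied qwords. */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(q)
                     :
                     : "memory");
    memcpy(dst, src, n % 8);       /* remaining tail bytes */
#else
    memcpy(dst, src, n);
#endif
}

/* Dispatch on size, mirroring the three tiers described above. */
void tiered_memcpy(void *dst, const void *src, size_t n) {
    if (n <= 24)
        copy_small(dst, src, n);
    else if (n <= 128)
        copy_rep_movsq(dst, src, n);
    else
        memcpy(dst, src, n);       /* tier 3: libcall */
}
```

In the compiler itself the dispatch would of course be resolved at expansion time from the profiled size, not branched on at runtime; the runtime branch here is only to make the three tiers visible in one function.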
It is an interesting idea to account for PLT overhead here, but is there a better way to model that cost?
I worry that blindly using rep movsb without profile data may be problematic. Teresa has a pending patch to make use of value profile information. Without a profile, if size matters, perhaps we can guard the expansion sequence with size checks.
Also if the root cause
https://reviews.llvm.org/D35750