[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 6 03:31:48 PST 2018


pcordes added a comment.

In D55263#1320582 <https://reviews.llvm.org/D55263#1320582>, @JohnReagan wrote:

> One of my coworkers did an informal test last year and saw that newer Intel CPUs' optimization of the REP string instructions was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with).  Is that something that should be looked at?  (or has somebody done that examination already)


Only `rep movs` and `rep stos` (i.e. memcpy and memset) are fast on current Intel and AMD CPUs.

`repe cmpsb` (memcmp) and `repne scasb` (memchr) run at worse than 2 cycles and 1 cycle per compare, respectively, on mainstream Intel CPUs.  The microcode simply loops 1 byte at a time.  See Agner Fog's instruction tables (https://agner.org/optimize/).

AFAIK there's no plan to change this in future CPUs.
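To make the gap concrete, here is a minimal sketch (not the code this patch emits) of the kind of SSE2 loop a compiler or libc can use for memcmp: it checks 16 bytes per iteration with `pcmpeqb`/`pmovmskb`, versus the 1-byte-per-iteration microcode loop of `repe cmpsb`.  It assumes the length is a multiple of 16 and uses the GCC/Clang builtin `__builtin_ctz`; a real memcmp also has to handle the tail and sub-16-byte sizes.

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>

  /* Returns <0, 0, >0 like memcmp.  Assumes n is a multiple of 16. */
  static int memcmp16_sse2(const unsigned char *a, const unsigned char *b,
                           size_t n)
  {
      for (size_t i = 0; i < n; i += 16) {
          __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
          __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
          /* 16 byte-compares per iteration; eq has a 1 bit per matching byte */
          unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
          if (eq != 0xFFFFu) {
              size_t off = i + (size_t)__builtin_ctz(~eq); /* first mismatching byte */
              return (int)a[off] - (int)b[off];
          }
      }
      return 0;
  }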

`rep stos/movs` might become useful even for short copies with the expected short-rep feature (in Ice Lake, I think), but I haven't heard of any plan for optimized microcode for the compare functions with data-dependent stop conditions.

And yes, on CPUs with 256-bit or 512-bit internal data paths, `rep stos/movs` can take advantage of them and be faster than SSE2.  (Close to as fast as AVX or AVX512: a vector loop is often still best even on CPUs with the ERMSB feature.)  See https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
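For reference, a minimal sketch (assuming x86-64 and GNU inline asm) of how `rep movsb` is invoked directly; libc memcpy implementations use essentially this sequence when they decide the ERMSB path is the right strategy.  In real code you'd just call memcpy() and let the library pick.

  #include <stddef.h>

  /* Copy n bytes with `rep movsb` (x86-64, GNU inline asm).
   * RDI = destination, RSI = source, RCX = count; all three are
   * updated by the instruction, hence the "+" constraints. */
  static void copy_rep_movsb(void *dst, const void *src, size_t n)
  {
      asm volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(n)
                   :
                   : "memory");
  }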


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263




