[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Clement Courbet via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 6 03:58:45 PST 2018


courbet marked an inline comment as done.
courbet added a comment.

In D55263#1321181 <https://reviews.llvm.org/D55263#1321181>, @andreadb wrote:

> Instructions with a REP prefix incur a significant setup overhead, so they should definitely be avoided when the repeat count is small. Even on larger data sets (at least on AMD), a loop of vector operations would still provide better throughput than REP MOVS/CMPS.


Intel actually has the same issue. IIRC we only lower to repmovs for very large sizes, and only when alwaysinline is true.

> repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 or 1 cycles per compare, respectively, on mainstream Intel CPUs; the microcode simply loops 1 byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/). AFAIK there's no plan to change this in future CPUs.

Thanks for the info.
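
For context, the expansion that `AllowOverlappingLoads` enables covers a non-power-of-two size with two loads that overlap in the middle, instead of a chain of progressively smaller loads. A rough C++ sketch of the idea for a 7-byte equality comparison (the function name is made up; the pass emits the equivalent IR loads directly, and only for equality comparisons, where comparing the overlapped bytes twice is harmless):

```
#include <cstdint>
#include <cstring>

// memcmp(a, b, 7) == 0, expanded with two overlapping 4-byte loads
// per operand (bytes [0,4) and [3,7)) rather than a 4 + 2 + 1 chain.
static bool memeq7(const char *a, const char *b) {
  std::uint32_t a0, b0, a1, b1;
  std::memcpy(&a0, a, 4);      // bytes [0,4)
  std::memcpy(&b0, b, 4);
  std::memcpy(&a1, a + 3, 4);  // bytes [3,7), overlaps byte 3
  std::memcpy(&b1, b + 3, 4);
  return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```

This relies on unaligned loads being cheap, which is why the option is gated on the target.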



================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2902
     Options.LoadSizes.push_back(1);
+    // All GPR loads can be unaligned, and vector loads too starting form SSE2.
+    Options.AllowOverlappingLoads = true;
----------------
andreadb wrote:
> s/form/from.
> 
> Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
> However, it gets problematic when comparing vectors for equality; using CMPEQPS is not going to work as expected for the case where one of the operands is NaN.
> One of your tests shows that the expansion is effectively disabled if the target has SSE but not SSE2. However, as John wrote, I don't see where the check for the SSE2 feature is done...
If you look a few lines above, we only allow the 16-byte expansion to happen if `ST->hasSSE2()`.
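
To illustrate the NaN point above, a small intrinsics sketch (function names are made up) of why an SSE1 CMPEQPS-based equality check is not usable for memcmp expansion, while the SSE2 PCMPEQB one is:

```
#include <immintrin.h>

// SSE1: CMPEQPS is an *FP* compare, so two bit-identical NaN lanes
// compare as not-equal, which is wrong for a bytewise memcmp.
bool equal16_sse1(const float *a, const float *b) {
  __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
  return _mm_movemask_ps(eq) == 0xF;  // false if any lane holds a NaN
}

// SSE2: PCMPEQB compares raw bytes, so NaN payloads are no problem.
bool equal16_sse2(const void *a, const void *b) {
  __m128i x = _mm_loadu_si128(static_cast<const __m128i *>(a));
  __m128i y = _mm_loadu_si128(static_cast<const __m128i *>(b));
  return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}
```

Hence the 16-byte load size is only added under `ST->hasSSE2()`.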


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263




