[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.
Clement Courbet via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Dec 6 03:58:45 PST 2018
courbet marked an inline comment as done.
courbet added a comment.
In D55263#1321181 <https://reviews.llvm.org/D55263#1321181>, @andreadb wrote:
> Instructions with a REP prefix incur significant setup overhead, so they should definitely be avoided when the repeat count is small. Even on larger data sets (at least on AMD), a loop of vector operations still provides better throughput than REP MOVS/CMPQ.
Intel actually has the same issue. IIRC we only lower to rep movs for very large sizes, and only when alwaysinline is true.
> repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 and 1 cycles per compare, respectively, on mainstream Intel CPUs: the microcode simply loops one byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/). AFAIK there's no plan to change this in future CPUs.
Thanks for the info.
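For context, here is a minimal hand-written C++ sketch of what overlapping loads buy for an equality-only comparison (a hypothetical illustration, not the actual ExpandMemcmp output): memcmp(a, b, 7) == 0 can be expanded as two overlapping 4-byte loads instead of a 4-, 2- and 1-byte load sequence, with no rep cmpsb microcode loop.

#include <cstdint>
#include <cstring>

// Hypothetical expansion of memcmp(a, b, 7) == 0 using overlapping loads.
static bool equal7(const void *a, const void *b) {
  uint32_t a0, b0, a1, b1;
  std::memcpy(&a0, a, 4);                                // bytes [0,4)
  std::memcpy(&b0, b, 4);
  std::memcpy(&a1, static_cast<const char *>(a) + 3, 4); // bytes [3,7)
  std::memcpy(&b1, static_cast<const char *>(b) + 3, 4);
  // Byte 3 is compared twice; that is harmless when only equality
  // (zero/non-zero) is needed, which is why the overlap is safe here.
  return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}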
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2902
Options.LoadSizes.push_back(1);
+ // All GPR loads can be unaligned, and vector loads too starting form SSE2.
+ Options.AllowOverlappingLoads = true;
----------------
andreadb wrote:
> s/form/from.
>
> Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
> However, it gets problematic when comparing vectors for equality: CMPEQPS does not behave as expected when one of the operands is NaN.
> One of your tests shows that the expansion is effectively disabled if the target has SSE but not SSE2. However, as John wrote, I don't see where the check for the SSE2 feature is done...
If you look a few lines above, we only add the 16-byte load size if `ST->hasSSE2()`.
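To make the NaN pitfall concrete, here is a small self-contained demo (a hypothetical example, not code from the patch): both compares below see bit-identical 16-byte buffers, but SSE1 CMPEQPS reports the NaN lanes unequal, while SSE2 PCMPEQB reports every byte equal, which is the behavior a memcmp expansion needs.

#include <cstdint>
#include <cstdio>
#include <immintrin.h>

int main() {
  // Four copies of a quiet-NaN bit pattern.
  alignas(16) const uint32_t nan_bits[4] = {0x7fc00000u, 0x7fc00000u,
                                            0x7fc00000u, 0x7fc00000u};
  const __m128 a = _mm_load_ps(reinterpret_cast<const float *>(nan_bits));

  // SSE1 FP compare: NaN != NaN, so bit-identical data compares "not equal".
  const int fp_mask = _mm_movemask_ps(_mm_cmpeq_ps(a, a));        // 0x0, not 0xf

  // SSE2 integer byte compare: bitwise, as memcmp requires.
  const __m128i ai = _mm_castps_si128(a);
  const int int_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(ai, ai)); // 0xffff

  std::printf("CMPEQPS mask: %#x, PCMPEQB mask: %#x\n", fp_mask, int_mask);
  return 0;
}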
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55263/new/
https://reviews.llvm.org/D55263